Statistics in Geophysics: Linear Regression

Steffen Unkel
Department of Statistics, Ludwig-Maximilians-University Munich, Germany
Historical remarks

Sir Francis Galton (1822-1911) was responsible for the introduction of the word "regression".

Galton, F. (1886): Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15, pp. 246-263.
Regression equation:

ŷ = ȳ + (2/3)(x − x̄),

where y denotes the height of the child and x is a weighted average of the mother's and father's heights.
Regression to the mean
Figure: Scatterplot of mid-parental height against child’s height, and regression line (dark red line).
Relationship between two variables
We can distinguish predictor variables and response variables.
Other names frequently seen are:
Predictor variable: input variable, X -variable, regressor, covariate, independent variable.
Response variable: output variable, predictand, Y -variable, dependent variable.
We shall be interested in finding out how changes in the predictor variables affect the values of a response variable.
Relationship between two variables: Example
In simple (multiple) linear regression, one (two or more) predictor variable(s) is (are) assumed to affect the values of a response variable in a linear fashion.
For the model of simple linear regression, we assume

y = β0 + β1x + ε,

where ε is the random error term.

Inserting the data yields the n equations

yi = β0 + β1xi + εi,  i = 1, …, n,

with unknown regression coefficients β0 and β1.
1. Linearity: the response is a linear function of the covariates, that is, f is linear in the parameters.
2. Additivity of errors.
3. The error terms εi (i = 1, …, n) are random variables with E(εi) = 0 and constant (unknown) variance σ², that is, homoscedastic errors with Var(εi) = σ².
4. Uncorrelated errors: Cov(εi, εj) = 0 for i ≠ j.
5. Gaussian errors: εi ∼ N(0, σ²).
Least squares (LS) fitting

The estimated values β̂0 and β̂1 are determined as minimizers of the sum of squared deviations

Q(β0, β1) = Σ (yi − β0 − β1xi)²,  with the sum running over i = 1, …, n.

The minimizers are

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  β̂0 = ȳ − β̂1x̄.
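As a numerical sketch of these closed-form estimators, using small made-up data and assuming numpy is available:

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations.
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# Intercept: line passes through the point of means (xbar, ybar).
beta0_hat = ybar - beta1_hat * xbar
```

The fitted line always passes through (x̄, ȳ), which is exactly what the intercept formula encodes.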
Least squares (LS) fitting II

An estimate for the error variance σ², called the residual variance, is

σ̂² = (1/(n − 2)) Σ (yi − ŷi)²,  where ŷi = β̂0 + β̂1xi.
Figure: Minimum temperature (°F) observations at Ithaca and Canandaigua, New York, for January 1987, with fitted least squares line.
How well do the data fit the regression line?

Consider the identity

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²,  that is, SST = SSR + SSE.
Coefficient of determination

Some of the variation in the data (SST) can be ascribed to the regression line (SSR) and some to the fact that the actual observations do not all lie on the regression line (SSE).

A useful statistic to check is the R² value (coefficient of determination):

R² = SSR/SST = 1 − SSE/SST,

for which it holds that 0 ≤ R² ≤ 1 and which is often expressed as a percentage by multiplying by 100.

The square root of R² is the absolute value of the Pearson correlation between x and y.
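The decomposition and R² can be checked numerically; a sketch with made-up data, assuming numpy is available:

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least squares fit.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)      # error (residual) sum of squares

r2 = 1 - sse / sst                  # coefficient of determination
```

The assertions SST = SSR + SSE and √R² = |corr(x, y)| hold exactly (up to floating-point error) for any data.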
ANOVA table for simple linear regression

Source       Degrees of freedom   Sum of squares   Mean square         F-value
Regression   1                    SSR              MSR = SSR/1         MSR/MSE
Error        n − 2                SSE              MSE = SSE/(n − 2)
Total        n − 1                SST
F-test for significance of regression

Suppose that the errors εi are independent N(0, σ²) variables. Then it can be shown that if β1 = 0, the ratio

F = MSR/MSE

follows an F(1, n − 2)-distribution.

Statistical test: H0: β1 = 0 versus H1: β1 ≠ 0.

We compare the F-value with the 100(1 − α)% point of the tabulated F(1, n − 2)-distribution in order to determine whether H0 can be rejected given the data we have seen.
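A minimal sketch of the F-test, with made-up data and assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Least squares fit and fitted values.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

msr = ssr / 1          # mean square for regression (1 degree of freedom)
mse = sse / (n - 2)    # mean squared error = residual variance estimate

f_value = msr / mse
p_value = stats.f.sf(f_value, 1, n - 2)   # P(F(1, n-2) > f_value)
```

For simple linear regression this F-test is equivalent to the two-sided t-test for β1 (F = t²), so its p-value matches the one returned by `scipy.stats.linregress`.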
Confidence intervals

(1 − α) × 100% confidence intervals for β0 and β1:

[β̂j ± σ̂β̂j × t1−α/2(n − 2)],  j = 0, 1,

where σ̂β̂j denotes the estimated standard error of β̂j and t1−α/2(n − 2) is the (1 − α/2)-quantile of the t(n − 2)-distribution.

For sufficiently large n: replace quantiles of the t(n − 2)-distribution by quantiles of the N(0, 1)-distribution.
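A sketch of 95% confidence intervals for the coefficients, with made-up data and assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n, alpha = len(x), 0.05

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)      # residual variance

# Estimated standard errors of the coefficients.
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))

t_crit = stats.t.ppf(1 - alpha / 2, n - 2)     # t-quantile, n-2 df

ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)
```

The standard errors agree with `stderr` and `intercept_stderr` from `scipy.stats.linregress`.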
Hypothesis tests

Example: two-sided test for β1:

H0: β1 = 0 versus H1: β1 ≠ 0.

Observed test statistic:

t = (β̂1 − 0)/σ̂β̂1 = β̂1/σ̂β̂1.
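The test statistic and its two-sided p-value can be computed directly; a sketch with made-up data, assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

t_stat = beta1_hat / np.sqrt(sigma2_hat / sxx)   # beta1-hat / se(beta1-hat)
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)     # two-sided p-value, n-2 df
```

Under H0 the statistic follows a t(n − 2)-distribution, so H0 is rejected at level α when |t| exceeds t1−α/2(n − 2).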
Prediction intervals

A prediction interval for a future observation y0 at a location x0 with level (1 − α) is given by

[ŷ0 ± t1−α/2(n − 2) × σ̂ × √(1 + 1/n + (x0 − x̄)²/Σ(xi − x̄)²)],

where ŷ0 = β̂0 + β̂1x0 and the sum runs over i = 1, …, n.
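A sketch of this prediction interval for a single new location x0, with made-up data and assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data and a made-up prediction location x0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n, alpha, x0 = len(x), 0.05, 3.5

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

y0_hat = beta0_hat + beta1_hat * x0
# Prediction standard error: the leading 1 accounts for the new error term.
se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```

Note that the interval widens as x0 moves away from x̄, and it is always wider than the confidence interval for the mean response because of the extra "1 +" term.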
Prediction intervals: Example
Residuals versus fitted values
Durbin-Watson test

The Durbin-Watson statistic compares successive residuals ei:

d = Σi=2..n (ei − ei−1)² / Σi=1..n ei².

If successive residuals are positively (negatively) serially correlated, d will be near 0 (near 4).

The distribution of d is symmetric around 2.

The critical values for Durbin-Watson tests vary depending on the sample size and the number of predictor variables.
Durbin-Watson test II

Compare d (or 4 − d, whichever is closer to zero) with the tabulated critical values dL and dU.

If d < dL, conclude that positive serial correlation is a possibility; if d > dU, conclude that no serial correlation is indicated.

If 4 − d < dL, conclude that negative serial correlation is a possibility; if 4 − d > dU, conclude that no serial correlation is indicated.

If the d (or 4 − d) value lies between dL and dU, the test is inconclusive.
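The statistic itself is easy to compute from the residuals of a fit; a sketch with made-up data, assuming numpy is available:

```python
import numpy as np

# Made-up data with a roughly linear trend, for illustration only.
x = np.arange(1.0, 11.0)
y = np.array([1.2, 2.1, 2.8, 4.1, 4.9, 6.2, 6.8, 8.1, 9.2, 9.8])

# Least squares residuals.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
e = y - (beta0_hat + beta1_hat * x)

# Durbin-Watson: squared successive differences over squared residuals.
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

The resulting d lies between 0 and 4 by construction; whether it signals serial correlation must then be judged against the tabulated dL and dU for the given n and number of predictors.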
Durbin-Watson test: Example
Quantile-quantile plot

A quantile-quantile (Q-Q) plot compares the empirical quantiles of the residuals with the theoretical quantiles of a normal distribution; under normality the points fall close to a straight line.

If all the points lie on such a line, more or less, one would conclude that the residuals do not deny the assumption of normality of errors.
Quantile-quantile plot: Example

Figure: Gaussian Q-Q plot of the residuals obtained from the regression of the January 1987 temperature data.