
Chap 14: Simple linear regression and correlation


DOCUMENT INFORMATION

Basic information

Title: Simple linear regression and correlation
School: Standard University
Major: Statistics
Type: Essay
Year of publication: 2023
City: Standard City
Number of pages: 75
File size: 2.51 MB

Introduction

Our problem objective is to analyse the relationship between numerical variables; regression analysis is the first tool we will study. Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables). Dependent variable: denoted Y. Independent variables: denoted X1, X2, …, Xk.

Correlation Analysis… If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier. This chapter will examine the relationship between two variables, sometimes called simple linear regression. Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.


Chapter 14

Simple linear regression and correlation

Introduction

Our problem objective is to analyse the relationship between numerical variables; regression analysis is the first tool we will study.

Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).

Dependent variable: denoted Y

Independent variables: denoted X1, X2, …, Xk

Correlation Analysis…

If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier.

This chapter will examine the relationship between two variables, sometimes called simple linear regression.

Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.

Model Types…

Deterministic Model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.

Contrast this with…

Probabilistic Model: a method used to capture the randomness that is part of a real-life process.

E.g. do all houses of the same size (measured in square metres) sell for exactly the same price?

A Model…

To create a probabilistic model, we start with a deterministic model that approximates the relationship we want to model and add a random term that measures the error of the deterministic component.

A model of the relationship between house size (independent variable) and house price (dependent variable) would be:

[Figure: house price plotted against house size; annotation: "Most lots sell…"]

A Model…

In real life, however, the house cost will vary even among the same size of house: same house size, but different price points (e.g. décor options, cabinet upgrades, lot location…).

[Figure: house price against house size, with the 200K$ level marked on the price axis; annotation: lower vs higher variability]

House price = 200 000 + 800(Size) + ε

Random Term…

We now represent the price of a house as a function of its size in this probabilistic model:

y = 200 000 + 800x + ε

where ε (Greek letter epsilon) is the random term (a.k.a. error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the area of the house (i.e. x) remains the same, due to other factors such as the location, age, décor, etc. of the house.
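A minimal sketch of the probabilistic house-price model y = 200 000 + 800x + ε may make the role of the random term concrete. The function names, the normal shape of ε and the error standard deviation of 15 000 are assumptions chosen only for illustration; the slides give the deterministic part but no distribution for ε at this point.

```python
import random


def deterministic_price(size_m2):
    """Deterministic component from the slides: 200 000 + 800 * size."""
    return 200_000 + 800 * size_m2


def simulated_price(size_m2, rng, error_sd=15_000):
    """Probabilistic model: deterministic component plus a random term.

    error_sd is a hypothetical value used for illustration only.
    """
    epsilon = rng.gauss(0, error_sd)  # random term with mean zero
    return deterministic_price(size_m2) + epsilon


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
prices = [simulated_price(150, rng) for _ in range(5)]

# Five houses of the same size (150 square metres) get five different prices.
for price in prices:
    print(f"{price:,.0f}")
```

Every simulated house has the same size, so all of the spread in the printed prices comes from ε, which is the point the slide makes about location, age and décor.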


14.1 Simple Linear Regression Model

A straight line model with one independent variable is called a first order linear model or a simple linear regression model. It is written

y = β0 + β1x + ε

Simple Linear Regression Model…

[Figure: the line y = β0 + β1x, with β0 = y-intercept labelled]

14.2 Estimating the Coefficients

In much the same way we base estimates of µ on $\bar{x}$, we estimate β0 using $\hat{\beta}_0$ and β1 using $\hat{\beta}_1$, the y-intercept and slope (respectively) of the least squares or regression line given by:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

(Recall: this is an application of the least squares method and it produces a straight line that minimises the sum of the squared differences between the points and the line.)

Least Squares Method

The question is:

• Which straight line fits best?

• The least squares line minimises the sum of squared differences between the points and the line

Least Squares Method…

The best line is the one that minimises the sum of squared vertical differences between the points and the line.

[Figure: scatter plot of data points, including (1, 2), with two candidate lines labelled Line 1 and Line 2]
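The criterion can be sketched in a few lines: compute the sum of squared vertical differences for each candidate line and prefer the smaller one. The four data points and both candidate lines below are hypothetical, since the figure's actual values did not survive extraction.

```python
# Hypothetical data points (x, y); not the values from the original figure.
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]


def sum_squared_differences(intercept, slope, data):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data)


# Two hypothetical candidate lines, each given as (intercept, slope).
line_1 = (1.0, 0.65)
line_2 = (2.5, 0.0)  # a horizontal line at y = 2.5

sse_1 = sum_squared_differences(*line_1, points)
sse_2 = sum_squared_differences(*line_2, points)

better = "Line 1" if sse_1 < sse_2 else "Line 2"
print(f"Line 1: {sse_1:.3f}, Line 2: {sse_2:.3f}, better fit: {better}")
```

The least squares line is the one line, among all possible candidates, for which this sum is as small as it can be.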

To calculate the estimates of the coefficients that minimise the differences between the data points and the line, use the formulas:

$$\hat{\beta}_1 = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} \quad \text{or} \quad \hat{\beta}_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Least Squares Estimates…

$$SS_x = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = \sum x_i^2 - n\bar{x}^2$$

$$SS_y = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = \sum y_i^2 - n\bar{y}^2$$

$$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = \sum x_i y_i - n\bar{x}\bar{y}$$

Least Squares Estimates…

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The estimated simple linear regression equation that estimates the equation of the first-order linear model is:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
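As a rough sketch, the shortcut formulas (slope = SSxy / SSx, intercept = ȳ minus slope times x̄) translate directly into code. The function name and the five data points are hypothetical; the points lie exactly on y = 1 + 2x so the result is easy to check by eye.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line via SSxy / SSx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Shortcut formulas: SSxy = sum(x*y) - sum(x)*sum(y)/n, SSx = sum(x^2) - sum(x)^2/n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    slope = ss_xy / ss_x
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

b0, b1 = least_squares(xs, ys)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

With real data the points do not fall on a line, and the same two formulas return the line that minimises the sum of squared vertical differences.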

Example 14.3

A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected and the data are recorded in file XM21-03. Find the regression line.

[Table: columns Car, Odometer, Price; data rows not reproduced]

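To calculate $\hat{\beta}_0$ and $\hat{\beta}_1$ we need to calculate several statistics first:

$$\bar{x} = 36.01; \qquad \bar{y} = 16.24$$

$$SS_x = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 4307.378$$

$$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = -403.6207$$

where n = 100.

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_x} = \frac{-403.6207}{4307.378} = -0.0937$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 16.24 - (-0.0937)(36.01) = 19.611 \text{ (using unrounded values)}$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 19.611 - 0.094x$$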
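The arithmetic of this example can be reproduced from the summary statistics alone, as a quick check. The variable names are illustrative. Because the means shown on the slide are rounded to two decimals, the intercept computed here lands near 19.61 rather than exactly on the reported 19.611.

```python
# Summary statistics reported for the used-car example.
n = 100
x_bar = 36.01      # mean odometer reading
y_bar = 16.24      # mean selling price
ss_x = 4307.378
ss_xy = -403.6207

slope = ss_xy / ss_x                 # beta-hat 1
intercept = y_bar - slope * x_bar    # beta-hat 0

print(f"slope     = {slope:.4f}")
print(f"intercept = {intercept:.2f}")
```

The negative slope reflects that price falls as the odometer reading rises.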

Data > Data Analysis > Regression > [Highlight the data y range and x range] > OK

$$\hat{y} = 19.611 - 0.094x$$

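[Figure: scatter plot of price against odometer reading with the fitted line; price axis ticks from 10 to 20]

The slope of the line is −0.094. Odometer readings are recorded in thousands of kilometres and prices in thousands of dollars, so for each additional kilometre on the odometer the price decreases by an average of $0.094.

Do not interpret the intercept as the ‘price of cars that have not been driven’.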

Assumptions about the error variable ε:

– The mean of ε is zero: E(ε) = 0.
– The standard deviation of ε is a constant ($\sigma_\varepsilon$) for all values of x.
– The errors are independent.
– The errors are independent of the independent variable x.
– The probability distribution of ε is normal.

From the zero-mean, constant-standard-deviation and normality assumptions we have: for each value of x, y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation $\sigma_\varepsilon$.

The standard deviation remains constant, but the mean value changes with x.

14.4 Assessing the Model

• The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model:
– testing and/or estimating the regression model coefficients
– using descriptive measurements such as the sum of squares for errors (SSE).

Sum of Squares for Errors (SSE)

– This is the sum of squared differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data.
– The sum of squares for errors is calculated as

$$SSE = \sum (y_i - \hat{y}_i)^2 \quad \text{OR} \quad SSE = SS_y - \frac{SS_{xy}^2}{SS_x}$$

– This statistic plays a role in every statistical technique we employ to assess the model.

Calculate the standard error of estimate for Example 14.3 and describe what it tells you about the model fit.

Solution

$$SS_y = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 57.893$$

$$SSE = SS_y - \frac{SS_{xy}^2}{SS_x} = 57.893 - \frac{(-403.6207)^2}{4307.378} = 20.072$$

Thus,

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{20.072}{98}} = 0.4526$$
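A short sketch of this calculation, assuming SSE = SSy minus SSxy squared over SSx, and a standard error of estimate equal to the square root of SSE / (n − 2). The value SSy = 57.893 is read from the partly garbled slide and should be treated as an assumption; the variable names are illustrative.

```python
import math

n = 100
ss_x = 4307.378
ss_y = 57.893      # assumed reading of the garbled slide value
ss_xy = -403.6207
y_bar = 16.24

sse = ss_y - ss_xy ** 2 / ss_x             # sum of squares for errors
s_epsilon = math.sqrt(sse / (n - 2))       # standard error of estimate
relative_size = s_epsilon / y_bar          # compared with the sample mean of y

print(f"SSE       = {sse:.3f}")
print(f"s_epsilon = {s_epsilon:.4f}")
print(f"s_epsilon / y_bar = {relative_size:.1%}")
```

The last line expresses the standard error as a share of the mean price, which is how the slides judge whether it is small.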

Example 14.4 Solution…

To judge the size of $s_\varepsilon$, compare it to the sample mean value of y, $\bar{y} = 16.24$.

• In this example, $s_\varepsilon$ is only 2.8% relative to the sample mean of y. Therefore, we can conclude that the standard error of estimate is reasonably small.
• $s_\varepsilon$ cannot be used alone as an absolute measure of the model’s utility. But it can be used to compare models.

Testing the Slope

• When no linear relationship exists between two variables, the regression line should be horizontal (β1 = 0).

• We can draw inferences about the slope coefficient β1 from $\hat{\beta}_1$ by testing
H0: β1 = 0 (no linear relationship)
HA: β1 ≠ 0 (a linear relationship exists between Y and X)
– The test statistic is

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} \quad \text{where} \quad s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{SS_x}}$$

– If the error variable is normally distributed, the statistic has a Student t-distribution with d.f. = n – 2.
– The rejection region depends on whether we are performing a one- or two-tail test.

If we wish to test for positive or negative linear relationships we conduct one-tail tests, i.e. our alternative hypotheses become:

HA: β1 < 0 (testing for a negative slope)

or

HA: β1 > 0 (testing for a positive slope)

Of course, the null hypothesis remains: H0: β1 = 0

Example 14.4 (Example 14.3 continued)

Test to determine whether there is enough evidence to infer that a linear relationship exists between the price and the odometer reading at the 5% significance level.

Solution (Solving by hand)

H0: β1 = 0 (no linear relationship)
HA: β1 ≠ 0 (a linear relationship exists)

– If the null hypothesis is rejected, we conclude that there is a significant linear relationship between price and odometer reading.
– The test statistic t has a t-distribution with 98 (= 100 – 2) degrees of freedom.
– Level of significance α = 0.05.

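To compute t we need the values of $\hat{\beta}_1$ and $s_{\hat{\beta}_1}$:

$$\hat{\beta}_1 = -0.0937$$

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{SS_x}} = \frac{0.4526}{\sqrt{4307.378}} = 0.0069$$

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} = \frac{-0.0937 - 0}{0.0069} = -13.59$$

• Decision rule: Reject H0 if |t| > t_{0.025,98} = 1.984
• Comparing the decision rule with the calculated t-value (= –13.59), we reject H0 and conclude that the odometer readings do affect the sale price.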
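The hand calculation of the slope test can be sketched as follows. The standard error of the slope is taken as the standard error of estimate divided by the square root of SSx, and the critical value 1.984 is copied from the slides rather than computed. The variable names are illustrative.

```python
import math

n = 100
slope = -0.0937        # beta-hat 1 from the example
s_epsilon = 0.4526     # standard error of estimate
ss_x = 4307.378
t_critical = 1.984     # t(0.025, 98), value quoted in the slides

s_slope = s_epsilon / math.sqrt(ss_x)   # standard error of the slope
t_stat = (slope - 0) / s_slope          # H0: beta1 = 0
reject_h0 = abs(t_stat) > t_critical    # two-tail decision rule

print(f"s_slope = {s_slope:.4f}")
print(f"t       = {t_stat:.2f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Because the inputs are rounded, the computed t differs from the slides' −13.59 only in the last decimal place.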

Example 14.4 Solution …

Using the computer

Excel regression output

              Coefficients    Standard Error   t Stat     P-value
Intercept     19.61139281     0.252410094      77.69655   7.53E-90
Odometer (x)  -0.093704502    0.006895663      -13.5889   2.84E-24

Looking at the p-value of the slope coefficient, there is overwhelming evidence to infer that the odometer reading affects the auction selling price.

Coefficient of Determination

• The tests thus far are used to conclude whether a linear (positive or negative) relationship exists.
• When we want to measure the strength of the linear relationship, we use the coefficient of determination, R², defined as follows:

$$R^2 = \frac{SS_{xy}^2}{SS_x \, SS_y} \quad \text{or} \quad R^2 = 1 - \frac{SSE}{SS_y}$$

• For a simple linear regression model, the coefficient of determination is the squared value of the coefficient of correlation (r), i.e. R² = r².

• To understand the significance of this coefficient, note:

Overall variability in y is explained in part by the regression model and remains, in part, unexplained (the error).

Coefficient of Determination…

Illustrated with two data points (x1, y1) and (x2, y2):

$$(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2$$

Total variation in y = variation explained by the regression line + unexplained variation (error)

Coefficient of Determination…

As we did with analysis of variance, we can partition the variation in y into two parts:

SSE: variation in y that remains unexplained (i.e. due to error)
SSR: amount of variation in y explained by variation in the independent variable x.

Coefficient of Determination…

• R² measures the proportion of the variation in y that is explained by the variation in x.

$$R^2 = 1 - \frac{SSE}{SS_y} = 1 - \frac{SSE}{SST} = \frac{SST - SSE}{SST} = \frac{SSR}{SST}$$

SST = variation in y = SSR + SSE

• R² takes on any value between zero and one.
• R² = 1: perfect match between the line and the data points.
• R² = 0: there is no linear relationship between x and y.

Coefficient of Determination…

• In general, the higher the value of R², the better the model fits the data.
• Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.

Example 14.7 (Example 14.3 continued)

Find the coefficient of determination for Example 14.3. What does this statistic tell you about the model?

Solution – Solving by hand

$$R^2 = 1 - \frac{SSE}{SS_y} = 1 - \frac{20.07}{57.89} = 0.6533$$

Therefore, 65% of the variation in the selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model, i.e. due to error.
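Both forms of the coefficient of determination, 1 − SSE / SSy and SSxy squared over (SSx times SSy), can be checked against each other with the example's summary statistics. The variable names are illustrative, and the two results agree only up to the rounding of the inputs.

```python
ss_x = 4307.378
ss_y = 57.89
ss_xy = -403.6207
sse = 20.07

# Form 1: share of the variation in y that is not left in the error term.
r2_from_sse = 1 - sse / ss_y

# Form 2: computed directly from the sums of squares.
r2_from_ss = ss_xy ** 2 / (ss_x * ss_y)

print(f"R^2 (1 - SSE/SSy)        = {r2_from_sse:.4f}")
print(f"R^2 (SSxy^2 / (SSx SSy)) = {r2_from_ss:.4f}")
print(f"unexplained share        = {1 - r2_from_sse:.0%}")
```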

Solution – Using the computer

From the regression output we have R² = 0.6533.

More on Excel’s Output

An analysis of variance (ANOVA) table for the simple linear regression model can be given by:

Source | Degrees of freedom | Sums of squares | Mean squares | F-statistic
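The rows of the table did not survive extraction, so the sketch below assumes the usual layout for simple linear regression: a regression row with 1 degree of freedom, an error row with n − 2, and F equal to the regression mean square divided by the error mean square. The variable names are illustrative, and the sums of squares are the rounded values from the used-car example.

```python
n = 100
ss_y = 57.89     # total variation in y (SST)
sse = 20.07      # unexplained variation
ssr = ss_y - sse # variation explained by the regression

df_regression = 1
df_error = n - 2
df_total = n - 1

msr = ssr / df_regression   # mean square for regression
mse = sse / df_error        # mean square for error
f_stat = msr / mse

print("Source      df   SS       MS        F")
print(f"Regression  {df_regression:<4d} {ssr:<8.2f} {msr:<9.4f} {f_stat:.2f}")
print(f"Error       {df_error:<4d} {sse:<8.2f} {mse:<9.4f}")
print(f"Total       {df_total:<4d} {ss_y:<8.2f}")
```

In simple linear regression the F-statistic equals the square of the slope's t-statistic, so the value here should be close to (−13.5889)² from the Excel output.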


14.5 Using the Regression Equation

• If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
• Before using the regression model, we need to assess how well it fits the data.

Example 14.7 (Example 14.3 continued)

Predict the selling price of a three-year-old Ford Laser with 40 000 km on the odometer (refer to Example 14.3).

$$\hat{y} = 19.611 - 0.0937(40) \approx 15.86$$

We call this value ($15,862) a point prediction. Chances are, though, that the actual selling price will be different, hence we can estimate the selling price in terms of an interval.

Prediction Interval and Confidence Interval

Two intervals can be used to discover how closely the predicted value will match the true value of y:
• prediction interval – for a particular value of y
• confidence interval – for the expected value of y.

The confidence interval

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

The prediction interval

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

where x_g is the given value of x. The prediction interval is wider than the confidence interval.

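a. Provide an interval estimate for the bidding price on a Ford Laser with 40 000 km on the odometer.

The prediction interval is

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 15.862 \pm 1.984(0.4526)\sqrt{1 + \frac{1}{100} + \frac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.904$$

The confidence interval for the mean price at the same odometer reading is

$$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 15.862 \pm 1.984(0.4526)\sqrt{\frac{1}{100} + \frac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.105$$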
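Both half-widths can be reproduced with a small helper, shown here as a sketch. The function name `interval_half_widths` is hypothetical, the critical value 1.984 is the one quoted in the slides, and the sum of squared deviations of x is taken to be SSx = 4307.378.

```python
import math


def interval_half_widths(x_given, n, x_bar, ss_x, s_epsilon, t_critical):
    """Return (prediction, confidence) half-widths at a given x."""
    # Shared term: 1/n + (x_g - x_bar)^2 / sum((x_i - x_bar)^2)
    shared = 1 / n + (x_given - x_bar) ** 2 / ss_x
    prediction = t_critical * s_epsilon * math.sqrt(1 + shared)
    confidence = t_critical * s_epsilon * math.sqrt(shared)
    return prediction, confidence


y_hat = 15.862  # point prediction at x = 40, as reported in the slides
pred_hw, conf_hw = interval_half_widths(
    x_given=40, n=100, x_bar=36.01, ss_x=4307.378,
    s_epsilon=0.4526, t_critical=1.984,
)

print(f"prediction interval: {y_hat - pred_hw:.3f} to {y_hat + pred_hw:.3f}")
print(f"confidence interval: {y_hat - conf_hw:.3f} to {y_hat + conf_hw:.3f}")
```

The only difference between the two intervals is the extra 1 under the square root, which accounts for the variability of an individual car around the mean price.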

What’s the Difference?

Prediction interval: used to estimate one individual value of y (at given x).
Confidence interval: used to estimate the mean value of y (at given x).

The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.

Intervals with Excel…

Add-Ins > Data Analysis Plus > Prediction Interval

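The Effect of the Given Value

The width of the confidence interval depends on how far the given value x_g lies from $\bar{x}$, where $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_g$.

The confidence interval when x_g = $\bar{x}$:

$$\hat{y} \pm t_{\alpha/2}\; s_\varepsilon \sqrt{\frac{1}{n}}$$

The confidence interval when x_g = $\bar{x} \pm 1$, since $(\bar{x} \pm 1) - \bar{x} = \pm 1$:

$$\hat{y} \pm t_{\alpha/2}\; s_\varepsilon \sqrt{\frac{1}{n} + \frac{1^2}{\sum (x_i - \bar{x})^2}}$$

The confidence interval when x_g = $\bar{x} \pm 2$, since $(\bar{x} \pm 2) - \bar{x} = \pm 2$:

$$\hat{y} \pm t_{\alpha/2}\; s_\varepsilon \sqrt{\frac{1}{n} + \frac{2^2}{\sum (x_i - \bar{x})^2}}$$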
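A brief sketch of this effect, reusing the used-car statistics: the confidence-interval half-width is smallest when the given value equals the mean of x and grows as the given value moves away. The helper name and the chosen distances (0, 1, 2 and 10 units from the mean) are illustrative assumptions.

```python
import math

n = 100
ss_x = 4307.378        # sum of squared deviations of x
s_epsilon = 0.4526
t_critical = 1.984     # t(0.025, 98), value quoted in the slides


def confidence_half_width(distance_from_mean):
    """Half-width of the confidence interval at x_g = x_bar +/- distance."""
    return t_critical * s_epsilon * math.sqrt(1 / n + distance_from_mean ** 2 / ss_x)


distances = [0, 1, 2, 10]
half_widths = [confidence_half_width(d) for d in distances]

for d, hw in zip(distances, half_widths):
    print(f"x_g = x_bar +/- {d:<2d}  half-width = {hw:.4f}")
```

Predictions made far from the centre of the observed x values are therefore less precise than those made near it.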

Coefficient of Correlation

• The coefficient values range between –1 and 1.
– If ρ = –1 (perfect negative linear association) or ρ = +1 (perfect positive linear association) every point falls on the regression line.
– If ρ = 0 there is no linear association.
• The coefficient can be used to test for linear relationships between two variables.

Coefficient of Correlation…

We estimate its value from sample data with the sample coefficient of correlation:

$$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$

The test statistic for testing if ρ = 0 is:

$$t = r\sqrt{\frac{n-2}{1-r^2}}$$

which is Student t-distributed with ν = n – 2 degrees of freedom.

Testing the Coefficient of Correlation

• When there is no linear relationship between two variables, ρ = 0.
• The hypotheses are:
H0: ρ = 0 (no linear relationship)
HA: ρ ≠ 0 (a linear relationship exists)
• The test statistic is:

$$t = r\sqrt{\frac{n-2}{1-r^2}}$$

The statistic is Student t-distributed with d.f. = n – 2, provided the variables are bivariate normally distributed.

[Figure: scatter plot with axes X and Y]

Example 14.9 (Example 14.3 continued)

Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 14.3.

– The sample coefficient of correlation r is needed first. Since R² = r² = 0.6533 and the slope is negative, r = −√0.6533 ≈ −0.8083.

Example 14.9…

• The value of the t-statistic is

$$t = r\sqrt{\frac{n-2}{1-r^2}} = -13.59$$

• Conclusion: There is sufficient evidence at the 5% level to infer that there is a linear relationship between the two variables.
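The correlation test can be sketched from the same summary statistics, assuming the sample coefficient of correlation is SSxy divided by the square root of (SSx times SSy), and the test statistic is r times the square root of (n − 2) / (1 − r²). The critical value 1.984 is the one quoted earlier in the slides; the variable names are illustrative.

```python
import math

n = 100
ss_x = 4307.378
ss_y = 57.89
ss_xy = -403.6207
t_critical = 1.984   # t(0.025, 98), value quoted in the slides

r = ss_xy / math.sqrt(ss_x * ss_y)              # sample coefficient of correlation
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))  # test statistic for H0: rho = 0
reject_h0 = abs(t_stat) > t_critical

print(f"r = {r:.4f}")
print(f"t = {t_stat:.2f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The statistic matches the one obtained when testing the slope, which is expected: in simple linear regression the two tests are equivalent.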

Posted: 05/06/2014, 08:40