Chapter 14
Simple linear regression and correlation
Introduction
Our problem objective is to analyse the relationship between numerical variables; regression analysis is the first tool we will study.
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted $Y$
Independent variables: denoted $X_1, X_2, \dots, X_k$
Correlation Analysis…
If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier.
This chapter will examine the relationship between two variables, sometimes called simple linear regression.
Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.
Model Types…
Deterministic Model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.
Contrast this with…
Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
E.g. do all houses of the same size (measured in square metres) sell for exactly the same price?
A Model…
To create a probabilistic model, we start with a deterministic model that approximates the relationship we want to model and add a random term that measures the error of the deterministic component.
A Model…
A model of the relationship between house size (independent variable) and house price (dependent variable) would be:
[Figure: house price plotted against house size, annotated "Most lots sell…"]
In real life, however, the house price will vary even among houses of the same size:
Same house size, but different price points (e.g. décor options, cabinet upgrades, lot location…).
[Figure: house price plotted against house size, with 200K$ marked on the price axis; lower vs higher variability]
House price = 200 000 + 800(Size) + $\varepsilon$
Random Term…
We now represent the price of a house as a function of its size in this probabilistic model:
$y = 200\,000 + 800x + \varepsilon$
where $\varepsilon$ (Greek letter epsilon) is the random term (a.k.a. error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the area of the house (i.e. x) remains the same, due to other factors such as the location, age, décor etc. of the house.
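The role of the random term can be illustrated with a small simulation. This is only a sketch: the model $y = 200\,000 + 800x + \varepsilon$ comes from the slide, but the normal distribution of the error, its standard deviation (`sd`), the house size of 150 square metres and the function name `house_price` are all assumptions made for illustration.

```python
import random

def house_price(size_m2, rng, sd=20_000):
    """Price from the slide's model y = 200 000 + 800x + epsilon.

    The normal error with standard deviation `sd` is an assumption made
    for illustration only; the slides do not give a value for it.
    """
    deterministic = 200_000 + 800 * size_m2   # same for every house of this size
    epsilon = rng.gauss(0, sd)                # random term: differs from sale to sale
    return deterministic + epsilon

rng = random.Random(1)                        # fixed seed so the run is repeatable
prices = [house_price(150, rng) for _ in range(5)]
deterministic_part = 200_000 + 800 * 150      # 320 000 for a 150 square-metre house

for p in prices:
    print(f"price = {p:,.0f}   epsilon = {p - deterministic_part:,.0f}")
```

Every simulated house has the same size, so the deterministic component is identical; only the random term makes the prices differ.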
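14.1 Simple Linear Regression Model
A straight line model with one independent variable is called a first order linear model or a simple linear regression model. It is written
$y = \beta_0 + \beta_1 x + \varepsilon$
Simple Linear Regression Model…
where
$y$ = dependent variable
$x$ = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line
$\varepsilon$ = error variable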
14.2 Estimating the Coefficients
In much the same way we base estimates of $\mu$ on $\bar{x}$, we estimate $\beta_0$ using $\hat{\beta}_0$ and $\beta_1$ using $\hat{\beta}_1$, the y-intercept and slope (respectively) of the least squares or regression line given by:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
(Recall: this is an application of the least squares method and it produces a straight line that minimises the sum of the squared differences between the points and the line.)
Least Squares Method
The question is:
• Which straight line fits best?
• The least squares line minimises the sum of squared differences between the points and the line.
Least Squares Method…
The best line is the one that minimises the sum of squared vertical differences between the points and the line.
[Figure: scatter of data points, including (1, 2), with two candidate lines labelled Line 1 and Line 2]
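The comparison of two candidate lines can be sketched in a few lines of code. Only the point (1, 2) is legible on the slide; the other three points and the two lines used here (`y = x` and `y = 2.5`) are assumptions chosen purely for illustration.

```python
def sum_squared_differences(points, intercept, slope):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

# Only (1, 2) appears on the slide; the remaining points are hypothetical.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

line_1 = sum_squared_differences(points, intercept=0.0, slope=1.0)   # y = x
line_2 = sum_squared_differences(points, intercept=2.5, slope=0.0)   # y = 2.5

print(f"Line 1: {line_1:.2f}, Line 2: {line_2:.2f}")
```

For these assumed points the second line has the smaller sum of squared differences, so by the least squares criterion it fits better than the first. The least squares line itself would give a smaller sum than either.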
To calculate the estimates of the coefficients that minimise the differences between the data points and the line, use the formulas:
$\hat{\beta}_1 = \dfrac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}$ or $\hat{\beta}_1 = \dfrac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Least Squares Estimates…
$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = \sum x_i^2 - n\bar{x}^2$
$SS_y = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = \sum y_i^2 - n\bar{y}^2$
$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = \sum x_i y_i - n\bar{x}\bar{y}$
Least Squares Estimates…
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_x}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
The estimated simple linear regression equation that estimates the equation of the first-order linear model is:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
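The shortcut formulas for the least squares estimates translate directly into code. The sketch below uses a tiny hypothetical data set (not taken from the slides) and a hypothetical helper name, `least_squares`, to show the order of the calculations.

```python
def least_squares(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Shortcut forms of the sums of squares
    ss_x = sum(x * x for x in xs) - n * x_bar ** 2
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b1 = ss_xy / ss_x            # slope estimate
    b0 = y_bar - b1 * x_bar      # intercept estimate
    return b0, b1

# Hypothetical data, chosen so the arithmetic is easy to check by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
print(f"y-hat = {b0:.1f} + {b1:.1f}x")
```

For this data set, $SS_x = 10$ and $SS_{xy} = 6$, so the slope is 0.6 and the intercept is $4 - 0.6(3) = 2.2$.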
Example 14.3
A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
A random sample of 100 cars is selected and the data are recorded in file XM21-03.
Find the regression line.
[Data table with columns: Car, Odometer, Price]
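To calculate $\hat{\beta}_0$ and $\hat{\beta}_1$ we need to calculate several statistics first:
$\bar{x} = 36.01$; $\bar{y} = 16.24$
$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 4307.378$
$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = -403.6207$
where n = 100.
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{-403.6207}{4307.378} = -0.0937$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 16.24 - (-0.0937)(36.01) = 19.611$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 19.611 - 0.094x$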
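The arithmetic of this example can be reproduced from the summary statistics quoted on the slide. This is a sketch using the rounded values shown there; the variable names are illustrative. With the rounded means the intercept comes out at roughly 19.61, so the slide's 19.611 was presumably computed from unrounded inputs.

```python
# Summary statistics quoted in the worked example (n = 100 cars).
x_bar, y_bar = 36.01, 16.24          # sample means of odometer and price
ss_x, ss_xy = 4307.378, -403.6207    # sums of squares from the slide

b1 = ss_xy / ss_x                    # slope estimate
b0 = y_bar - b1 * x_bar              # intercept estimate

print(f"slope = {b1:.4f}, intercept = {b0:.2f}")
```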
Data > Data Analysis > Regression > [Highlight the data y range and x range] > OK
$\hat{y} = 19.611 - 0.094x$
$\hat{\beta}_1 = -0.094$: this is the slope of the line.
For each additional kilometre on the odometer, the price decreases by an average of $0.094.
Do not interpret the intercept as the ‘price of cars that have not been driven’.
[Figure: price plotted against odometer reading, with the regression line]
– The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
– The standard deviation of $\varepsilon$ is a constant ($\sigma_\varepsilon$) for all values of x.
– The errors are independent.
– The errors are independent of the independent variable x.
– The probability distribution of $\varepsilon$ is normal.
From the first three assumptions we have:
y is normally distributed with mean $E(y) = \beta_0 + \beta_1 x$ and a constant standard deviation $\sigma_\varepsilon$.
The standard deviation remains constant … but the mean value changes with x.
14.4 Assessing the Model
• The least squares method will produce a regression line whether or not there is a linear relationship between x and y
• Consequently, it is important to assess how well the linear model fits the data
• Several methods are used to assess the model:
– testing and/or estimating the regression model
coefficients
– using descriptive measurements such as the sum
of squares for errors (SSE).
Sum of Squares for Errors (SSE)
– This is the sum of squared differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data.
– The sum of squares for errors is calculated as
$SSE = \sum (y_i - \hat{y}_i)^2$ OR $SSE = SS_y - \dfrac{SS_{xy}^2}{SS_x}$
– This statistic plays a role in every statistical technique we employ to assess the model.
Calculate the standard error of estimate for Example 14.3 and describe what it tells you about the model fit.
Solution
$SS_y = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 57.8931$
$SSE = SS_y - \dfrac{SS_{xy}^2}{SS_x} = 57.8931 - \dfrac{(-403.6207)^2}{4307.378} = 20.072$
Thus,
$s_\varepsilon = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{20.072}{98}} = 0.4526$
Example 14.4 Solution…
The value of $s_\varepsilon$ is judged by comparing it to the sample mean value of y: $s_\varepsilon = 0.4526$ and $\bar{y} = 16.24$.
In this example, $s_\varepsilon$ is only 2.8% of the sample mean of y (0.4526/16.24 ≈ 0.028). Therefore, we can conclude that the standard error of estimate is reasonably small.
$s_\varepsilon$ cannot be used alone as an absolute measure of the model’s utility, but it can be used to compare models.
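The calculation of SSE and the standard error of estimate can be checked with a short script. This is a sketch based on the summary statistics recovered from the slides; the value used for $SS_y$ (57.8931) is read from a garbled line and should be treated as an assumption.

```python
import math

n = 100
ss_x, ss_xy = 4307.378, -403.6207   # from the regression-line example
ss_y = 57.8931                      # assumed: recovered from a garbled slide line
y_bar = 16.24                       # sample mean of price

sse = ss_y - ss_xy ** 2 / ss_x      # shortcut formula for SSE
s_eps = math.sqrt(sse / (n - 2))    # standard error of estimate
relative = s_eps / y_bar            # s relative to the sample mean of y

print(f"SSE = {sse:.3f}, s = {s_eps:.4f}, s / y-bar = {relative:.1%}")
```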
Testing the Slope
• When no linear relationship exists between two variables, the regression line should be horizontal.
• We can draw inferences about the slope coefficient $\beta_1$ from $\hat{\beta}_1$ by testing
$H_0$: $\beta_1 = 0$ (no linear relationship exists between Y and X)
$H_A$: $\beta_1 \neq 0$ (a linear relationship exists between Y and X)
– The test statistic is
$t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$ where $s_{\hat{\beta}_1} = \dfrac{s_\varepsilon}{\sqrt{SS_x}}$
– If the error variable is normally distributed, the statistic has a Student t-distribution with d.f. = n – 2.
– The rejection region depends on whether we are performing a one- or two-tail test.
If we wish to test for positive or negative linear relationships we conduct one-tail tests, i.e. our research (alternative) hypotheses become:
$H_A$: $\beta_1 < 0$ (testing for a negative slope)
or
$H_A$: $\beta_1 > 0$ (testing for a positive slope)
Of course, the null hypothesis remains: $H_0$: $\beta_1 = 0$.
Example 14.4
(Example 14.3 continued)
Test to determine whether there is enough evidence to infer that a linear relationship exists between the price and the odometer reading at the 5% significance level.
Solution (Solving by hand)
$H_0$: $\beta_1 = 0$ (no linear relationship)
$H_A$: $\beta_1 \neq 0$ (a linear relationship exists)
– If the null hypothesis is rejected, we conclude that there is a significant linear relationship between price and odometer reading.
– The test statistic t has a t-distribution with 98 (= 100 – 2) degrees of freedom.
– Level of significance $\alpha$ = 0.05.
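• To compute t we need the values of $\hat{\beta}_1$ and $s_{\hat{\beta}_1}$:
$\hat{\beta}_1 = -0.0937$
$s_{\hat{\beta}_1} = \dfrac{s_\varepsilon}{\sqrt{SS_x}} = \dfrac{0.4526}{\sqrt{4307.378}} = 0.0069$
$t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} = \dfrac{-0.0937 - 0}{0.0069} = -13.59$
• Decision rule:
Reject $H_0$ if $|t| > t_{0.025,98} = 1.984$
• Comparing the decision rule with the calculated t-value (= –13.59), we reject $H_0$ and conclude that the odometer readings do affect the sale price.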
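The hand calculation of the slope test can be reproduced as follows. This is a sketch that takes the summary statistics and the critical value 1.984 from the slides rather than computing them from raw data or a t-table; the variable names are illustrative.

```python
import math

n = 100
ss_x, ss_xy = 4307.378, -403.6207
s_eps = 0.4526                      # standard error of estimate from the slides
t_crit = 1.984                      # t(0.025, 98), as quoted on the slide

b1 = ss_xy / ss_x                   # estimated slope
s_b1 = s_eps / math.sqrt(ss_x)      # standard error of the slope
t = (b1 - 0) / s_b1                 # test statistic under H0: beta1 = 0

reject_h0 = abs(t) > t_crit         # two-tail decision rule
print(f"s_b1 = {s_b1:.4f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Using the unrounded slope and standard error gives a t-value of about –13.59, in line with the slide and the Excel output.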
Example 14.4 Solution …
Using the computer
Excel regression output:
               Coefficients    Standard Error   t Stat     P-value
Intercept      19.61139281     0.252410094      77.69655   7.53E-90
Odometer (x)   -0.093704502    0.006895663      -13.5889   2.84E-24
Looking at the p-value of the slope coefficient, there is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination
• The tests thus far are used to conclude whether a linear (positive or negative) relationship exists.
• When we want to measure the strength of the linear relationship, we use the coefficient of determination, $R^2$, defined as follows:
$R^2 = \dfrac{SS_{xy}^2}{SS_x \, SS_y}$ or $R^2 = 1 - \dfrac{SSE}{SS_y}$
• For a simple linear regression model, the coefficient of determination is the squared value of the coefficient of correlation (r), i.e. $R^2 = r^2$.
Coefficient of Determination…
• To understand the significance of this coefficient, note:
Overall variability in y
– is explained in part by the regression model
– remains, in part, unexplained (the error).
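Coefficient of Determination…
[Figure: two data points and the regression line, showing the explained deviations $(\hat{y}_1 - \bar{y})^2$, $(\hat{y}_2 - \bar{y})^2$ and the unexplained deviations $(y_1 - \hat{y}_1)^2$, $(y_2 - \hat{y}_2)^2$]
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
Total variation in y = variation explained by the regression line + unexplained variation (error)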
Coefficient of Determination…
As we did with analysis of variance, we can partition the variation in y into two parts:
SSE: variation in y that remains unexplained (i.e. due to error)
SSR: amount of variation in y explained by variation in the independent variable x.
Coefficient of Determination…
• $R^2$ measures the proportion of the variation in y that is explained by the variation in x:
$R^2 = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{SSE}{SST} = \dfrac{SST - SSE}{SST} = \dfrac{SSR}{SST}$
SST = variation in y = SSR + SSE
• $R^2$ takes on any value between zero and one.
• $R^2 = 1$: perfect match between the line and the data points.
• $R^2 = 0$: there is no linear relationship between x and y.
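The partition SST = SSR + SSE and the resulting $R^2$ can be verified numerically. The sketch below uses a small hypothetical data set (not from the slides) so that each sum can be checked by hand.

```python
# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares line
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                # total variation in y
ssr = sum((f - y_bar) ** 2 for f in fitted)            # explained by the line
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained (error)

r_squared = 1 - sse / sst
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}, R^2 = {r_squared:.2f}")
```

Here SST = 6, SSR = 3.6 and SSE = 2.4, so 60% of the variation in y is explained by the variation in x.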
Coefficient of Determination…
• In general, the higher the value of $R^2$, the better the model fits the data.
• Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
Example 14.7
(Example 14.3 continued)
Find the coefficient of determination for Example 14.3. What does this statistic tell you about the model?
Solution – Solving by hand
$R^2 = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{20.07}{57.89} = 0.6533$
Therefore, 65% of the variation in the selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model, i.e. due to error.
Solution – Using the computer
From the regression output we have $R^2 = 0.6533$.
More on Excel’s Output
An analysis of variance (ANOVA) table for the
simple linear regression model can be given by:
Source Degrees of freedom Sums of squares Mean squares F-statistic
14.5 Using the Regression Equation
• If we are satisfied with how well the model fits the data, we can use it to make
predictions for y.
• Before using the regression model, we need to assess how well it fits the data.
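Example 14.7
(Example 14.3 continued)
Predict the selling price of a three-year-old Ford Laser with 40 000 km on the odometer (refer to Example 14.3).
Substituting x = 40 (odometer reading in thousands of kilometres) into the regression line gives
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(40) \approx 15.862$
We call this value ($15,862) a point prediction.
Chances are, though, that the actual selling price will be different; hence we can estimate the selling price in terms of an interval.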
Prediction Interval and Confidence Interval
Two intervals can be used to discover how closely the predicted value will match the true value of y:
• prediction interval – for a particular value of y
• confidence interval – for the expected value of y.
The confidence interval
$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The prediction interval
$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
where $x_g$ is the given value of x.
The prediction interval is wider than the confidence interval.
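a. Provide an interval estimate for the bidding price on a Ford Laser with 40 000 km on the odometer.
$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$= 15.862 \pm 1.984(0.4526)\sqrt{1 + \dfrac{1}{100} + \dfrac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.904$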
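Confidence interval for the expected value of y at the same odometer reading:
$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$= 15.862 \pm 1.984(0.4526)\sqrt{\dfrac{1}{100} + \dfrac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.105$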
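Both interval calculations can be reproduced with one small function. This is a sketch using the values quoted on the slides (n = 100, mean odometer 36.01, sum of squares 4307.378, standard error 0.4526, critical value 1.984, point prediction 15.862); the function name `interval_half_widths` is hypothetical.

```python
import math

def interval_half_widths(x_g, n, x_bar, ss_x, s_eps, t_crit):
    """Return (confidence, prediction) half-widths at the given value x_g."""
    distance_term = 1 / n + (x_g - x_bar) ** 2 / ss_x
    confidence = t_crit * s_eps * math.sqrt(distance_term)       # mean of y
    prediction = t_crit * s_eps * math.sqrt(1 + distance_term)   # single y
    return confidence, prediction

y_hat = 15.862   # point prediction at x_g = 40, from the slides
ci, pi = interval_half_widths(x_g=40, n=100, x_bar=36.01,
                              ss_x=4307.378, s_eps=0.4526, t_crit=1.984)

print(f"prediction interval: {y_hat - pi:.3f} to {y_hat + pi:.3f}")
print(f"confidence interval: {y_hat - ci:.3f} to {y_hat + ci:.3f}")
```

The only difference between the two formulas is the extra 1 under the square root, which is why the prediction interval is always the wider of the two.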
What’s the Difference?
Prediction interval: used to estimate one value of y (at given x).
Confidence interval: used to estimate the mean value of y (at given x).
The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
Intervals with Excel…
Add-Ins > Data Analysis Plus > Prediction Interval
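The Effect of the Given Value
The confidence interval
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval when $x_g = \bar{x}$
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n}}$
The confidence interval when $x_g = \bar{x} \pm 1$
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{1^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval when $x_g = \bar{x} \pm 2$
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{2^2}{\sum (x_i - \bar{x})^2}}$
The interval is narrowest at $x_g = \bar{x}$ and widens as $x_g$ moves away from $\bar{x}$.
[Figure: the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ with $\hat{y}$ marked at $x_g = \bar{x} \pm 1$ and $x_g = \bar{x} \pm 2$]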
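How the confidence interval changes with the given value can be tabulated directly. The sketch below reuses the summary statistics of the odometer example as an assumption; the offsets from the mean are chosen only for illustration.

```python
import math

# Summary statistics from the odometer example (assumed here for illustration).
n, x_bar, ss_x = 100, 36.01, 4307.378
s_eps, t_crit = 0.4526, 1.984

def ci_half_width(x_g):
    """Half-width of the confidence interval for the mean of y at x_g."""
    return t_crit * s_eps * math.sqrt(1 / n + (x_g - x_bar) ** 2 / ss_x)

offsets = [0, 1, 2, 10, 20]          # distance of x_g from the sample mean
widths = {d: ci_half_width(x_bar + d) for d in offsets}

for d, w in widths.items():
    print(f"x_g = x-bar + {d:>2}: half-width = {w:.4f}")
```

The half-width depends on $x_g$ only through $(x_g - \bar{x})^2$, so it is the same on either side of the mean and grows with the distance from it.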
• The values of the coefficient of correlation $\rho$ range between –1 and 1.
– If $\rho = -1$ (perfect negative linear association) or $\rho = +1$ (perfect positive linear association) every point falls on the regression line.
– If $\rho = 0$ there is no linear association.
• The coefficient can be used to test for linear relationships between two variables.
Coefficient of Correlation…
We estimate its value from sample data with the sample coefficient of correlation:
$r = \dfrac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$
The test statistic for testing if $\rho = 0$ is:
$t = r\sqrt{\dfrac{n-2}{1-r^2}}$
which is Student t-distributed with $\nu = n - 2$ degrees of freedom.
Testing the Coefficient of Correlation
• When there is no linear relationship between two variables, $\rho = 0$.
• The hypotheses are:
$H_0$: $\rho = 0$ (no linear relationship)
$H_A$: $\rho \neq 0$ (a linear relationship exists)
• The test statistic is:
$t = r\sqrt{\dfrac{n-2}{1-r^2}}$
The statistic is Student t-distributed with d.f. = n – 2, provided the variables are bivariate normally distributed.
[Figure: scatter plot of Y against X]
Example 14.9
(Example 14.3 continued)
Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 14.3.
– The sample coefficient of …
Example 14.9…
The value of the t-statistic is
$t = r\sqrt{\dfrac{n-2}{1-r^2}} = -13.59$
Conclusion: There is sufficient evidence at the 5% level to infer that there is a linear relationship between the two variables.
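The correlation test can be reproduced from the same sums of squares used earlier. This is a sketch that assumes the formula $r = SS_{xy}/\sqrt{SS_x SS_y}$, which agrees with the expression for $R^2$ given in the chapter, and takes $SS_y$ = 57.8931 as recovered from the slides.

```python
import math

n = 100
ss_x, ss_xy = 4307.378, -403.6207
ss_y = 57.8931                               # assumed: recovered from the slides

r = ss_xy / math.sqrt(ss_x * ss_y)           # sample coefficient of correlation
t = r * math.sqrt((n - 2) / (1 - r ** 2))    # test statistic for H0: rho = 0

t_crit = 1.984                               # t(0.025, 98), as quoted earlier
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, t = {t:.2f}, "
      f"reject H0: {abs(t) > t_crit}")
```

The t-value matches the one obtained when testing the slope, which is expected: in simple linear regression the two tests are equivalent.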