Chapter 1:
CLASSICAL LINEAR REGRESSION
I MODEL:
Population model: Y = f(X1, X2, …, Xk) + ε
- f may be of any kind (linear, non-linear, parametric, non-parametric, …)

Sample information: {Yi, Xi2, …, Xik}, i = 1, …, n

The linear regression model:

Yi = β1 + β2Xi2 + β3Xi3 + … + βkXik + εi

βk = ∂Yi/∂Xik, with the other factors held constant.

EX: Ci = β1 + β2Yi + εi
where Ci is the dependent variable, Yi is the explanatory variable (regressor), and εi is the disturbance (error).
β2 = ∂Ci/∂Yi = MPC (marginal propensity to consume) ⇒ we require 0 ≤ β2 ≤ 1.
Denote:

Y = [Y1 Y2 … Yn]′   (n×1)

X = the n×k matrix of regressors whose i-th row is (1, Xi2, Xi3, …, Xik):

    [ 1  X12  X13  …  X1k ]
    [ 1  X22  X23  …  X2k ]
    [ ⋮                 ⋮ ]
    [ 1  Xn2  Xn3  …  Xnk ]

β = [β1 β2 … βk]′   (k×1)   and   ε = [ε1 ε2 … εn]′   (n×1)

⇒ We have:   Y = Xβ + ε,   with dimensions (n×1) = (n×k)(k×1) + (n×1).
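As a quick illustration of this matrix form, here is a minimal numpy sketch (not part of the original notes; the sample size, k, and β values are arbitrary) that builds X with a column of ones and forms Y = Xβ + ε:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3                      # n observations, k coefficients (including the intercept)

# X: first column of ones (intercept), remaining k-1 columns are regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -2.0])  # illustrative "true" parameter vector (k x 1)
eps = rng.normal(size=n)           # disturbances

Y = X @ beta + eps                 # (n x 1) = (n x k)(k x 1) + (n x 1)
print(X.shape, beta.shape, Y.shape)
```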
II ASSUMPTIONS OF THE CLASSICAL REGRESSION MODEL:
Models are simplifications of reality. We will make a set of simplifying assumptions for the model. The assumptions are as follows:

Assumption 1: Linearity. The model is linear in the parameters:

Y = Xβ + ε

Assumption 2: Full rank. The Xij's are either not random variables, or they are random variables that are uncorrelated with ε. There are no exact linear dependencies among the columns of X.
This assumption is necessary for estimation of the parameters (no EXACT linear dependence is allowed).
Rank(X) = k, where X is n×k (so n ≥ k is required; the case Rank(X) = k = n is also acceptable).
Assumption 3: Exogeneity of the independent variables:

E[εi | Xj1, Xj2, Xj3, …, Xjk] = 0   for all i, j

This means that the independent variables carry no useful information for predicting εi. In particular, E(εi) = 0, i = 1, …, n.
Assumption 4: Homoscedasticity and no autocorrelation of the disturbances:

Var(εi) = σ²,  i = 1, …, n
Cov(εi, εj) = 0,  for all i ≠ j
For any random vector Z = [z1 z2 … zm]′, we can express its variance-covariance matrix as:

VarCov(Z) = E[(Z − E(Z))(Z − E(Z))′]

which is the m×m matrix whose (i, j) element is E[(zi − E(zi))(zj − E(zj))]:

    [ σ1²  σ12  …  σ1m ]
    [ σ21  σ2²  …  σ2m ]
    [ ⋮              ⋮ ]
    [ σm1  σm2  …  σm² ]

The j-th diagonal element is Var(zj) = σjj = σj²; the (i, j)-th element (i ≠ j) is Cov(zi, zj) = σij.

So we have the "covariance matrix" of the vector ε. Since E(ε) = 0:

VarCov(ε) = E[(ε − E(ε))(ε − E(ε))′] = E(εε′)
Then assumption (4) is equivalent to:

E(εε′) = σ²I =
    [ σ²  0   …  0  ]
    [ 0   σ²  …  0  ]
    [ ⋮           ⋮ ]
    [ 0   0   …  σ² ]

⇔  Var(εi) = σ², i = 1, …, n (homoscedasticity)   and   Cov(εi, εj) = 0 for all i ≠ j (no autocorrelation).
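To make the σ²I structure concrete, the sketch below (an illustration added here, not part of the original notes) simulates many independent homoscedastic error vectors and checks that the sample estimate of E(εε′) is close to σ²I:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 4, 2.0, 200_000

# Each row is one draw of the n x 1 error vector with Var(eps_i) = sigma2 and Cov(eps_i, eps_j) = 0
eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))

# Sample estimate of E[eps eps'] -- should be close to sigma2 * I_n
cov_hat = eps.T @ eps / reps
print(np.round(cov_hat, 2))   # approx. 2.0 on the diagonal, approx. 0 off the diagonal
```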
Assumption 5: Data generating process for the regressors (non-stochastic X):
+ The Xij's are not random variables.
Note: This assumption is different from assumption 3, which concerns E[εi | X] = 0.
Assumption 6: Normality of errors:

ε ~ N[0, σ²I]

+ Normality is not necessary to obtain many results in the regression model.
+ It will be possible to relax this assumption and retain most of the statistical results.
SUMMARY: The classical linear regression model is:

Y = Xβ + ε
ε ~ N[0, σ²I]
Rank(X) = k
X is non-stochastic
III LEAST SQUARES ESTIMATION:
(Ordinary Least Squares Estimation - OLS)
Our first task is to estimate the parameters of the model:
Y = Xβ + ε   with ε ~ N[0, σ²I]

There are many possible procedures for doing this. The choice should be based on the "sampling properties" of the estimates. Let's consider one possible estimation strategy: Least Squares.
Denote by β̂ the estimator of β, and by e = [e1 e2 … en]′ the vector of estimated residuals (ei is the estimate of εi).

For the i-th observation:

Yi = Xi′β + εi   (population, unobserved)
   = Xi′β̂ + ei   (sample, observed)

The least squares criterion chooses β̂ to minimize the sum of squared residuals:

Σ_{i=1}^{n} ei² = e′e = (Y − Xβ̂)′(Y − Xβ̂) = Y′Y − β̂′X′Y − Y′Xβ̂ + β̂′X′Xβ̂
              = Y′Y − 2β̂′X′Y + β̂′X′Xβ̂      (since β̂′X′Y = Y′Xβ̂)

Min over β̂:   Y′Y − 2β̂′X′Y + β̂′X′Xβ̂
The necessary condition for a minimum:

∂[e′e]/∂β̂ = 0,   i.e.   ∂[Y′Y − 2β̂′X′Y + β̂′X′Xβ̂]/∂β̂ = 0
Consider first the term β̂′X′Y:

β̂′X′Y = [β̂1 β̂2 … β̂k] X′Y,   where X′Y = [ΣYi, ΣXi2Yi, …, ΣXikYi]′   (sums taken over i = 1, …, n)

Taking the derivative with respect to each β̂j gives

∂[β̂′X′Y]/∂β̂ = X′Y = [ΣYi, ΣXi2Yi, …, ΣXikYi]′
Similarly, X′X is the k×k matrix

    [ n      ΣXi2      ΣXi3      …  ΣXik    ]
    [ ΣXi2   ΣXi2²     ΣXi2Xi3   …  ΣXi2Xik ]
    [ ΣXi3   ΣXi3Xi2   ΣXi3²     …  ΣXi3Xik ]
    [ ⋮                                   ⋮ ]
    [ ΣXik   ΣXikXi2   ΣXikXi3   …  ΣXik²   ]

a symmetric matrix of sums of squares and cross products.
β̂′X′Xβ̂ is a quadratic form:

β̂′(X′X)β̂ = Σ_{i=1}^{k} Σ_{j=1}^{k} (X′X)ij β̂i β̂j

Take the derivatives with respect to each β̂i. The term with j = i contributes ∂[(X′X)ii β̂i²]/∂β̂i = 2(X′X)ii β̂i; each pair of terms with j ≠ i contributes ∂[(X′X)ij β̂i β̂j + (X′X)ji β̂j β̂i]/∂β̂i = 2(X′X)ij β̂j, using the symmetry (X′X)ij = (X′X)ji. Hence

∂[β̂′(X′X)β̂]/∂β̂i = 2 Σ_{j=1}^{k} (X′X)ij β̂j,   i = 1, …, k

or, in matrix form,

∂[β̂′(X′X)β̂]/∂β̂ = 2(X′X)β̂
Therefore

∂[e′e]/∂β̂ = −2X′Y + 2(X′X)β̂ = 0      (called the "normal equations")

→ (X′X)β̂ = X′Y   →   β̂ = (X′X)⁻¹X′Y
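A minimal numpy sketch of the estimator just derived, β̂ = (X′X)⁻¹X′Y, on simulated data (the data-generating values are arbitrary; np.linalg.solve is applied to the normal equations rather than forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
Y = X @ beta_true + rng.normal(size=n)

# Normal equations: (X'X) beta_hat = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat                 # OLS residual vector

print(beta_hat)                      # close to beta_true
print(np.allclose(X.T @ e, 0))       # orthogonality condition X'e = 0 (see next section)
```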
IV ALGEBRAIC PROPERTIES OF LEAST SQUARES:
1 "Orthogonality condition":
⇔ −2X′Y +2(X′X)βˆ =0 (Normal equations)
⇔ ′( − )=0
e
X Y
⇔ X ′e=0
⇔
nk k
k k
n
X X
X X
X X
X X
1
1 1 1
3 2 1
2 32
22 12
n e
e e
2 1
=
0
0 0
e X
e n
i
i ij
n
i
i
, 1 0
0
1
=
=
∑
∑
=
=
2. Deviation-from-mean model (the fitted regression passes through (X̄, Ȳ)):

Yi = β̂1 + β̂2Xi2 + β̂3Xi3 + … + β̂kXik + ei,   i = 1, …, n

Sum over all n observations and divide by n (using Σ ei = 0):

Ȳ = β̂1 + β̂2X̄2 + β̂3X̄3 + … + β̂kX̄k

Then:

Yi − Ȳ = β̂2(Xi2 − X̄2) + β̂3(Xi3 − X̄3) + … + β̂k(Xik − X̄k) + ei,   i = 1, …, n

In the model in deviation form, the intercept is put aside and can be recovered later.
3. The mean of the fitted values Ŷi is equal to the mean of the actual values Yi in the sample:

Yi = Xi′β̂ + ei = Ŷi + ei

(1/n) Σ_{i=1}^{n} Yi = (1/n) Σ_{i=1}^{n} Ŷi + (1/n) Σ_{i=1}^{n} ei   and   Σ_{i=1}^{n} ei = 0,   so the mean of the Ŷi equals Ȳ.

Note that these results use the fact that the regression model includes an intercept term.
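A quick numerical check of properties 2 and 3 on illustrative simulated data (the fitted plane passes through the sample means, and the mean of the fitted values equals Ȳ):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat

print(np.isclose(Y.mean(), X.mean(axis=0) @ beta_hat))  # Y_bar = b1 + b2*X2_bar + ... + bk*Xk_bar
print(np.isclose(Y.mean(), Y_hat.mean()))                # mean of fitted values = mean of Y
```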
V PARTITIONED REGRESSION: FRISCH-WAUGH THEOREM:

1. Note: the fundamental idempotent matrix (M):

e = Y − Xβ̂
  = Y − X(X′X)⁻¹X′Y
  = [I − X(X′X)⁻¹X′] Y
  = MY,   where M = I − X(X′X)⁻¹X′   (n×n)

Substituting Y = Xβ + ε:

e = [I − X(X′X)⁻¹X′](Xβ + ε) = Xβ + ε − Xβ − X(X′X)⁻¹X′ε = [I − X(X′X)⁻¹X′]ε = Mε

So the residual vector e has two alternative representations:

e = MY   and   e = Mε

M is the "residual maker" in the regression of Y on X.
M is symmetric and idempotent, that is, M′ = M and MM = M:

M′ = [I − X(X′X)⁻¹X′]′ = I − X[(X′X)⁻¹]′X′ = I − X(X′X)⁻¹X′ = M      (using (AB)′ = B′A′ and the symmetry of X′X)

MM = [I − X(X′X)⁻¹X′][I − X(X′X)⁻¹X′] = I − 2X(X′X)⁻¹X′ + X(X′X)⁻¹X′X(X′X)⁻¹X′ = I − X(X′X)⁻¹X′ = M

Also we have:

MX = [I − X(X′X)⁻¹X′]X = X − X(X′X)⁻¹(X′X) = X − X = 0   (n×k)
2. Partitioned Regression:

Suppose that our matrix of regressors is partitioned into two blocks:

X = [X1  X2],   where X1 is n×k1, X2 is n×k2, and k = k1 + k2

Y = X1β1 + X2β2 + ε,   β̂ = [β̂1′  β̂2′]′

The normal equations:

(X′X)β̂ = X′Y
⇔ [X1 X2]′[X1 X2] β̂ = [X1 X2]′Y
⇔ [ X1′X1  X1′X2 ] [ β̂1 ]   =   [ X1′Y ]
  [ X2′X1  X2′X2 ] [ β̂2 ]       [ X2′Y ]

⇔  (X1′X1)β̂1 + (X1′X2)β̂2 = X1′Y   (a)
   (X2′X1)β̂1 + (X2′X2)β̂2 = X2′Y   (b)
From (a) → (X1′X1)βˆ1 = X1′(−X2βˆ2 +Y)
or βˆ1 =(X1′X1)−1X1′(−X2βˆ2 +Y) (c)
Put (c) into (b):
(X2′X1)(X1′X1)−1X1′(−X2βˆ2 +Y)+(X2′X2)βˆ2 = X2′Y
⇔ −X2′X1(X1′X1)−1X1′X2βˆ2 +(X2′X2)βˆ2 = X2′Y −X2′X1(X1′X1)−1X1′Y
⇔ X2′[I − X1(X1′X1)⁻¹X1′]X2 β̂2 = X2′[I − X1(X1′X1)⁻¹X1′]Y,   where M1 = I − X1(X1′X1)⁻¹X1′   (n×n)
We have: (X2′M1X2)βˆ2 =X2′M1Y → βˆ2 =(X2′M1X2)−1X2′M1Y
Because M1′ = M1 and M1M1 = M1, we can also write:

(X2′M1′M1X2)β̂2 = X2′M1′M1Y
⇔ (X2*′X2*)β̂2 = X2*′Y*

where X2* = M1X2 and Y* = M1Y.
Interpretation:
• Y* = M1Y: regress Y (n×1) on X1 (n×k1) and take the vector of residuals.
• X2* = M1X2: regress (each column of) X2 (n×k2) on X1 and take the matrix of residuals.

Equivalently: first regress Y on X1 and keep the residuals

e1 = Y − Ŷ = Y − X1[(X1′X1)⁻¹X1′Y] = M1Y   (n×1)

then regress X2 on X1 and keep the matrix of residuals

E = X2 − X̂2 = X2 − X1(X1′X1)⁻¹X1′X2 = [I − X1(X1′X1)⁻¹X1′]X2 = M1X2 = X2*   (n×k2)

Now regress e1 on E:

e1 = E β̃2 + u

Then we will have β̃2 = β̂2: we get the same result as if we had just run the regression on the whole model. This result is called the "Frisch-Waugh" theorem.
Example: Y = X1β1 + X2β2 + ε, where Y is wages and X1 = ability (test scores).
Y* = residuals from the regression of Y on X1 (= the variation in wages after controlling for ability)
X2* = residuals from the regression of X2 on X1
Then regress Y* on X2* → get β̂2:   Y* = X2*β2 + u
Example: De-trending and de-seasonalizing data:

Y = X1β1 + X2β2 + ε,   with X1 = t = [1 2 … n]′ (a time trend)

Either include "t" in the model, or "de-trend" X2 and Y by regressing each of them on "t" and taking the residuals; by the Frisch-Waugh theorem, both routes give the same β̂2.
Note: Including the trend in the regression is an effective way of de-trending the data.
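A numerical illustration of the Frisch-Waugh result, using a time trend as X1 so that the same code also serves as a de-trending example (all data are simulated; the coefficient values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120
t = np.arange(1.0, n + 1)

X1 = np.column_stack([np.ones(n), t])              # intercept + time trend
X2 = rng.normal(size=(n, 2)) + 0.05 * t[:, None]   # two trending regressors
X = np.column_stack([X1, X2])
Y = X @ np.array([2.0, 0.1, 1.0, -0.5]) + rng.normal(size=n)

# (i) full regression: the last two coefficients are beta2_hat
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)

# (ii) Frisch-Waugh: de-trend Y and X2 by regressing them on X1, then regress residuals on residuals
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
Y_star, X2_star = M1 @ Y, M1 @ X2
beta2_fw = np.linalg.solve(X2_star.T @ X2_star, X2_star.T @ Y_star)

print(np.allclose(beta_full[2:], beta2_fw))        # True: identical beta2_hat either way
```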
VI GOODNESS OF FIT:
One way of measuring the "quality of the fitted regression line" is to measure the extent to which the sample variation in the Y variable is explained by the model.

The sample variation in Y is Σ_{i=1}^{n}(Yi − Ȳ)²   (the sample variance is (1/n) Σ_{i=1}^{n}(Yi − Ȳ)²).

Write Y = Xβ̂ + e = Ŷ + e,   with Ŷ = Xβ̂ = X(X′X)⁻¹X′Y.

Now consider the following matrix:
M0 = I − (1/n) ĩ ĩ′   (n×n),   where ĩ = [1 1 … 1]′ is the n×1 vector of ones.

Note that M0 has (1 − 1/n) on the diagonal and −1/n off the diagonal, so

M0 Y = [Y1 − Ȳ,  Y2 − Ȳ,  …,  Yn − Ȳ]′

i.e. M0 transforms a vector into deviations from its sample mean.
We have:
• M0 ĩ = 0
• M0 is symmetric and idempotent, so Y′M0′M0Y = Y′M0Y = (M0Y)′(M0Y) = Σ_{i=1}^{n}(Yi − Ȳ)²
M0Y = M0Xβ̂ + M0e = M0Ŷ + M0e

Recall that X′e = 0 and Σ ei = 0 → M0e = e, and hence e′M0X = e′X = 0.

→ Y′M0Y = (Xβ̂ + e)′(M0Xβ̂ + M0e)
        = β̂′X′M0Xβ̂ + β̂′X′M0e + e′M0Xβ̂ + e′M0e
        = β̂′X′M0Xβ̂ + e′M0e
So:

Σ_{i=1}^{n}(Yi − Ȳ)²  =  Σ_{i=1}^{n}(Ŷi − Ȳ)²  +  Σ_{i=1}^{n} ei²
        SST                    SSR                  SSE

(using the fact that the mean of Ŷi is Ȳ and Ŷ = Xβ̂, so that β̂′X′M0Xβ̂ = Ŷ′M0Ŷ = Σ_{i=1}^{n}(Ŷi − Ȳ)², and e′M0e = e′e)

SST: Total sum of squares
SSR: Regression sum of squares
SSE: Error sum of squares

Coefficient of Determination:
R² = SSR/SST = 1 − SSE/SST      (only if an intercept is included in the model)

Since R² = 1 − SSE/SST ≤ 1 and R² = SSR/SST ≥ 0,

⇒ 0 ≤ R² ≤ 1
What happens if we add any regressor(s) to the model?

Y = X1β1 + ε               (1)
Y = X1β1 + X2β2 + u        (2)

(A) Applying OLS to (2): min over (β̂1, β̂2) of û′û
(B) Applying OLS to (1): min over β̂1 of e′e

The minimized sum of squares in (A) must be ≤ that in (B), so û′û ≤ e′e.
→ Adding any regressor(s) to the model cannot increase (and will typically decrease) the sum of squared residuals, so R² cannot fall. This makes R² on its own not a very interesting measure of the quality of a regression.
For this reason, we often use the "adjusted" R², adjusted for degrees of freedom:

R̄² = 1 − [e′e/(n − k)] / [Y′M0Y/(n − 1)]

Note: e′e = Y′MY with rank(M) = n − k, so e′e has n − k degrees of freedom; Y′M0Y = Σ_{i=1}^{n}(Yi − Ȳ)² has n − 1 degrees of freedom.

R̄² may rise or fall when variables are added; it may even be negative.
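A short sketch computing SST, SSR, SSE, R², and the adjusted R̄² from the formulas above (simulated data; the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e = Y - Y_hat

SST = np.sum((Y - Y.mean()) ** 2)      # total sum of squares
SSR = np.sum((Y_hat - Y.mean()) ** 2)  # regression sum of squares
SSE = e @ e                            # error (residual) sum of squares

R2 = 1 - SSE / SST                            # equals SSR / SST when an intercept is included
R2_adj = 1 - (SSE / (n - k)) / (SST / (n - 1))
print(round(R2, 3), round(R2_adj, 3), np.isclose(SST, SSR + SSE))
```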
Note: If the model does not include an intercept, then the identity SST = SSR + SSE does not hold, and we no longer have 0 ≤ R² ≤ 1. We must also be careful in comparing R² across different models. For example:

(1) Ci regressed on Yi (in levels)
(2) logCi = 0.2 + 0.7 logYi + u,   R² = 0.7

In (1), R² relates to the sample variation of the variable C; in (2), R² relates to the sample variation of the variable log(C), so the two values are not directly comparable.

Reading at home: Greene, chapters 3 & 4.