Class Notes in Statistics and Econometrics, Part 18


Least Squares as the Normal Maximum Likelihood Estimate

Now assume ε is multivariate normal. We will show that in this case the OLS estimator β̂ is at the same time the Maximum Likelihood Estimator. For this we need to write down the density function of y. First look at one y_t, which is y_t ∼ N(x_t>β, σ²), where x_t> is the t-th row of X; the density function for this one observation is therefore

f(y_t) = (2πσ²)^(−1/2) exp(−(y_t − x_t>β)²/(2σ²)).

Linear transformations of normal variables are normal. Normal distributions are characterized by their mean vector and covariance matrix. The distribution of the MLE of β is therefore

β̂ ∼ N(β, σ²(X>X)⁻¹).
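As an illustration of this sampling distribution, the following R sketch draws repeated samples from the model and compares the empirical covariance of β̂ with σ²(X>X)⁻¹. The design matrix and parameter values below are arbitrary choices made only for this illustration.

# illustrative simulation; X, beta, sigma2 are made-up values
set.seed(1)
n <- 50; k <- 3
X <- matrix(rnorm(n*k), n, k)
beta <- c(1, -2, 0.5); sigma2 <- 4
XtXinv <- solve(t(X) %*% X)
betahats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sqrt(sigma2))
  drop(XtXinv %*% t(X) %*% y)          # OLS estimate for this sample
})
rowMeans(betahats)                      # close to beta
cov(t(betahats))                        # close to sigma2 * XtXinv
sigma2 * XtXinv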

If we replace β in the log likelihood function (35.0.5) by β̂, we get what is called the log likelihood function with β "concentrated out."
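For concreteness, here is the concentration step written out, assuming (35.0.5) has the standard form for the model y = Xβ + ε, ε ∼ N(o, σ²I); this form is an assumption of the sketch, not a quotation of (35.0.5):

\ell(\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
   - \frac{1}{2\sigma^2}(y - X\beta)^{\top}(y - X\beta),
\qquad
\ell(\hat\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
   - \frac{\hat\varepsilon^{\top}\hat\varepsilon}{2\sigma^2},
\qquad \hat\varepsilon = y - X\hat\beta .

Maximizing this concentrated function over σ² then gives σ̂² = ε̂>ε̂/n, the maximum likelihood estimator of the variance.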

Let’s look at the distribution of s² (from which that of its scalar multiples follows easily). It is a quadratic form in a normal variable. Such quadratic forms very often have χ² distributions.

Now recall equation (10.4.9), characterizing all the quadratic forms of multivariate normal variables that are χ²'s. Here it is again: Assume y is a multivariate normal vector random variable with mean vector µ and covariance matrix σ²Ψ, and Ω is a symmetric nonnegative definite matrix. Then (y − µ)>Ω(y − µ) ∼ σ²χ²_k iff

(35.0.9) ΨΩΨΩΨ = ΨΩΨ,

and k is the rank of ΨΩ.

This condition is satisfied in particular if Ψ = I (the identity matrix) and Ω² = Ω, and this is exactly our situation.

Therefore s² ∼ σ²χ²_{n−k}/(n − k), from which one obtains again unbiasedness, but also that var[s²] = 2σ⁴/(n − k), a result that one cannot get from mean and variance alone.
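A quick Monte Carlo check of this distribution; the design matrix and true coefficients below are arbitrary illustration choices.

# check s^2 ~ sigma^2 * chi^2_{n-k}/(n-k); setup is illustrative only
set.seed(2)
n <- 20; k <- 3; sigma2 <- 2
X <- matrix(rnorm(n*k), n, k)
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
s2 <- replicate(20000, {
  y <- X %*% rep(1, k) + rnorm(n, sd = sqrt(sigma2))
  e <- y - H %*% y                              # residuals
  sum(e^2) / (n - k)
})
mean(s2)                    # about sigma2 = 2  (unbiasedness)
var(s2)                     # about 2*sigma2^2/(n-k) = 8/17
2*sigma2^2/(n - k)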

Problem 395. 4 points. Show that, if y is normally distributed, s² and β̂ are independent.

Answer: We showed in question 300 that β̂ and ε̂ are uncorrelated, therefore in the normal case independent; therefore β̂ is also independent of any function of ε̂, such as σ̂².
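One can also see this independence in a simulation (again with an arbitrary, made-up design): across replications the sample correlation between each component of β̂ and s² should be near zero.

# independence of betahat and s^2 under normal errors; setup is illustrative only
set.seed(3)
n <- 30; k <- 2; sigma2 <- 1
X <- cbind(1, rnorm(n))
sims <- replicate(20000, {
  y <- X %*% c(2, -1) + rnorm(n, sd = sqrt(sigma2))
  b <- solve(t(X) %*% X, t(X) %*% y)
  e <- y - X %*% b
  c(b, sum(e^2)/(n - k))
})
cor(t(sims))[1:2, 3]        # correlations of betahat components with s^2, near 0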


Problem 396. Computer assignment: You run a regression with 3 explanatory variables, no constant term, the sample size is 20, the errors are normally distributed, and you know that σ² = 2. Plot the density function of s². Hint: The command dchisq(x, df=25) returns the density of a χ² distribution with 25 degrees of freedom evaluated at x. But the number 25 was only taken as an example; this is not the number of degrees of freedom you need here.

• a. In the same plot, plot the density function of the Theil-Schweitzer estimate. Can one see from the comparison of these density functions why the Theil-Schweitzer estimator has a better MSE?

Answer: Start with the Theil-Schweitzer plot, because it is higher.

> x <- seq(from = 0, to = 6, by = 0.01)
> Density <- (19/2)*dchisq((19/2)*x, df=17)
> plot(x, Density, type="l", lty=2)
> lines(x, (17/2)*dchisq((17/2)*x, df=17))
> title(main = "Unbiased versus Theil-Schweitzer")

Now let us derive the maximum likelihood estimator in the case of a nonspherical but positive definite covariance matrix, i.e., the model is y = Xβ + ε, ε ∼ N(o, σ²Ψ). The density function is

(35.0.11) f_y(y) = (2πσ²)^(−n/2) |det Ψ|^(−1/2) exp(−(1/(2σ²)) (y − Xβ)>Ψ⁻¹(y − Xβ)).


Problem 397. Derive (35.0.11) as follows: Take a matrix P with the property that Pε has covariance matrix σ²I. Write down the joint density function of Pε. Since y is a linear transformation of ε, one can apply the rule for the density function of a transformed random variable.

Answer: Write Ψ = QQ> with Q nonsingular and define P = Q⁻¹ and v = Pε. Then V[v] = σ²PQQ>P> = σ²I, therefore

(35.0.12) f_v(v) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) v>v).

For the transformation rule, write v, whose density function you know, as a function of y, whose density function you want to know: v = P(y − Xβ); therefore the Jacobian matrix is ∂v/∂y> = ∂(Py − PXβ)/∂y> = P, or one can see it also element by element.
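In practice one can take Q from a Cholesky factorization of Ψ. The following sketch, with an arbitrary made-up Ψ, constructs P = Q⁻¹ and checks that it whitens the covariance as claimed, so that v = P(y − Xβ) has covariance σ²I.

# whitening transformation for an arbitrary positive definite Psi (illustration only)
set.seed(4)
n <- 5
A <- matrix(rnorm(n*n), n, n)
Psi <- crossprod(A) + diag(n)        # some positive definite Psi
Q <- t(chol(Psi))                    # lower triangular, Psi = Q %*% t(Q)
P <- solve(Q)
P %*% Psi %*% t(P)                   # equals the identity matrix up to rounding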


From (35.0.11) one obtains the following log likelihood function:

log f_y(y) = −(n/2) log(2πσ²) − (1/2) log|det Ψ| − (1/(2σ²))(y − Xβ)>Ψ⁻¹(y − Xβ).

Next we concentrate out σ², i.e., we will compute the maximum of this likelihood function over σ² for any given set of values for the data and the other parameters: the maximizing value is σ̂² = (1/n)(y − Xβ)>Ψ⁻¹(y − Xβ), and substituting it back gives the concentrated log likelihood

−(n/2)(1 + log 2π − log n) − (n/2) log[(y − Xβ)>Ψ⁻¹(y − Xβ)] − (1/2) log|det Ψ|.


This objective function has to be maximized with respect to β and the parameters entering Ψ. If Ψ is known, then it is clearly maximized by the β̂ minimizing (26.0.9); therefore the GLS estimator is also the maximum likelihood estimator.

If Ψ depends on unknown parameters, it is interesting to compare the maximum likelihood estimator with the nonlinear least squares estimator. The objective function minimized by nonlinear least squares is (y − Xβ)>Ψ⁻¹(y − Xβ), which is the sum of squares of the innovation parts of the residuals. These two objective functions therefore differ by the factor (det[Ψ])^(1/n), which only matters if there are unknown parameters in Ψ. Asymptotically, the objective functions are identical.
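To see the role of the factor (det Ψ)^(1/n), here is a small sketch with a Ψ that depends on one unknown parameter ρ through an AR(1)-type correlation matrix. The AR(1) form, the design, and all numbers are assumptions made only for this illustration. For each trial ρ it evaluates the nonlinear least squares criterion SSR(ρ) and the maximum likelihood criterion (det Ψ)^(1/n)·SSR(ρ).

# compare NLS criterion SSR(rho) with the ML criterion det(Psi)^(1/n)*SSR(rho);
# the AR(1) form of Psi and all numbers are illustrative assumptions
set.seed(5)
n <- 40
X <- cbind(1, rnorm(n))
Psi_ar1 <- function(rho) rho^abs(outer(1:n, 1:n, "-"))
y <- X %*% c(1, 2) + drop(t(chol(Psi_ar1(0.6))) %*% rnorm(n))
crit <- function(rho) {
  Psi <- Psi_ar1(rho)
  Pinv <- solve(Psi)
  b <- solve(t(X) %*% Pinv %*% X, t(X) %*% Pinv %*% y)   # GLS for this rho
  ssr <- drop(t(y - X %*% b) %*% Pinv %*% (y - X %*% b))
  c(nls = ssr, ml = det(Psi)^(1/n) * ssr)
}
rhos <- seq(0.1, 0.9, by = 0.1)
sapply(rhos, crit)      # the two rows can be minimized at different values of rho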

Using the factorization theorem for sufficient statistics, one also sees easily that σ̂² and β̂ together form sufficient statistics for σ² and β. For this use the identity


Problem 398. 12 points. The log likelihood function in the linear model is given by (35.0.5). Show that the inverse of the information matrix is the block diagonal matrix with σ²(X>X)⁻¹ in the β block and 2σ⁴/n in the σ² block.

Answer: The log likelihood function can be written as

ln ℓ = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²))(y − Xβ)>(y − Xβ).


The first derivatives were already computed for the maximum likelihood estimators:

(35.0.23) ∂ ln ℓ/∂β> = −(1/(2σ²))(−2y>X + 2β>X>X) = (1/σ²)(y − Xβ)>X = (1/σ²) ε>X

(35.0.24) ∂ ln ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴))(y − Xβ)>(y − Xβ) = −n/(2σ²) + (1/(2σ⁴)) ε>ε

By the way, one sees that each of these has expected value zero, which is a fact that is needed to prove consistency of the maximum likelihood estimator.

The formula involving only first partial derivatives will be given first, although it is more tedious:

(35.0.26) E[(−n/(2σ²) + (1/(2σ⁴)) ε>ε)²] = var[−n/(2σ²) + (1/(2σ⁴)) ε>ε] = var[(1/(2σ⁴)) ε>ε] = (1/(4σ⁸)) · 2nσ⁴ = n/(2σ⁴).

One of the off-diagonal elements is (−n/(2σ⁴) + (1/(2σ⁶)) ε>ε) ε>X. Its expected value is zero: E[ε] = o, and also E[ε ε>ε] = o, since its ith component is E[ε_i Σ_j ε_j²] = Σ_j E[ε_i ε_j²]. If i ≠ j, then ε_i is independent of ε_j², therefore E[ε_i ε_j²] = 0 · σ² = 0. If i = j, we get E[ε_i³] = 0 since ε_i has a symmetric distribution.


It is easier if we differentiate once more:

(35.0.27) ∂² ln ℓ/(∂β ∂β>) = −(1/σ²) X>X

(35.0.28) ∂² ln ℓ/(∂β ∂σ²) = −(1/σ⁴) X>(y − Xβ) = −(1/σ⁴) X>ε

(35.0.29) ∂² ln ℓ/(∂σ²)² = n/(2σ⁴) − (1/σ⁶)(y − Xβ)>(y − Xβ) = n/(2σ⁴) − (1/σ⁶) ε>ε

This gives the top matrix in [JHG+88, (6.1.24b)]: taking expectations of the negatives of (35.0.27)–(35.0.29), the information matrix is block diagonal with blocks (1/σ²)X>X and n/(2σ⁴), and its inverse is therefore the matrix given in the problem statement. In particular, the Cramér-Rao lower bound for an unbiased estimator of σ² is 2σ⁴/n; s² with var[s²] = 2σ⁴/(n − k) does not attain the bound. However one can show with other means
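As a numerical cross-check of these expectations, one can simulate the two scores (35.0.23)–(35.0.24) at the true parameter values; their sample means should be near zero and their sample covariance matrix near the information matrix with blocks X>X/σ² and n/(2σ⁴). The design matrix and parameter values below are arbitrary illustration choices.

# Monte Carlo check: scores have mean zero and covariance equal to the
# information matrix; design and parameters are made-up illustration values
set.seed(6)
n <- 40; k <- 2; sigma2 <- 3
X <- cbind(1, rnorm(n))
scores <- replicate(20000, {
  eps <- rnorm(n, sd = sqrt(sigma2))
  c(drop(t(eps) %*% X) / sigma2,                 # score for beta, (35.0.23)
    -n/(2*sigma2) + sum(eps^2)/(2*sigma2^2))     # score for sigma^2, (35.0.24)
})
rowMeans(scores)                                  # all near 0
cov(t(scores))                                    # compare with:
rbind(cbind(t(X) %*% X / sigma2, 0), c(0, 0, n/(2*sigma2^2)))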


Bayesian Estimation in the Linear Model

The model is y = Xβ + ε with ε ∼ N(o, σ²I). Both y and β are random. The distribution of β, called the "prior information," is β ∼ N(ν, τ²A⁻¹) (Bayesians work with the precision matrix, which is the inverse of the covariance matrix). Furthermore β and ε are assumed independent. Define κ² = σ²/τ². To simplify matters, we assume that κ² is known.

Whether or not the probability is subjective, this specification implies that y and β are jointly Normal with

(36.0.32)
[ y ]        [ Xν ]      [ XA⁻¹X> + κ²I   XA⁻¹ ]
[   ]  ∼ N(  [    ] , τ² [                      ] )
[ β ]        [ ν  ]      [ A⁻¹X>           A⁻¹  ]


We can use theorem ?? to compute the best linear predictor β̂(y) of β on the basis of an observation of y. Due to Normality, β̂ is at the same time the conditional mean or "posterior mean" β̂ = E[β|y], and the MSE-matrix is at the same time the variance of the posterior distribution of β given y: MSE[β̂; β] = V[β|y]. A proof is given as answer to Question ??. Since one knows mean and variance of the posterior distribution, and since the posterior distribution is normal, the posterior distribution of β given y is known. This distribution is what the Bayesians are after. The posterior distribution combines all the information, prior information and sample information, about β.

According to (??), this posterior mean can be written as

(36.0.33) β̂ = ν + B∗(y − Xν),

where B∗ solves the "normal equation"

(36.0.34) B∗(XA⁻¹X> + κ²I) = A⁻¹X>.


These formulas are correct, but the Bayesians use mathematically equivalent formulas which have a simpler and more intuitive form. The solution of (36.0.34) can also be written as

(36.0.35) B∗ = (X>X + κ²A)⁻¹X>,

and (36.0.33) becomes

(36.0.36) β̂ = (X>X + κ²A)⁻¹(X>y + κ²Aν)
(36.0.37)   = (X>X + κ²A)⁻¹(X>Xβ̂ + κ²Aν)

where β̂ = (X>X)⁻¹X>y is the OLS estimate. Bayesians are interested in β̂ because this is the posterior mean. The MSE-matrix, which is the posterior covariance matrix, can also be written as

(36.0.38) MSE[β̂; β] = σ²(X>X + κ²A)⁻¹.

Problem 399. Show that B∗ as defined in (36.0.35) satisfies (36.0.34), that (36.0.33) with this B∗ becomes (36.0.36), and that (36) becomes (36.0.38).


Answer: Inserting (36.0.35) into the normal equation (36.0.34) gives

(36.0.39) (X>X + κ²A)⁻¹X>(XA⁻¹X> + κ²I) = (X>X + κ²A)⁻¹(X>XA⁻¹X> + κ²X>)
          = (X>X + κ²A)⁻¹(X>X + κ²A)A⁻¹X> = A⁻¹X>.

Now the solution formula:

(36.0.40) β̂ = ν + (X>X + κ²A)⁻¹X>(y − Xν)

(36.0.44) A⁻¹X>X + κ²I − A⁻¹X>(XA⁻¹X> + κ²I)⁻¹XA⁻¹X>X − κ²A⁻¹X>(XA⁻¹X> + κ²I)⁻¹X
          = A⁻¹X>X + κ²I − A⁻¹X>(XA⁻¹X> + κ²I)⁻¹(XA⁻¹X> + κ²I)X = κ²I


The formula (36.0.37) can be given two interpretations, neither of which is necessarily Bayesian. First interpretation: It is a matrix weighted average of the OLS estimate and ν, with the weights being the respective precision matrices. If ν = o, then the matrix weighted average reduces to β̂ = (X>X + κ²A)⁻¹X>y, which has been called a "shrinkage estimator" (Ridge regression), since the "denominator" is bigger: instead of "dividing by" X>X (strictly speaking, multiplying by (X>X)⁻¹), one "divides" by X>X + κ²A. If ν ≠ o, then the OLS estimate β̂ is "shrunk" not in direction of the origin but in direction of ν.
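A small R sketch of this posterior mean as a shrinkage estimator; all inputs (X, y, A, ν, κ²) below are made-up illustration values.

# posterior mean (36.0.36) as a matrix weighted average / shrinkage estimator;
# all inputs are arbitrary illustrative values
set.seed(7)
n <- 25; k <- 3
X <- matrix(rnorm(n*k), n, k)
y <- X %*% c(1, 0, -1) + rnorm(n)
A <- diag(k)                       # prior precision (up to the factor 1/tau^2)
nu <- rep(0, k)                    # prior mean
kappa2 <- 5                        # sigma^2 / tau^2, assumed known
b_ols   <- solve(t(X) %*% X, t(X) %*% y)
b_bayes <- solve(t(X) %*% X + kappa2 * A, t(X) %*% y + kappa2 * A %*% nu)
cbind(b_ols, b_bayes)              # with nu = 0 the posterior mean is shrunk toward 0

With A = I and ν = o this is exactly ridge regression with penalty κ².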

Second interpretation: It is as if, in addition to the data y = Xβ + ε, also an independent observation ν = β + δ with δ ∼ N(o, τ²A⁻¹) was available, i.e., as if the model was

(36.0.45)
[ y ]   [ X ]       [ ε ]          [ ε ]       [ o ]    [ σ²I   O     ]
[   ] = [   ] β  +  [   ]   with   [   ] ∼ N(  [   ] ,  [              ] )
[ ν ]   [ I ]       [ δ ]          [ δ ]       [ o ]    [ O     τ²A⁻¹ ]

The Least Squares objective function minimized by the GLS estimator β̂ in (36.0.45) is:

(36.0.46) (y − Xβ)>(y − Xβ) + κ²(β − ν)>A(β − ν).

In other words, β̂ is chosen such that at the same time Xβ̂ is close to y and β̂ is close to ν.
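The second interpretation can be checked numerically: if one weights the extra "observations" ν = β + δ so that ordinary least squares can be applied (equivalent to running the GLS of (36.0.45)), the result reproduces the posterior mean (36.0.36). All inputs below are arbitrary illustration values.

# OLS on the augmented, pre-weighted system equals the Bayesian posterior mean;
# inputs are illustrative only
set.seed(8)
n <- 25; k <- 3
X <- matrix(rnorm(n*k), n, k)
y <- X %*% c(1, 0, -1) + rnorm(n)
A <- diag(k); nu <- c(0.5, 0, -0.5); kappa2 <- 5
R <- chol(A)                                    # A = t(R) %*% R
X_aug <- rbind(X, sqrt(kappa2) * R)             # weighted prior "observations"
y_aug <- c(y, sqrt(kappa2) * R %*% nu)
b_aug   <- solve(t(X_aug) %*% X_aug, t(X_aug) %*% y_aug)
b_bayes <- solve(t(X) %*% X + kappa2 * A, t(X) %*% y + kappa2 * A %*% nu)
cbind(b_aug, b_bayes)                           # identical up to rounding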


Problem 400. Show that the objective function (36.0.46) is, up to a constant factor, the natural logarithm of the product of the prior density and the likelihood function. (Assume σ² and τ² known.) Note: if z ∼ N(θ, σ²Σ) with nonsingular covariance matrix σ²Σ, then its density function is (2πσ²)^(−n/2) |det Σ|^(−1/2) exp(−(z − θ)>Σ⁻¹(z − θ)/(2σ²)).

Answer: Prior density (2πτ²)^(−k/2) (det A)^(1/2) exp(−(β − ν)>A(β − ν)/(2τ²)); likelihood function (2πσ²)^(−n/2) exp(−(y − Xβ)>(y − Xβ)/(2σ²)); the posterior density is then proportional to the product of the two, i.e., proportional to

exp(−[(y − Xβ)>(y − Xβ) + κ²(β − ν)>A(β − ν)]/(2σ²)),

since κ² = σ²/τ². Its logarithm is therefore, up to an additive constant, −1/(2σ²) times the objective function (36.0.46).

Although frequentist and Bayesian approaches lead here to identical formulas, the interpretation is quite different. The BLUE/BLUP looks for best performance in repeated samples of y, while the Bayesian posterior density function is the best update of the prior information about β by information coming from this one set of observations.

Here is a textbook example of how Bayesians construct the parameters of their

prior distribution:


Problem 401. As in Problem 274, we will work with the Cobb-Douglas production function, which relates output Q to the inputs of labor L and capital K as follows:


and that our prior knowledge about α is not affected by (is independent of) our prior knowledge concerning β and γ. Assume that σ² is known and that it has the value σ² = 0.09. Furthermore, assume that our prior views about α, β, and γ can be adequately represented by a normal distribution. Compute from this the prior distribution of the vector β = (α β γ)>.

Answer: This is [JHG+88, pp. 288–290].

Here is my personal opinion on what to think of this. I always get uneasy when I see graphs like [JHG+88, Figure 7.2 on p. 283]. The prior information was specified on pp. 277/8: the marginal propensity to consume is with high probability between 0.75 and 0.95, and there is a 50-50 chance that it lies above or below 0.85. The least squares estimate of the MPC is 0.9, with a reasonable confidence interval. There is no multicollinearity involved, since there is only one explanatory variable. I see no reason whatsoever to take issue with the least squares regression result; it matches my prior information perfectly. However the textbook tells me that as a Bayesian I have to modify what the data tell me and take the MPC to be 0.88. This is only because of the assumption that the prior information is normal.

I think the Bayesian procedure is inappropriate here because the situation is so simple. Bayesian procedures have the advantage that they are coherent, and therefore can serve as a guide in complex estimation situations, when the researcher is tempted to employ ad-hoc procedures which easily become incoherent. The advantage of a Bayesian procedure is therefore that it prevents the researcher from stepping on his own toes too blatantly. In the present textbook situation, this advantage does not hold. On the contrary, the only situation where the researcher may be tempted to do something which he does not quite understand is in the above elicitation of prior information. It often happens that prior information gained in this way is self-contradictory, and the researcher is probably not aware what his naive assumptions about the variances of three linear combinations of two parameters imply for the correlation between them!

I can think of two justifications of Bayesian approaches. In certain situations the data are very insensitive, without this being a priori apparent. Widely different estimates give an almost as good fit to the data as the best one. In this case the researcher's prior information may make a big difference and it should be elicited.

Another justification of the Bayesian approach is the following: In many real-life situations the data manipulation and estimation which is called for is so complex that the researcher no longer knows what he is doing. In such a situation, a Bayesian procedure can serve as a guideline. The prior density may not be right, but at least everything is coherent.
