Class Notes in Statistics and Econometrics, Part 19



CHAPTER 37

OLS With Random Constraint

A Bayesian considers the posterior density the full representation of the information provided by sample and prior information. Frequentists have discovered that one can interpret the parameters of this density as estimators of the key unknown parameters, and that these estimators have good sampling properties. Therefore they have tried to re-derive the Bayesian formulas from frequentist principles.

If β satisfies the constraint Rβ = u only approximately or with uncertainty, it has therefore become customary to specify

(37.0.55)   Rβ = u + η,   η ∼ (o, τ²Φ),   η and ε uncorrelated.

Here it is assumed that τ² > 0 and Φ is positive definite.


Both interpretations are possible here: either u is a constant, which necessarily means that β is random, or β is as usual a constant and u is random, coming from whoever happened to do the research (this is why it is called "mixed estimation").

It is the correct procedure in this situation to do GLS on the stacked model (here with Φ = I)

(37.0.56)   [y; u] = [X; R]β + [ε; −η],   [ε; −η] ∼ ( [o; o],  σ² [I O; O κ⁻²I] ),

in which y and u are stacked into one vector, X and R are stacked into one regressor matrix, and the covariance matrix of the stacked disturbance [ε; −η] is block-diagonal. Therefore

(37.0.57)   β̂ = (X⊤X + κ²R⊤R)⁻¹(X⊤y + κ²R⊤u),

where κ² = σ²/τ².

This β̂ is the BLUE if in repeated samples β and u are drawn from such distributions that Rβ − u has mean o and variance τ²I, but E[β] can be anything. If one considers both β and u fixed, then β̂ is a biased estimator whose properties depend on how close the true value of Rβ is to u.
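The closed form (37.0.57) can be checked against a direct GLS fit of the stacked system (37.0.56). The following minimal numpy sketch is not part of the original notes; the data, dimensions, and variable names are invented for illustration, and it uses the fact that GLS with covariance σ²·diag(I, κ⁻²I) is just OLS after the constraint rows are rescaled by κ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data, chosen only to exercise the formulas.
n, k = 50, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 1.0, 0.25
kappa2 = sigma2 / tau2                          # kappa^2 = sigma^2 / tau^2
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# One random linear restriction R beta = u + eta, eta ~ (o, tau^2 I).
R = np.array([[1.0, 1.0, 1.0]])
u = R @ beta_true + rng.normal(scale=np.sqrt(tau2), size=1)

# (37.0.57): the mixed estimator in closed form.
beta_mixed = np.linalg.solve(X.T @ X + kappa2 * R.T @ R,
                             X.T @ y + kappa2 * R.T @ u)

# GLS on the stacked model (37.0.56): scale the constraint rows by kappa, then run OLS.
kappa = np.sqrt(kappa2)
X_aug = np.vstack([X, kappa * R])
y_aug = np.concatenate([y, kappa * u])
beta_gls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_mixed, beta_gls))        # True: both routes give the same estimate
```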

Under the assumption of constant β and u, the MSE matrix of β̂ is smaller than that of the OLS β̂ if and only if the true parameter values β, u, and σ² satisfy

(37.0.58)   (Rβ − u)⊤ ( (2/κ²)I + R(X⊤X)⁻¹R⊤ )⁻¹ (Rβ − u) ≤ σ².
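If one trusts the reconstruction of (37.0.58) above, it can be checked numerically. The sketch below (not from the notes; all data invented, helper names mine) forms the exact MSE matrices of the OLS and mixed estimators for fixed β and u and compares positive semidefiniteness of their difference with the scalar condition (37.0.58).

```python
import numpy as np

rng = np.random.default_rng(7)

n, k = 25, 3
X = rng.normal(size=(n, k))
R = np.array([[1.0, -1.0, 0.0]])
beta = np.array([0.4, 0.1, -0.2])      # treated as the fixed true value
u = np.array([0.25])
sigma2, kappa2 = 1.0, 2.0

XtX = X.T @ X
A = XtX + kappa2 * R.T @ R
A_inv = np.linalg.inv(A)

# Exact MSE matrices for fixed beta and u.
mse_ols = sigma2 * np.linalg.inv(XtX)
bias = kappa2 * A_inv @ R.T @ (u - R @ beta)
mse_mixed = sigma2 * A_inv @ XtX @ A_inv + np.outer(bias, bias)

# Is MSE(OLS) - MSE(mixed) positive semidefinite?
psd = np.all(np.linalg.eigvalsh(mse_ols - mse_mixed) >= -1e-10)

# Condition (37.0.58).
d = R @ beta - u
M = (2.0 / kappa2) * np.eye(R.shape[0]) + R @ np.linalg.inv(XtX) @ R.T
condition = d @ np.linalg.solve(M, d) <= sigma2

print(psd, condition)                  # the two verdicts agree
```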


This condition is a simple extension of (29.6.6).

An estimator of the form β̂ = (X⊤X + κ²I)⁻¹X⊤y, where κ² is a constant, is called "ordinary ridge regression." Ridge regression can be considered the imposition of a random constraint even though it does not hold, again in an effort to trade bias for variance. This is similar to the imposition of a constraint which does not hold. An explanation of the term "ridge" given by [VU81, p. 170] is that the ridge solutions are near a ridge in the likelihood surface (at a point where the ridge is close to the origin). This ridge is drawn in [VU81, Figures 1.4a and 1.4b].

Problem 402. Derive from (37.0.58) the well-known formula that the MSE of ordinary ridge regression is smaller than that of the OLS estimator if and only if the true parameter vector satisfies

(37.0.59)   β⊤ ( (2/κ²)I + (X⊤X)⁻¹ )⁻¹ β ≤ σ².

Whatever the true values of β and σ² are, there is always a κ² > 0 for which (37.0.59) or (37.0.58) holds. The corresponding statement for the trace of the MSE-matrix has been one of the main justifications for ridge regression in [HK70b] and [HK70a], and much of the literature about ridge regression has been inspired by the hope that one can estimate κ² in such a way that the MSE is better everywhere. This is indeed done by the Stein rule.
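As a small illustration of the last claim, the following sketch (hypothetical design and "true" β, not from the notes; helper name mine) evaluates the left-hand side of (37.0.59) over a grid of κ² values; for small enough κ² the condition is satisfied.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 40, 4
X = rng.normal(size=(n, k))
beta = np.array([2.0, -1.0, 0.0, 3.0])            # treated as the unknown true value
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)

def lhs_37_0_59(kappa2):
    """Left-hand side of (37.0.59), evaluated at the true beta."""
    A = (2.0 / kappa2) * np.eye(k) + XtX_inv
    return beta @ np.linalg.solve(A, beta)

for kappa2 in (1e-3, 1e-2, 1e-1, 1.0, 10.0):
    print(kappa2, lhs_37_0_59(kappa2) <= sigma2)  # True for small enough kappa^2
```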

Ridge regression is reputed to be a good estimator when there is multicollinearity.

Problem 403. (Not eligible for in-class exams) Assume E[y] = µ, var(y) = σ², and you make n independent observations y_i. Then the best linear unbiased estimator of µ on the basis of these observations is the sample mean ȳ. For which range of values of α is MSE[αȳ; µ] < MSE[ȳ; µ]? Unfortunately, this range depends on µ and can therefore not be used to improve the estimate.

Answer.

(37.0.60)   MSE[αȳ; µ] = E[(αȳ − µ)²] = E[(αȳ − αµ + αµ − µ)²] < MSE[ȳ; µ] = var[ȳ]
(37.0.61)   α²σ²/n + (1 − α)²µ² < σ²/n

Now simplify it:

(37.0.62)   (1 − α)²µ² < (1 − α²)σ²/n = (1 − α)(1 + α)σ²/n

This cannot be true for α ≥ 1, because for α = 1 one has equality, and for α > 1 the right-hand side is negative. Therefore we are allowed to assume α < 1, and can divide by 1 − α without disturbing the inequality:

(37.0.63)   (1 − α)µ² < (1 + α)σ²/n
(37.0.64)   µ² − σ²/n < α(µ² + σ²/n)


The answer is therefore

(37.0.65)   (nµ² − σ²)/(nµ² + σ²) < α < 1.
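A quick numerical sanity check of (37.0.65), using arbitrary illustrative values of µ, σ², and n (not taken from the notes):

```python
# Arbitrary illustrative values.
mu, sigma2, n = 2.0, 4.0, 10

def mse_alpha(alpha):
    """MSE of alpha * ybar as an estimator of mu: variance plus squared bias."""
    return alpha**2 * sigma2 / n + (1.0 - alpha)**2 * mu**2

mse_ybar = sigma2 / n                                     # MSE of the sample mean itself
alpha_low = (n * mu**2 - sigma2) / (n * mu**2 + sigma2)   # lower end of the range (37.0.65)

for alpha in (alpha_low - 0.05, (alpha_low + 1.0) / 2.0, 0.999):
    print(round(alpha, 3), mse_alpha(alpha) < mse_ybar)   # False below the range, True inside
```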

Problem 404. (Not eligible for in-class exams) Assume y = Xβ + ε with ε ∼ (o, σ²I). If prior knowledge is available that Pβ lies in an ellipsoid centered around p, i.e., (Pβ − p)⊤Φ⁻¹(Pβ − p) ≤ h for some known positive definite symmetric matrix Φ and scalar h, then one might argue that the SSE should be minimized only for those β inside this ellipsoid. Show that this inequality-constrained minimization gives the same formula as OLS with a random constraint of the form κ²(Rβ − u) ∼ (o, σ²I) (where R and u are appropriately chosen constants, while κ² depends on y; you don't have to compute the precise values, simply indicate how R, u, and κ² should be determined).

Answer. The mixed estimator β = β∗ minimizes

(37.0.66)   (y − Xβ)⊤(y − Xβ) + κ⁴(Rβ − u)⊤(Rβ − u)
(37.0.67)   = (y − Xβ)⊤(y − Xβ) + κ⁴(Pβ − p)⊤Φ⁻¹(Pβ − p).

Choose κ² such that β∗ = (X⊤X + κ⁴P⊤Φ⁻¹P)⁻¹(X⊤y + κ⁴P⊤Φ⁻¹p) satisfies the inequality constraint with equality, i.e., (Pβ∗ − p)⊤Φ⁻¹(Pβ∗ − p) = h.


Now take any β that satisfies (Pβ − p)⊤Φ⁻¹(Pβ − p) ≤ h. Then

(37.0.68)   (y − Xβ∗)⊤(y − Xβ∗) = (y − Xβ∗)⊤(y − Xβ∗) + κ⁴(Pβ∗ − p)⊤Φ⁻¹(Pβ∗ − p) − κ⁴h

(because β∗ satisfies the inequality constraint with equality)

(37.0.69)   ≤ (y − Xβ)⊤(y − Xβ) + κ⁴(Pβ − p)⊤Φ⁻¹(Pβ − p) − κ⁴h

(because β∗ minimizes (37.0.67))

(37.0.70)   ≤ (y − Xβ)⊤(y − Xβ)

(because β satisfies the inequality constraint). Therefore β = β∗ also minimizes the SSE subject to the inequality constraint.
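A numerical illustration of this argument (not part of the notes; the data, the ellipsoid P, p, Φ, h, and the helper names are all invented): choose κ⁴ by bisection so that the penalized minimizer lies exactly on the ellipsoid boundary, then confirm that no randomly drawn feasible β achieves a smaller SSE. The sketch assumes the unconstrained OLS estimate falls outside the ellipsoid, so that the constraint binds.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical problem data.
n, k, q = 40, 3, 2
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
P = rng.normal(size=(q, k))
p = np.array([0.5, -0.5])
Phi = np.eye(q)
h = 0.3
Phi_inv = np.linalg.inv(Phi)

def beta_star(kappa4):
    """Minimizer of (37.0.67) for a given kappa^4."""
    return np.linalg.solve(X.T @ X + kappa4 * P.T @ Phi_inv @ P,
                           X.T @ y + kappa4 * P.T @ Phi_inv @ p)

def ellipsoid(b):
    d = P @ b - p
    return d @ Phi_inv @ d

def sse(b):
    r = y - X @ b
    return r @ r

# Bisection on kappa^4 so that the constraint holds with equality
# (assumes ellipsoid(beta_star(0)) > h, i.e. the constraint binds).
lo, hi = 0.0, 1.0
while ellipsoid(beta_star(hi)) > h:
    hi *= 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if ellipsoid(beta_star(mid)) > h else (lo, mid)
b_star = beta_star(0.5 * (lo + hi))

# No randomly drawn feasible beta should beat beta_star.
draws = b_star + rng.normal(scale=0.5, size=(5000, k))
feasible = [b for b in draws if ellipsoid(b) <= h]
print(all(sse(b) >= sse(b_star) - 1e-9 for b in feasible))   # True
```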


CHAPTER 38

Stein Rule Estimators

Problem 405. We will work with the regression model y = Xβ + ε with ε ∼ N(o, σ²I), which in addition is "orthonormal," i.e., the X-matrix satisfies X⊤X = I.

• a. 0 points. Write down the simple formula for the OLS estimator β̂ in this model. Can you think of situations in which such an "orthonormal" model is appropriate?

Answer. Since X⊤X = I, the OLS estimator is simply β̂ = X⊤y. Such a model is appropriate, for instance, if one regresses on orthogonal polynomials, or on principal components. I guess also if one simply needs the means of a random vector. It seems the important fact here is that one can order the regressors; if this is the case then one can always apply the Gram-Schmidt orthonormalization, which has the advantage that the jth orthonormalized regressor is a linear combination of the first j ordered regressors.


• b. 0 points. Assume one has Bayesian prior knowledge that β ∼ N(o, τ²I), and β independent of ε. In the general case, if the prior information is β ∼ N(ν, τ²A⁻¹), the Bayesian posterior mean is β̂_M = (X⊤X + κ²A)⁻¹(X⊤y + κ²Aν), where κ² = σ²/τ². Show that in the present case β̂_M is proportional to the OLS estimate β̂ with proportionality factor (1 − σ²/(τ² + σ²)), i.e.,

(38.0.71)   β̂_M = β̂ (1 − σ²/(τ² + σ²)).

Answer. One can either evaluate the general formula directly, or interpret it as a regression with a random constraint Rβ ∼ (o, τ²I) where R = I, which is mathematically the same as considering the known mean vector, i.e., the null vector, as additional observations. In either case one gets

(38.0.72)   β̂_M = (X⊤X + κ²A)⁻¹X⊤y = (X⊤X + κ²R⊤R)⁻¹X⊤y = (I + (σ²/τ²)I)⁻¹X⊤y = β̂ (1 − σ²/(τ² + σ²)),

i.e., it shrinks the OLS estimate β̂ = X⊤y.
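A minimal check (with a made-up orthonormal design; names and numbers are mine) that the posterior mean and the shrunken OLS estimate in (38.0.72) coincide numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 30, 5
sigma2, tau2 = 1.0, 4.0
kappa2 = sigma2 / tau2

X, _ = np.linalg.qr(rng.normal(size=(n, k)))         # orthonormal design: X'X = I
beta = rng.normal(scale=np.sqrt(tau2), size=k)       # beta ~ N(o, tau^2 I)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

beta_ols = X.T @ y                                   # OLS, since X'X = I
posterior_mean = np.linalg.solve((1.0 + kappa2) * np.eye(k), beta_ols)
shrunk_ols = beta_ols * (1.0 - sigma2 / (tau2 + sigma2))   # right-hand side of (38.0.72)

print(np.allclose(posterior_mean, shrunk_ols))       # True
```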

• c. 0 points. Formula (38.0.71) can only be used for estimation if the ratio σ²/(τ² + σ²) is known. This is usually not the case, but it is possible to estimate both σ² and τ² + σ² from the data. The use of such estimates instead of the actual values of σ² and τ² in the Bayesian formulas is sometimes called "empirical Bayes."


Show that E[β̂⊤β̂] = k(τ² + σ²), and that E[y⊤y − β̂⊤β̂] = (n − k)σ², where n is the number of observations and k is the number of regressors.

Answer. Since β̂ = X⊤y = β + X⊤ε, one has β̂ ∼ (o, (σ² + τ²)I) (where we now have a k-dimensional identity matrix), therefore E[β̂⊤β̂] = k(σ² + τ²). Furthermore, since My = Mε regardless of whether β is random or not, σ² can be estimated in the usual manner from the SSE: (n − k)σ² = E[ε̂⊤ε̂] = E[y⊤My] = E[y⊤y − β̂⊤β̂], because M = I − XX⊤ and therefore y⊤My = y⊤y − (X⊤y)⊤(X⊤y) = y⊤y − β̂⊤β̂.
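The two expectations can also be checked by simulation. The following sketch (hypothetical dimensions and variances, not from the notes) draws β and ε repeatedly and compares the Monte Carlo averages with k(τ² + σ²) and (n − k)σ²:

```python
import numpy as np

rng = np.random.default_rng(3)

n, k, sigma2, tau2 = 30, 5, 1.0, 4.0
X, _ = np.linalg.qr(rng.normal(size=(n, k)))         # orthonormal design: X'X = I
reps = 20000

bb = np.empty(reps)       # realizations of beta_hat' beta_hat
sse = np.empty(reps)      # realizations of y'y - beta_hat' beta_hat
for r in range(reps):
    beta = rng.normal(scale=np.sqrt(tau2), size=k)   # beta ~ N(o, tau^2 I)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat = X.T @ y
    bb[r] = beta_hat @ beta_hat
    sse[r] = y @ y - bb[r]

print(bb.mean(), k * (tau2 + sigma2))                # both approximately 25
print(sse.mean(), (n - k) * sigma2)                  # both approximately 25
```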

• d. 0 points. If one plugs the unbiased estimates of σ² and τ² + σ² from part (c) into (38.0.71), one obtains a version of the so-called "James and Stein" estimator

(38.0.73)   β̂_JS = β̂ (1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂)).

What is the value of the constant c if one follows the above instructions? (This estimator has become famous because for k ≥ 3 and c any number between 0 and 2(k − 2)/(n − k + 2), the estimator (38.0.73) has a uniformly lower MSE than the OLS β̂, where the MSE is measured as the trace of the MSE-matrix.)
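A sketch of (38.0.73) in code, with invented data and an illustrative choice of c; the value c = (k − 2)/(n − k + 2) used here is the classical risk-minimizing constant for shrinkage of this form, not necessarily the empirical-Bayes value the problem asks for.

```python
import numpy as np

def james_stein(y, X, c):
    """Stein-rule estimator (38.0.73), assuming an orthonormal design (X'X = I)."""
    beta_hat = X.T @ y                         # OLS estimate in the orthonormal model
    sse = y @ y - beta_hat @ beta_hat          # y'y - beta_hat'beta_hat
    return beta_hat * (1.0 - c * sse / (beta_hat @ beta_hat))

# Illustrative data (not from the notes).
rng = np.random.default_rng(4)
n, k = 30, 5
X, _ = np.linalg.qr(rng.normal(size=(n, k)))
beta = np.array([0.3, -0.2, 0.1, 0.0, 0.4])
y = X @ beta + rng.normal(size=n)

c = (k - 2) / (n - k + 2)                      # classical risk-minimizing choice
print(james_stein(y, X, c))
```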

• e. 0 points. The existence of the James and Stein estimator proves that the OLS estimator is "inadmissible." What does this mean? Can you explain why the OLS estimator turns out to be deficient exactly where it ostensibly tries to be strong? What are the practical implications of this?

The properties of this estimator were first discussed in James and Stein [JS61], extending the work of Stein [Ste56].

Stein himself did not introduce the estimator as an "empirical Bayes" estimator, and it is not certain that this is indeed the right way to look at it. In particular, this approach does not explain why the OLS cannot be uniformly improved upon if k ≤ 2. But it is a possible and interesting way to look at it: if one pretends one has prior information, but does not really have it and instead "steals" it from the data, this "fraud" can still be successful.

Another interpretation is that these estimators are shrunk versions of unbiased estimators, and unbiased estimators always get better if one shrinks them a little. The only problem is that one cannot shrink them too much, and in the case of the normal distribution, the amount by which one has to shrink them depends on the unknown parameters. If one estimates the shrinkage factor, one usually does not know whether the noise introduced by this estimated factor is greater or smaller than the savings. But in the case of the Stein rule, the noise is smaller than the savings.

Problem 406. 0 points. Return to the "orthonormal" model y = Xβ + ε with ε ∼ N(o, σ²I) and X⊤X = I. With the usual assumption of nonrandom β (and no prior information about β), show that the F-statistic for the hypothesis β = o is

F = (β̂⊤β̂ / k) / ((y⊤y − β̂⊤β̂) / (n − k)).

Answer. Use the equation for the test statistic.

• a. 0 points. Now look at the following "pre-test estimator": your estimate of β is the null vector o if the value of the F-statistic for the test β = o is equal to or smaller than 1, and your estimate of β is the OLS estimate β̂ if the test statistic has a value bigger than 1. Mathematically, this estimator can be written in the form β̂ I(F), where F is the F-statistic derived in part (1) of this question, and I(F) is the "indicator function" for F > 1, i.e., I(F) = 0 if F ≤ 1 and I(F) = 1 if F > 1. Now modify this pre-test estimator by using the following function I(F) instead: I(F) = 0 if F ≤ 1 and I(F) = 1 − 1/F if F > 1. This is no longer an indicator function, but it can be considered a continuous approximation to one. Since the discontinuity is removed, one can expect that it has, under certain circumstances, better properties than the indicator function itself. Write down the formula for this modified pre-test estimator. How does it differ from the Stein rule estimator (38.0.73) (with the value for c coming from the empirical Bayes approach)? Which estimator would you expect to be better, and why?

Answer.

(38.0.75)   β̂_JS+ = o   if 1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂) < 0,   and   β̂_JS+ = β̂ (1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂))   otherwise.

It is equal to the Stein-rule estimator (38.0.73) when the estimated shrinkage factor 1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂) is positive, but the shrinkage factor is set to 0 instead of turning negative. This is why it is commonly called the "positive part" Stein-rule estimator. Stein conjectured early on, and Baranchik [Bar64] showed, that it is uniformly better than the Stein-rule estimator.
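The following Monte Carlo sketch (all settings hypothetical, names mine) compares the estimated risks of OLS, the Stein-rule estimator (38.0.73), and its positive-part version (38.0.75). In this weak-signal configuration the positive-part estimator comes out best, consistent with Baranchik's result.

```python
import numpy as np

rng = np.random.default_rng(5)

n, k = 30, 5
X, _ = np.linalg.qr(rng.normal(size=(n, k)))        # orthonormal design: X'X = I
beta = np.array([0.2, -0.1, 0.0, 0.1, 0.2])         # weak signal, where shrinkage pays off
c = (k - 2) / (n - k + 2)                           # a value inside the dominance range
reps = 20000

risk = {"ols": 0.0, "stein": 0.0, "stein+": 0.0}
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)               # sigma^2 = 1
    b = X.T @ y                                     # OLS estimate
    factor = 1.0 - c * (y @ y - b @ b) / (b @ b)    # estimated shrinkage factor
    estimates = {"ols": b, "stein": b * factor, "stein+": b * max(factor, 0.0)}
    for name, est in estimates.items():
        risk[name] += np.sum((est - beta) ** 2) / reps

print(risk)   # expect roughly risk["stein+"] <= risk["stein"] < risk["ols"]
```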

• b. 0 points. Which lessons can one draw about pre-test estimators in general from this exercise?

Stein-rule estimators have not been used very much; they are not equivariant, and the shrinkage seems arbitrary. Discussing them here brings out two things: the formulas for random constraints etc. are a pattern according to which one can build good operational estimators, and some widely used but seemingly ad hoc procedures like pre-testing may have deeper foundations and better properties than the halfway sophisticated researcher may think.


Problem 407. 6 points. Why was it somewhat of a sensation when Charles Stein came up with an estimator which is uniformly better than the OLS? Discuss the Stein-rule estimator as an empirical Bayes estimator and as a shrinkage estimator, and discuss the "positive part" Stein-rule estimator as a modified pre-test estimator.
