Class Notes in Statistics and Econometrics, Part 28

CHAPTER 55

Numerical Minimization

Given an objective function f of a parameter vector θ, the task is to find a minimum argument θ̂, i.e., a value θ̂ with f(θ̂) ≤ f(θ) for all θ.

The numerical methods to find this minimum argument are usually recursive: the computer is given a starting value θ_0, uses it to compute θ_1, then it uses θ_1 to compute θ_2, and so on, constructing a sequence θ_1, θ_2, … that converges towards a minimum argument. If convergence occurs, this minimum is usually a local minimum, and often one is not sure whether there is not another, better, local minimum somewhere else.

At every step, the computer makes two decisions, which can be symbolized as

θ_{i+1} = θ_i + α_i d_i.

Here d_i, a vector, is the step direction, and α_i, a scalar, is the step size. The choice of the step direction is the main characteristic of the program. Most programs (notable exception: simulated annealing) always choose directions at every step along which the objective function slopes downward, so that one will get lower values of the objective function for small increments in that direction. The step size is then chosen such that the objective function actually decreases. In elaborate cases, the step size is chosen to be that traveling distance in the step direction which gives the best improvement in the objective function, but it is not always efficient to spend this much time on the step size.
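As an illustration of the simpler rule (accept any step size for which the objective actually decreases), here is a minimal step-size routine in Python. It is a sketch of my own, not taken from the notes; the function and parameter names are made up.

import numpy as np

def choose_step(f, theta, d, alpha=1.0, shrink=0.5, max_tries=30):
    """Halve a trial step size until f(theta + alpha*d) < f(theta).

    d is assumed to be a descent direction, so a small enough alpha
    will produce a decrease (up to numerical precision)."""
    f0 = f(theta)
    for _ in range(max_tries):
        if f(theta + alpha * d) < f0:
            return alpha
        alpha *= shrink
    return 0.0   # signal failure: no decrease found

# tiny usage example: f(theta) = theta'theta, steepest-descent direction d = -2*theta
f = lambda th: float(th @ th)
th = np.array([3.0, -4.0])
print(choose_step(f, th, -2.0 * th))   # prints 0.5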

Let us take a closer look at how to determine the step direction. If g_i^⊤ = (g(θ_i))^⊤ is the Jacobian of f at θ_i, i.e., the row vector consisting of the partial derivatives of f, then the objective function will slope down along direction d_i if the scalar product g_i^⊤ d_i is negative. In determining the step direction, the following fact is useful: all vectors d for which g^⊤ d < 0 can be obtained by premultiplying the transpose of the negative Jacobian, i.e., the negative gradient vector −g_i, by an appropriate positive definite matrix R_i.

Problem 492. 4 points. Here is a proof for those who are interested in this issue: Prove that g^⊤ d < 0 if and only if d = −Rg for some positive definite symmetric matrix R. Hint: to prove the "only if" part use R = I − gg^⊤/(g^⊤ g) − dd^⊤/(d^⊤ g). This formula is from [Bar74, p. 86]. To prove that R is positive definite, note that R = Q + S with both Q = I − gg^⊤/(g^⊤ g) and S = −dd^⊤/(d^⊤ g) nonnegative definite. It is therefore sufficient to show that any x ≠ o for which x^⊤ Q x = 0 satisfies x^⊤ S x > 0.

Answer. If R is positive definite, then d = −Rg clearly satisfies d^⊤ g < 0. Conversely, for any d satisfying d^⊤ g < 0, define R = I − gg^⊤/(g^⊤ g) − dd^⊤/(d^⊤ g). One immediately checks that d = −Rg. To prove that R is positive definite, note that R is the sum of two nonnegative definite matrices Q = I − gg^⊤/(g^⊤ g) and S = −dd^⊤/(d^⊤ g). It is therefore sufficient to show that any x ≠ o for which x^⊤ Q x = 0 satisfies x^⊤ S x > 0. Indeed, if x^⊤ Q x = 0, then already Qx = o, which means x = g g^⊤ x/(g^⊤ g), i.e., x is a nonzero multiple of g; then x^⊤ d ≠ 0 because g^⊤ d < 0, and therefore x^⊤ S x = −(x^⊤ d)²/(d^⊤ g) > 0.
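A quick numerical illustration of this construction (my own check, not part of the notes):

import numpy as np

# For any d with g'd < 0, the matrix R = I - gg'/(g'g) - dd'/(d'g) of Problem 492
# is symmetric positive definite and satisfies -Rg = d.
rng = np.random.default_rng(0)
g = rng.normal(size=5)
d = rng.normal(size=5)
if g @ d >= 0:                  # flip d if necessary so that g'd < 0
    d = -d

R = np.eye(5) - np.outer(g, g) / (g @ g) - np.outer(d, d) / (d @ g)
print(np.allclose(-R @ g, d))              # True: the construction recovers d
print(np.linalg.eigvalsh(R).min() > 0)     # True: R is positive definite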


Many important numerical methods, the so-called gradient methods [KG80, p. 430], use exactly this principle: they find the step direction d_i by premultiplying −g_i by some positive definite R_i, i.e., they use the recursion equation

(55.0.12)  θ_{i+1} = θ_i − α_i R_i g_i

The most important ingredient here is the choice of R_i. We will discuss two "natural" choices.

The choice which immediately comes to mind is to set R_i = I, i.e., d_i = −α_i g_i. Since the gradient vector shows into the direction where the slope is steepest, this is called the method of steepest descent. However this choice is not as natural as one might first think. There is no benefit to finding the steepest direction, since one can easily increase the step length. It is much more important to find a direction which allows one to go down for a long time, and for this one should also consider how the gradient is changing. The fact that the direction of steepest descent changes if one changes the scaling of the variables is another indication that selecting the steepest descent is not a natural criterion.
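A minimal sketch of steepest descent in Python (my own illustration, with a made-up badly scaled quadratic as the objective); the slow, zig-zagging progress on this example is exactly the scaling problem described above.

import numpy as np

def steepest_descent(f, grad, theta, steps=100, alpha0=1.0, shrink=0.5):
    """R_i = I: step along the negative gradient, halving the trial step
    size until the objective decreases (illustrative sketch only)."""
    for _ in range(steps):
        g = grad(theta)
        alpha, f0 = alpha0, f(theta)
        while f(theta - alpha * g) >= f0 and alpha > 1e-12:
            alpha *= shrink
        theta = theta - alpha * g
    return theta

# badly scaled quadratic: the gradient direction is a poor guide here
f = lambda th: th[0]**2 + 100.0 * th[1]**2
grad = lambda th: np.array([2.0 * th[0], 200.0 * th[1]])
print(steepest_descent(f, grad, np.array([1.0, 1.0])))   # approaches (0, 0) only slowly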

The most "natural" choice for R_i is the inverse of the "Hessian matrix" G(θ_i), which is the matrix of second partial derivatives of f, evaluated at θ_i. This is called the Newton-Raphson method. If the inverse Hessian is positive definite, the Newton-Raphson method amounts to making a Taylor development of f around the so far best point θ_i, breaking this Taylor development off after the quadratic term (so that one gets a quadratic function which at point θ_i has the same first and second derivatives as the given objective function), and choosing θ_{i+1} to be the minimum point of this quadratic approximation to the objective function.

Here is a proof that one accomplishes all this if R_i is the inverse Hessian. The quadratic approximation (second order Taylor development) of f around θ_i is

f(θ) ≈ f(θ_i) + g_i^⊤(θ − θ_i) + ½ (θ − θ_i)^⊤ G(θ_i)(θ − θ_i).

By Theorem 55.0.1 below, applied with z = θ − θ_i, this quadratic function is minimized by θ_{i+1} = θ_i − (G(θ_i))^{-1} g_i, which is the above procedure with step size 1 and R_i = (G(θ_i))^{-1}.
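A compact sketch of one such update in Python (my own illustration; the quadratic test objective below is made up). On a quadratic objective a single Newton-Raphson step lands exactly on the minimum.

import numpy as np

def newton_raphson_step(grad, hess, theta):
    """One Newton-Raphson update: minimize the quadratic approximation at theta,
    i.e. theta_new = theta - G(theta)^{-1} g(theta).  Assumes G(theta) is
    positive definite (illustrative sketch only)."""
    return theta - np.linalg.solve(hess(theta), grad(theta))

A = np.array([[3.0, 1.0], [1.0, 2.0]])          # positive definite Hessian
b = np.array([1.0, -1.0])
grad = lambda th: A @ th + b                    # gradient of 0.5 th'A th + b'th
hess = lambda th: A
print(newton_raphson_step(grad, hess, np.array([5.0, 5.0])))   # equals -A^{-1} b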

Theorem 55.0.1. Let G be an n × n positive definite matrix, and g an n-vector. Then the minimum argument of the function

(55.0.15)  q: z ↦ g^⊤ z + ½ z^⊤ G z

is x = −G^{-1} g.

Proof: Since Gx = −g, it follows for any z that

q(z) = q(x) + ½ (z − x)^⊤ G(z − x).

This is minimized by z = x.

The Newton-Raphson method requires the Hessian matrix. [KG80] recommend establishing mathematical formulas for the derivatives, which are then evaluated at θ_i, since it is very tricky and imprecise to compute derivatives and the Hessian numerically. The analytical derivatives, on the other hand, are time consuming, and the computation of these derivatives may be subject to human error. However there are computer programs which automatically compute such derivatives; Splus, for instance, has the deriv function, which automatically constructs functions which are the derivatives or gradients of given functions.
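As a rough present-day analogue (my own substitution, not what the notes describe), symbolic differentiation packages do the same job; the sketch below uses Python's SymPy library with a made-up objective function.

import sympy as sp

# Build the gradient and Hessian of a made-up objective symbolically, then turn
# them into ordinary numerical functions (analogous in spirit to Splus's deriv).
t1, t2 = sp.symbols('theta1 theta2')
f_expr = (t1 - 1)**2 + sp.exp(t1 * t2) + t2**4

grad_expr = [sp.diff(f_expr, v) for v in (t1, t2)]
hess_expr = sp.hessian(f_expr, (t1, t2))

grad = sp.lambdify((t1, t2), grad_expr, 'numpy')
hess = sp.lambdify((t1, t2), hess_expr, 'numpy')
print(grad(0.5, -0.5))
print(hess(0.5, -0.5))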

The main drawback of the Newton-Raphson method is that G(θ_i) is only positive definite if the function is strictly convex. This will be the case when θ_i is close to a minimum, but if one starts too far away from a minimum, the Newton-Raphson method may not converge.


There are many modifications of the Newton-Raphson method which get around computing the Hessian and inverting it at every step, and at the same time ensure that the matrix R_i is always positive definite, by using an updating formula for R_i which turns R_i, after sufficiently many steps, into the inverse Hessian. These are probably the most often used methods. A popular one, used by the GAUSS software, is the Davidon-Fletcher-Powell algorithm.

One drawback of all these methods using matrices is the fact that the size of the matrix R_i increases with the square of the number of variables. For problems with large numbers of variables, memory limitations in the computer make it necessary to use methods which do without such a matrix. A method to do this is the "conjugate gradient method." If it is too difficult to compute the gradient vector, the "conjugate direction method" may also compare favorably with computing the gradient numerically.

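Routines of both kinds are available in standard numerical libraries today. The sketch below (my own illustration, assuming SciPy is installed) minimizes a small made-up test function once with BFGS, an updating method in the same family as Davidon-Fletcher-Powell, and once with the conjugate gradient method.

import numpy as np
from scipy.optimize import minimize

# Rosenbrock test function and its analytical gradient
def f(theta):
    return (1.0 - theta[0])**2 + 100.0 * (theta[1] - theta[0]**2)**2

def grad(theta):
    return np.array([
        -2.0 * (1.0 - theta[0]) - 400.0 * theta[0] * (theta[1] - theta[0]**2),
        200.0 * (theta[1] - theta[0]**2),
    ])

theta0 = np.array([-1.0, 2.0])
for method in ("BFGS", "CG"):      # quasi-Newton update vs. conjugate gradient
    res = minimize(f, theta0, jac=grad, method=method)
    print(method, res.x, res.fun)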

CHAPTER 56

Nonlinear Least Squares

This chapter ties immediately into chapter 55 about Numerical Minimization. The notation is slightly different; what we called f is now called SSE, and what we called θ is now called β. A much more detailed discussion of all this is given in [DM93, Chapter 6], which uses the notation x(β) instead of our η(β). [Gre97, Chapter 10] defines the vector function η(β) by η_t(β) = h(x_t, β), i.e., all elements of the vector function η have the same functional form h but differ by the values of the additional arguments x_t. [JHG+88, Chapter (12.2.2)] set it up in the same way as [Gre97], but they call the function f instead of h.


An additional important "natural" choice for R_i is available if the objective function has the nonlinear least squares form, i.e., if the model is

(56.0.21)  y = η(β) + ε

with a vector function η(β) whose components are η_1(β_1, β_2, …, β_k), …, η_n(β_1, β_2, …, β_k).

Instead of the linear least squares model

(56.0.22)  y_1 = x_11 β_1 + x_12 β_2 + … + x_1k β_k + ε_1
(56.0.23)  y_2 = x_21 β_1 + x_22 β_2 + … + x_2k β_k + ε_2
(56.0.24)       ⋮

one now has the nonlinear model

           y_1 = η_1(β_1, β_2, …, β_k) + ε_1
(56.0.28)       ⋮
(56.0.29)  y_n = η_n(β_1, β_2, …, β_k) + ε_n

Usually there are other independent variables involved in η which are not shown here explicitly because they are not needed for the results proved here.

Problem 493. 4 points. [Gre97, 10.1 on p. 494] Describe as precisely as you can how you would estimate the model y_t = α x_t^β + ε_t.

Answer. There are only two parameters to minimize over, and for every given β it is easy to get the α which minimizes the objective function with the given β fixed, namely, the coefficient of the regression of y on x^β, α̂(β) = Σ_t x_t^β y_t / Σ_t x_t^{2β}. After you have the point estimates α̂ and β̂, write y_t = η_t + ε_t and construct the pseudoregressors ∂η_t/∂α = x_t^β̂ and ∂η_t/∂β = α̂ (log x_t) x_t^β̂. If you regress the residuals on the pseudoregressors you will get parameter estimates zero (if the estimates α̂ and β̂ are good), but you will get the right standard errors.
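A numerical sketch of this recipe on simulated data (my own illustration; the sample size, true parameter values, and search grid are all made up): concentrate out α, search over β, then regress the residuals on the pseudoregressors for standard errors.

import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 3.0, size=n)
y = 2.0 * x**1.5 + rng.normal(scale=0.5, size=n)     # true alpha = 2, beta = 1.5

def alpha_hat(beta):
    # OLS of y on x**beta: the alpha minimizing SSE for this fixed beta
    return (x**beta @ y) / (x**beta @ x**beta)

def sse(beta):
    return np.sum((y - alpha_hat(beta) * x**beta) ** 2)

# crude one-dimensional search over beta (a grid, for illustration only)
betas = np.linspace(0.5, 2.5, 2001)
beta_hat = betas[np.argmin([sse(b) for b in betas])]
a_hat = alpha_hat(beta_hat)

# pseudoregressors and residuals at the point estimates
X = np.column_stack([x**beta_hat, a_hat * np.log(x) * x**beta_hat])
e = y - a_hat * x**beta_hat
coef, *_ = np.linalg.lstsq(X, e, rcond=None)     # approximately zero
se = np.sqrt(np.diag((e @ e / n) * np.linalg.inv(X.T @ X)))
print(a_hat, beta_hat, coef, se)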



Next we will derive the first-order conditions, and then describe how to run the linearized Gauss-Newton regression. For this we need some notation. For an arbitrary but fixed vector β_i (below it will be the ith approximation to the nonlinear least squares parameter estimate) we will denote the Jacobian matrix of the function η evaluated at β_i with the symbol X(β_i), i.e., X(β_i) = ∂η(β)/∂β^⊤ (β_i). X(β_i) is called the matrix of pseudoregressors at β_i. The mh-th element of X(β_i) is ∂η_m/∂β_h (β_i), i.e., X(β_i) is the matrix of partial derivatives evaluated at β_i; but X(β_i) should first and foremost be thought of as the coefficient matrix of the best linear approximation of the function η at the point β_i. In other words, it is the matrix which appears in the Taylor expansion of η(β) around β_i:

(56.0.34)  η(β) = η(β_i) + X(β_i)(β − β_i) + higher order terms


Now let us compute the Jacobian of the objective function itself,

(56.0.35)  SSE = (y − η(β))^⊤(y − η(β)) = ε̂^⊤ ε̂   where   ε̂ = y − η(β).

This Jacobian is a row vector because the objective function is a scalar function. We need the chain rule (C.1.23) to compute it. In the present situation it is useful to break our function into three pieces and apply the chain rule for three steps:

(56.0.36)  ∂SSE/∂β^⊤ = ∂SSE/∂ε̂^⊤ · ∂ε̂/∂η^⊤ · ∂η/∂β^⊤ = 2ε̂^⊤ · (−I) · X(β) = −2(y − η(β))^⊤ X(β)
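A quick numerical check of (56.0.36) in Python (my own illustration; the particular η below, η_t(β) = exp(w_t^⊤ β), and all the data are made up for the check):

import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
W = rng.normal(size=(n, k))                      # data entering the illustrative eta
y = rng.normal(size=n)
beta = rng.normal(size=k)

eta = lambda b: np.exp(W @ b)                    # eta_t(beta) = exp(w_t' beta)
X_of = lambda b: np.exp(W @ b)[:, None] * W      # pseudoregressor matrix X(beta)
sse = lambda b: (y - eta(b)) @ (y - eta(b))

analytic = -2.0 * (y - eta(beta)) @ X_of(beta)   # the row vector from (56.0.36)
h = 1e-5
numeric = np.array([(sse(beta + h * e) - sse(beta - h * e)) / (2.0 * h)
                    for e in np.eye(k)])
print(analytic)
print(numeric)    # agrees with the analytic Jacobian to several digits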

Problem 494. 3 points. Compute the Jacobian of the nonlinear least squares objective function

(56.0.37)  SSE = (y − η(β))^⊤(y − η(β))

where η(β) is a vector function of a vector argument. Do not use matrix differentiation but compute it element by element and then verify that it is the same as equation (56.0.36).

Answer. Writing SSE = Σ_t (y_t − η_t(β))^2, the partial derivative with respect to the hth element of β is

(56.0.40)  ∂SSE/∂β_h = −2 Σ_t (y_t − η_t(β)) ∂η_t/∂β_h,

which is the hth element of the row vector (56.0.36).


The gradient vector is the transpose of (56.0.36):

(56.0.44)  g(β) = −2 X^⊤(β)(y − η(β))

Setting this to zero gives the first order conditions

(56.0.45)  X^⊤(β) η(β) = X^⊤(β) y

It is a good idea to write down these first order conditions and to check whether some of them can be solved for the respective parameter, i.e., whether some parameters can be concentrated out.

Plugging (56.0.34) into (56.0.21) and rearranging gives the regression equation

(56.0.46)  y − η(β_i) = X(β_i)(β − β_i) + error term

Here the left-hand side is known (it can be written ε̂_i, the residual associated with the vector β_i), we observe y, and β_i is the so far best approximation to the minimum argument. The matrix of "pseudoregressors" X(β_i) is known, but the coefficient δ_i = β − β_i is not known (because we do not know β) and must be estimated. The error term contains the higher order terms in (56.0.34) plus the vector of random disturbances in (56.0.21). This regression is called the Gauss-Newton regression (GNR) at β_i. [Gre97, (10-8) on p. 452] writes it as

(56.0.47)  y − η(β_i) + X(β_i) β_i = X(β_i) β + error term


Problem 495. 6 points. [DM93, p. 178], which is very similar to [Gre97, (10-2) on p. 450]: You are estimating by nonlinear least squares the model

(56.0.48)  y_t = α + β x_t + γ z_t^δ + ε_t   or   y = αι + βx + γz^δ + ε.

You are using the iterative Newton-Raphson algorithm.

• a. In the ith step you have obtained the vector of estimates (α̂, β̂, γ̂, δ̂)^⊤. Write down the matrix X of pseudoregressors, the first order conditions, the Gauss-Newton regression at the given parameter values, and the updated estimate β̂_{i+1}.

Answer. The matrix of pseudoregressors is, column by column,

X = [∂η/∂α   ∂η/∂β   ∂η/∂γ   ∂η/∂δ] = [ι   x   z^δ   γ log(z)·z^δ],

evaluated at the current estimates, where powers, logarithms, and products of the vector z are taken element by element.

Write the first order conditions (56.0.45) in the form X^⊤(β)(y − η(β)) = o, which gives here

(56.0.52)  Σ_t (y_t − α − β x_t − γ z_t^δ) = 0
(56.0.53)  Σ_t x_t (y_t − α − β x_t − γ z_t^δ) = 0
(56.0.54)  Σ_t z_t^δ (y_t − α − β x_t − γ z_t^δ) = 0
(56.0.55)  γ Σ_t log(z_t) z_t^δ (y_t − α − β x_t − γ z_t^δ) = 0

which is very similar to [Gre97, Example 10.1 on p. 451]. These element-by-element first order conditions can also be easily derived as the partial derivatives of SSE = Σ_t (y_t − α − β x_t − γ z_t^δ)^2. The Gauss-Newton regression (56.0.46) is the regression of the residuals on the columns of the Jacobian:

(56.0.56)  y_t − α̂ − β̂ x_t − γ̂ z_t^δ̂ = a + b x_t + c z_t^δ̂ + d γ̂ log(z_t) z_t^δ̂ + error term

and from this one gets an updated estimate as

β̂_{i+1} = (α̂ + â, β̂ + b̂, γ̂ + ĉ, δ̂ + d̂)^⊤.

• b. How would you obtain the starting value for the Newton-Raphson algorithm?

Answer. One possible set of starting values would be to set δ̂ = 1 and to get α̂, β̂, and γ̂ from the linear regression of y on ι, x, and z (with δ = 1 the model is linear in the remaining parameters).

The Gauss-Newton algorithm runs this regression and uses the OLS estimate δ̂_i of δ_i to define β_{i+1} = β_i + δ̂_i. The recursion formula is therefore

(56.0.59)  β_{i+1} = β_i + δ̂_i = β_i + ((X(β_i))^⊤ X(β_i))^{-1} (X(β_i))^⊤ (y − η(β_i)).

The notation (η(β))^⊤ = η^⊤(β) and (X(β))^⊤ = X^⊤(β) makes this perhaps a little easier to read:

(56.0.60)  β_{i+1} = β_i + (X^⊤(β_i) X(β_i))^{-1} X^⊤(β_i)(y − η(β_i))


This is [Gre97, last equation on p. 455].

A look at (56.0.44) shows that (56.0.60) is again a special case of the general principle (55.0.12), i.e., β_{i+1} = β_i − α_i R_i g_i, with R_i = (X^⊤(β_i) X(β_i))^{-1} and α_i = 1/2.
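A sketch of this iteration in Python for the model of Problem 495, y_t = α + βx_t + γz_t^δ + ε_t (my own illustration on simulated data; the sample size, true parameter values, and convergence tolerance are made up):

import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
z = rng.uniform(1.0, 4.0, size=n)
y = 1.0 + 0.5 * x + 2.0 * z**0.7 + rng.normal(scale=0.3, size=n)

def eta(b):
    alpha, beta, gamma, delta = b
    return alpha + beta * x + gamma * z**delta

def pseudoregressors(b):
    alpha, beta, gamma, delta = b
    zd = z**delta
    return np.column_stack([np.ones(n), x, zd, gamma * np.log(z) * zd])

# starting values as in part b: delta = 1, the rest from OLS of y on (1, x, z)
X0 = np.column_stack([np.ones(n), x, z])
b = np.append(np.linalg.lstsq(X0, y, rcond=None)[0], 1.0)

for _ in range(50):
    X = pseudoregressors(b)                                 # X(beta_i)
    step, *_ = np.linalg.lstsq(X, y - eta(b), rcond=None)   # GNR coefficients delta_i
    b = b + step                                            # update (56.0.60)
    if np.max(np.abs(step)) < 1e-10:
        break
print(b)    # close to the true values (1.0, 0.5, 2.0, 0.7)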

About the bad convergence properties of Gauss-Newton see [Thi88, pp. 213–215]. Although the Gauss-Newton regression had initially been introduced as a numerical tool, it was soon realized that this regression is very important. Read [DM93, Chapter 6] about this.

If one runs the GNR at the minimum argument of the nonlinear least squares objective function, then the coefficients estimated by the GNR are zero, i.e., the adjustments to the minimum argument δ̂ = β_{i+1} − β_i are zero.

How can a regression be useful whose outcome we know beforehand? Several points: If the estimated parameters turn out not to be zero after all, then β∗ was not really a minimum, i.e., the GNR serves as a check of the minimization procedure which one has used. One can also use regression diagnostics on the GNR in order to identify influential observations. The covariance matrix produced by the regression printout is an asymptotic covariance matrix of the NLS estimator. One can check for collinearity. If β∗ is a restricted NLS estimate, then the GNR yields Lagrange multiplier tests for the restriction, or tests whether more variables should be added, or specification tests.


Properties of NLS estimators: If X_0 is the matrix of pseudoregressors computed at the true parameter values, one needs the condition that plim (1/n) X_0^⊤ X_0 = Q_0 exists and is positive definite. For consistency we need plim (1/n) X_0^⊤ ε = o, and for asymptotic normality (1/√n) X_0^⊤ ε → N(o, σ² Q_0). σ̂² = SSE(β̂)/n is a consistent estimate of σ² (a degrees of freedom correction, i.e., dividing by n − k instead of n, has no virtue here since the results are valid only asymptotically). Furthermore, σ̂² (X^⊤(β̂) X(β̂))^{-1} is a consistent estimate of the asymptotic covariance matrix.
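In code, with X the matrix of pseudoregressors evaluated at β̂ and e the vector of NLS residuals (for instance from the Gauss-Newton sketch after (56.0.60)), this estimate is a one-liner; the helper name below is my own:

import numpy as np

def nls_covariance(X, e):
    """sigma2_hat * (X'X)^{-1} with sigma2_hat = SSE/n, as in the text.
    X: pseudoregressors at beta_hat; e: NLS residuals y - eta(beta_hat)."""
    sigma2 = (e @ e) / len(e)          # no degrees-of-freedom correction
    return sigma2 * np.linalg.inv(X.T @ X)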

56.1 The J Test

Start out with two non-nested hypotheses about the data:

(56.1.1)  H_0: y = η_0(β) + ε_0
