Flexible Functional Form
So far we have assumed that the mean of the dependent variable is a linear function of the explanatory variables. In this chapter, this assumption will be relaxed. We first discuss the case where the explanatory variables are categorical variables. For categorical variables (gender, nationality, occupations, etc.), the concept of linearity does not make sense, and indeed, it is customary to fit arbitrary numerical functions of these categorical variables. One can do this also if one has numerical variables which assume only a limited number of values (such as the number of people in a household). As long as there are repeated observations for each level of these variables, it is possible to introduce a different dummy variable for every level, and in this way also allow arbitrary functions. Linear restrictions between the coefficients of these dummies can be interpreted as the selection of more restricted functional spaces.
45.1 Categorical Variables: Regression with Dummies and Factors
If the explanatory variables are categorical, then it is customary to fit arbitrary functions of these variables. This can be done with the use of dummy variables, or by the use of variables coded as "factors." If there are more than two categories, you need several regressors taking only the values 0 and 1, which is why they are called "dummy variables." One regressor with several levels 0, 1, 2, etc. is too restrictive. The "factor" data type in R allows one to code several levels in one variable, which will automatically be expanded into a set of dummy variables. Therefore let us first discuss dummy variables.
If one has a categorical variable which has $j$ possible outcomes, the simplest and most obvious thing to do would be to introduce $j$ regressors into the equation, each taking the value 1 if the observation has this level, and the value 0 otherwise. But if one does this, one has to leave the intercept out of the regression, otherwise one gets perfect multicollinearity. Usually in practice one keeps the intercept and omits one of the dummy variables. This makes it a little more difficult to interpret the dummy variables.
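As a small illustration, here is a minimal R sketch of this "dummy variable trap"; the category names and data are made up:

    set.seed(1)
    category <- sample(c("a", "b", "c"), 30, replace = TRUE)
    y <- rnorm(30)
    da <- as.numeric(category == "a")
    db <- as.numeric(category == "b")
    dc <- as.numeric(category == "c")

    # All three dummies plus the intercept are perfectly multicollinear
    # (da + db + dc = 1 for every observation), so lm() drops one (shown as NA):
    coef(lm(y ~ da + db + dc))

    # Usual practice: keep the intercept and omit one dummy; "a" becomes the
    # reference category, whose mean is the intercept.
    coef(lm(y ~ db + dc))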
Problem 441. In the intermediate econometrics textbook [WW79], the following regression line is estimated:

(45.1.1) $b_t = 0.13 + 0.68\, y_t + 0.23\, w_t + \hat\varepsilon_t$,

where $b_t$ is the public purchase of Canadian government bonds (in billion dollars), $y_t$ is the national income, and $w_t$ is a dummy variable with the value $w_t = 1$ for the war years 1940–45, and zero otherwise.
• a. 1 point This equation represents two regression lines, one for peace and one for war, both of which have the same slope, but which have different intercepts. What is the intercept of the peacetime regression line, and what is that of the wartime regression line?
Answer. In peace, $w_t = 0$, therefore the regression reads $b_t = 0.13 + 0.68 y_t + \hat\varepsilon_t$, therefore the intercept is 0.13. In war, $w_t = 1$, therefore $b_t = 0.13 + 0.68 y_t + 0.23 + \hat\varepsilon_t$, therefore the intercept is 0.36.
• b. 1 point What would the estimated equation have been if, instead of $w_t$, they had used a variable $p_t$ with the values $p_t = 0$ during the war years, and $p_t = 1$ otherwise? (Hint: the coefficient for $p_t$ will be negative, because the intercept in peace times is below the intercept in war times.)
Answer. Now the intercept of the whole equation is the intercept of the war regression line, which is 0.36, and the coefficient of $p_t$ is the difference between the peace and war intercepts, which is $-0.23$; the estimated equation would have been $b_t = 0.36 + 0.68 y_t - 0.23 p_t + \hat\varepsilon_t$.
Answer. From $b_t = 0.13 + 0.68 y_t + 0.23 w_t + \hat\varepsilon_t$ follows $1000 b_t = 130 + 0.68 \cdot 1000 y_t + 230 w_t + 1000 \hat\varepsilon_t$, or

(45.1.4) $b^{(m)}_t = 130 + 0.68\, y^{(m)}_t + 230\, w_t + \hat\varepsilon^{(m)}_t$,

where $b^{(m)}_t$ is bond sales in millions (i.e., $b^{(m)}_t = 1000 b_t$), and $y^{(m)}_t$ is national income in millions.
Problem 442. 5 points Assume you run a time series regression $\mathbf{y} = X\beta + \boldsymbol\varepsilon$, but you have reason to believe that the values of the parameter $\beta$ are not equal in all time periods $t$. What would you do?
Answer. Include dummies, run separate regressions for subperiods, or use a varying parameter model.
A dummy variable may also modify the relation of another explanatory variable to the response variable. The presence of such interaction can be modeled by including products of the dummy variables with the explanatory variable with which interaction exists.
How do you know the interpretation of the coefficients of a given set of dummies? Write the equation for every category separately. E.g. [Gre97, p. 383]:

Winter: $y = \beta_1 + \beta_5 x$
Spring: $y = \beta_1 + \beta_2 + \beta_5 x$
Summer: $y = \beta_1 + \beta_3 + \beta_5 x$
Autumn: $y = \beta_1 + \beta_4 + \beta_5 x$

I.e., the overall intercept $\beta_1$ is the intercept in Winter, the coefficient for the first seasonal dummy, $\beta_2$, is the difference between Spring and Winter, that for the second dummy, $\beta_3$, the difference between Summer and Winter, and $\beta_4$ the difference between Autumn and Winter.
If the slope differs too, one must in addition include the products of the dummies with $x$.

An alternative to using dummy variables is to use factor variables. If one includes a factor variable in a regression formula, the statistical package converts it into a set of dummies. Look at Section 22.5 for an example of how to use factor variables instead of dummies in R.
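A minimal R sketch of this expansion, using made-up data patterned on the seasonal example above; Winter is the reference level:

    set.seed(1)
    season <- factor(rep(c("Winter", "Spring", "Summer", "Autumn"), each = 10),
                     levels = c("Winter", "Spring", "Summer", "Autumn"))
    x <- rnorm(40)
    y <- 1 + 0.5 * (season == "Spring") + 1.0 * (season == "Summer") +
         1.5 * (season == "Autumn") + 2 * x + rnorm(40, sd = 0.1)

    head(model.matrix(~ season + x))  # the automatically generated dummy columns
    coef(lm(y ~ season + x))          # intercept = Winter; the others are the
                                      # differences beta_2, beta_3, beta_4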
45.2 Flexible Functional Form for Numerical Variables
Here the issue is: how to find the right transformation of the explanatory variables before running the regression? Each of the methods to be discussed has a smoothing parameter.

To fix notation, assume for now that only one explanatory variable $x$ is given, and you want to estimate the model $\mathbf{y} = f(x) + \boldsymbol\varepsilon$ with the usual assumption $\boldsymbol\varepsilon \sim (\mathbf{o}, \sigma^2 I)$. But whereas the regression model specified that $f$ is an affine function, we allow $f$ to be an element of an appropriate larger function space. The size of this space is characterized by a so-called smoothing parameter.
45.2.1. Polynomial Regression. The most frequently used method is polynomial regression, i.e., one chooses $f$ to be a polynomial of order $m$ (i.e., it has $m$ terms, including the constant term) or degree $m - 1$ (i.e., the highest power is $x^{m-1}$):

$f(x) = \theta_0 + \theta_1 x + \cdots + \theta_{m-1} x^{m-1}.$
Motivation: This is a seamless generalization of ordinary least squares, since affine functions are exactly polynomials of degree 1 (order 2). Taylor's theorem says that any $f \in W_m[a, b]$ can be approximated by a polynomial of order $m$ (degree $m - 1$) plus a remainder term which can be written as an integral involving the $m$th derivative; see [Eub88, (3.5) on p. 90]. The Weierstrass Approximation Theorem says that any continuous function over a closed and bounded interval can be uniformly approximated by polynomials of sufficiently high degree.
Here one has to decide what degree to use; the degree of the polynomial plays here the role of the smoothing parameter.
Some practical hints:
For higher degree polynomials, don't use the "power basis" $1, x, x^2, \dots, x^{m-1}$; there are two reasonable choices instead. Either one can use Legendre polynomials [Eub88, (3.10) and (3.11) on p. 54], which are obtained from the power basis by Gram-Schmidt orthonormalization over the interval $[a, b]$. This does not make the design matrix orthogonal, but at least one should expect it not to be too ill-conditioned, and the roots and the general shape of Legendre polynomials are well understood. As the second main choice, one may select polynomials that make the design matrix itself exactly orthonormal. The Splus-function poly does that.
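As an illustration, here is a minimal R sketch comparing the conditioning of the power basis with the orthogonal basis produced by R's poly (the analogue of the S-Plus function just mentioned); the degree 7 and the grid are arbitrary choices:

    x <- seq(0, 1, length.out = 100)
    X_power <- outer(x, 0:7, `^`)             # power basis 1, x, ..., x^7
    X_orth  <- cbind(1, poly(x, degree = 7))  # orthogonal polynomials plus intercept

    kappa(X_power, exact = TRUE)  # huge condition number: severely ill-conditioned
    kappa(X_orth,  exact = TRUE)  # small: well-conditioned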
The $j$th Legendre polynomial has exactly $j$ real roots in the interval [Dav75, Chapter X], [Sze59, Chapter III]; the orthogonal polynomials probably have a similar property. This gives another justification for using polynomial regression, which is similar to the justification one sometimes reads for using Fourier series: the data have high-frequency and low-frequency components, and one wants to isolate the low-frequency components.
In practice, polynomials do not always give a good fit. There are better alternatives available, which will be discussed in turn.

45.2.2. The Box-Cox Transformation. An early attempt used in Econometrics was to use a family of functions which is not as complete as the polynomials but which encompasses many functional forms encountered in Economics. These functions are only defined for $x > 0$ and have the form
$B(x, \lambda) = \begin{cases} \dfrac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0. \end{cases}$

[DM93, p. 484] have a plot with the curves for $\lambda = 1.5, 1, 0.5, 0, -0.5$, and $-1$. They point out a serious disadvantage of this transformation: if $\lambda \neq 0$, $B(x, \lambda)$ is bounded either from below or from above. For $\lambda < 0$, $B(x, \lambda)$ cannot be greater than $-1/\lambda$, and for $\lambda > 0$, it cannot be less than $-1/\lambda$.
About the Box-Cox transformation, read [Gre97, 10.4].
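A minimal R sketch of the transformation and of the bound just described; box_cox is our own illustrative function, not a library routine:

    box_cox <- function(x, lambda) {
      stopifnot(all(x > 0))  # the transformation is only defined for x > 0
      if (lambda == 0) log(x) else (x^lambda - 1) / lambda
    }

    # The bound noted by [DM93]: for lambda > 0, B(x, lambda) can never fall
    # below -1/lambda, no matter how close x gets to 0.
    vals <- box_cox(seq(0.001, 3, by = 0.001), lambda = 0.5)
    min(vals) >= -1 / 0.5  # TRUE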
45.2.3. Kernel Estimates. For nonparametric estimates look at [Loa99], which comes with the R package locfit.

Figure 1.1 there is a good example: it is actuarial data, which are roughly fitted by a straight line, but a better idea of the accelerations and decelerations can be very useful for a life insurance company.
Chapter 1 gives a historical overview: Spencer's rule from 1904 was designed for computational convenience (for hand-calculations), and it reproduces polynomials up to the 3rd degree. Figure 2.1 illustrates how local regression is done. Pp. 18/19: the emphasis is on fitted values, not on the parameter estimates. There are two important parameters: the bandwidth and the degree of the polynomial. To see the effects of bandwidth, see the plots on p. 21: using our data we can do plots of the sort plot(locfit(r ~ year, data = uslt, alpha = 0.1, deg = 3), get.data = TRUE) and then vary alpha and deg.
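A sketch of such an exploration, assuming the locfit package is installed and the course dataset uslt with the profit-rate variable r is loaded; the grids over alpha and deg are arbitrary illustrative choices:

    library(locfit)  # Loader's local regression package [Loa99]

    # Refit with several bandwidths (alpha) and local polynomial degrees (deg)
    # to see how the smooth changes.
    par(mfrow = c(3, 3))
    for (a in c(0.1, 0.3, 0.7)) {
      for (d in 1:3) {
        fit <- locfit(r ~ year, data = uslt, alpha = a, deg = d)
        plot(fit, get.data = TRUE, main = sprintf("alpha = %.1f, deg = %d", a, d))
      }
    }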
Problem 443. What kind of smoothing would be best for the time series of the variable r (profit rate) in dataset uslt?
Problem 444. Locally constant smooths are not good at the edges, and also not at the maxima and minima of the data. Why not?

The kernel estimator can be considered a local fit of a constant. At an edge, or at a maximum or minimum, the data in the smoothing window lie systematically on one side of the local level, so a locally fitted constant is biased there. Straight lines are better, and cubic parabolas even better; quadratic ones are not as good.
The birth rate data, which require smoothing with a varying bandwidth, are interesting; see Simonoff, p. 157, with the description in the text on p. 158.
45.2.4. Regression Splines. About the word "spline," [Wah90, p. vii] writes: "The mechanical spline is a thin reedlike strip that was used to draw curves needed in the fabrication of cross sections of ships' hulls. Ducks or weights were placed on the strip to force it to go through given points, and the free portion of the strip would assume a position in space that minimized the bending energy."
One of the drawbacks of polynomial regression is that its fit is global. One method to provide for local fits is to fit a piecewise polynomial. A spline is a piecewise polynomial of order $m$ (degree $m - 1$) spliced together at given "knots" so that all derivatives coincide up to and including the $m - 2$nd one. Polynomial splines are generalizations of polynomials: whereas one can characterize polynomials of order $m$ (degree $m - 1$) as functions whose $m - 1$st derivative is constant, polynomial splines are functions whose $m - 1$st derivative is piecewise constant. This is the smoothest way to put different polynomials together. Compare the Splus-function bs.

If one starts with a cubic spline, i.e., a spline of order 4, and postulates in addition that the 2nd derivative is zero outside the boundary points, one obtains what is called a "natural cubic spline"; compare the Splus-function ns. There is exactly one natural spline going through $n$ data points.
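A minimal R sketch with the splines package, whose bs and ns are the R analogues of the S-Plus functions just mentioned; the data and knot locations are made up:

    library(splines)

    set.seed(2)
    x <- sort(runif(80, 0, 10))
    y <- sin(x) + rnorm(80, sd = 0.2)

    fit_bs <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5)))  # cubic spline (order 4)
    fit_ns <- lm(y ~ ns(x, knots = c(2.5, 5, 7.5)))  # natural cubic spline

    plot(x, y)
    lines(x, fitted(fit_bs), lty = 1)
    lines(x, fitted(fit_ns), lty = 2)  # linear beyond the boundary knots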
One has to choose the order and the location of the knots. The most popular are cubic splines, and higher orders do not seem to add much; therefore it is more important to concentrate on a good selection of the knots. Here are some guidelines how to choose knots, taken from [Eub88, p. 357]:

For $m = 2$, linear splines, place knots at points where the data exhibit a change in slope.

For $m = 3$, quadratic splines, locate knots near local maxima, minima, or inflection points in the data.

For $m = 4$, cubic splines, arrange the knots so that they are close to inflection points in the data, and not more than one extreme point (maximum or minimum) and one inflection point occurs between any two knots.
It is also possible to determine the number of knots and select their location so as to optimize the fit. But this is a hairy minimization problem; [Eub88, p. 362] gives some shortcuts.

Extensions: Sometimes one wants less smoothness at certain knots; this can be obtained by letting several knots coincide. Or one wants polynomials of different degrees in the different segments.
[Gre97, pp. 389/90] has a nice example for a linear spline. Each of 3 different age groups has a different slope and a different intercept: $t < t^*$, $t^* \le t < t^{**}$, and $t^{**} \le t$. These age groups are coded by the matrix $D$ consisting of two dummy variables, one for $t \ge t^*$ and one for $t \ge t^{**}$, i.e., $D = \begin{bmatrix} d^{(1)} & d^{(2)} \end{bmatrix}$ where $d^{(1)}_j = 1$ if age $t_j \ge t^*$ and $d^{(2)}_j = 1$ if $t_j \ge t^{**}$. Throwing $D$ into the regression allows for different intercepts in these different age groups.

In order to allow the slopes with respect to $t$ to vary too, we need a matrix $E$, again consisting of 2 columns, so that $e_{j1} = t_j$ if $t_j \ge t^*$ and 0 otherwise, and $e_{j2} = t_j$ if $t_j \ge t^{**}$ and 0 otherwise. Each column of $E$ is the corresponding column of $D$ element-wise multiplied with $t$, i.e., $E = \begin{bmatrix} d^{(1)} * t & d^{(2)} * t \end{bmatrix}$.
If one then writes the model as $\mathbf{y} = D\gamma + E\delta + X\beta$, one gets an unconstrained model with 3 different slopes and 3 different intercepts. Assume for example there are 3 observations in the first age group, 2 in the second, and 4 in the third; then

$D = \begin{bmatrix} 0&0\\ 0&0\\ 0&0\\ 1&0\\ 1&0\\ 1&1\\ 1&1\\ 1&1\\ 1&1 \end{bmatrix}, \qquad E = \begin{bmatrix} 0&0\\ 0&0\\ 0&0\\ t_4&0\\ t_5&0\\ t_6&t_6\\ t_7&t_7\\ t_8&t_8\\ t_9&t_9 \end{bmatrix},$

and in vector form the regression reads

(45.2.3) $\mathbf{y} = \boldsymbol\iota\beta_1 + \beta_2 t + \beta_3 x + \gamma_1 d^{(1)} + \delta_1 d^{(1)} * t + \gamma_2 d^{(2)} + \delta_2 d^{(2)} * t + \boldsymbol\varepsilon.$
This is how [Gre97, equation (8.3) on p. 389] should be understood. The $j$th observation has the form

(45.2.4) $y_j = \beta_1 + \beta_2 t_j + \beta_3 x_j + \gamma_1 d^{(1)}_j + \delta_1 d^{(1)}_j t_j + \gamma_2 d^{(2)}_j + \delta_2 d^{(2)}_j t_j + \varepsilon_j.$
An observation at the year $t^*$ has, according to the formula for $\ge t^*$, the form

(45.2.5) $y_* = \beta_1 + \beta_2 t^* + \beta_3 x_* + \gamma_1 + \delta_1 t^* + \varepsilon_*$,

but had the formula for $< t^*$ still applied, the equation would have been

(45.2.6) $y_* = \beta_1 + \beta_2 t^* + \beta_3 x_* + \varepsilon_*.$

If the spline is to be continuous at $t^*$, these two representations must coincide, which requires $\gamma_1 + \delta_1 t^* = 0$. Again, equality of the two representations at the second knot $t^{**}$ requires $\gamma_2 + \delta_2 t^{**} = 0$.

These two constraints can be written as

$\begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix} = -\begin{bmatrix} t^* & 0 \\ 0 & t^{**} \end{bmatrix} \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix},$

or $\gamma = -W\delta$, where $W = \begin{bmatrix} t^* & 0 \\ 0 & t^{**} \end{bmatrix}$.

Plugging this into $\mathbf{y} = D\gamma + E\delta + X\beta$ gives $\mathbf{y} = -DW\delta + E\delta + X\beta = F\delta + X\beta$, where $F = E - DW$ is a dummy matrix whose columns are $f_{j1} = d^{(1)}_j (t_j - t^*)$ and $f_{j2} = d^{(2)}_j (t_j - t^{**})$.
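A hedged R sketch of this construction, with made-up data and hypothetical knot values 30 and 50 standing in for $t^*$ and $t^{**}$; it fits both the unconstrained model and the continuity-constrained one:

    set.seed(3)
    age <- sort(runif(100, 20, 60))
    x <- rnorm(100)
    y <- 1 + 0.2 * age + x + 0.5 * pmax(age - 30, 0) -
         0.8 * pmax(age - 50, 0) + rnorm(100, sd = 0.3)

    d1 <- as.numeric(age >= 30)  # the dummy d(1)
    d2 <- as.numeric(age >= 50)  # the dummy d(2)

    # Unconstrained model y = D gamma + E delta + X beta: jumps allowed at knots
    unconstrained <- lm(y ~ age + x + d1 + I(d1 * age) + d2 + I(d2 * age))

    # Imposing gamma = -W delta leaves F = E - DW, whose columns are the
    # truncated terms d(1)*(age - 30) and d(2)*(age - 50): a continuous spline
    f1 <- d1 * (age - 30)
    f2 <- d2 * (age - 50)
    constrained <- lm(y ~ age + x + f1 + f2)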
This construction does not need a categorical variable coding the three levels of $t$; but in other dummy variable settings such a categorical variable is necessary.

Now the regressand is not necessarily $\mathbf{y}$ but may be a transformation $g(\mathbf{y})$, and the $k$ regressors have the form $f_i(Z)$, where the functions $f_i$ are linearly independent. For instance $f_1(Z) = \begin{bmatrix} z_{11} & z_{21} \end{bmatrix}^\top$ may pick out the first column of $Z$, and $f_2(Z) = \begin{bmatrix} z_{11}^2 & z_{21}^2 \end{bmatrix}^\top$ the square of the first column. The functions $g$ and $f_i$ define the relationship between the given economic variables and the variables in the regression. [Gre97, Definition 8.1 on p. 396] says something about the relationship between the parameters of interest and the regression coefficients: if the $k$ regression coefficients $\beta_1, \dots, \beta_k$ can be written as $k$ one-to-one possibly nonlinear functions of the $k$ underlying parameters $\theta_1, \dots, \theta_k$, then the model is intrinsically linear in $\theta$.
[Gre97, pp. 391/2] brings the example of a regression with an interaction term:

(45.2.11) $\mathbf{y} = \boldsymbol\iota\beta_1 + \mathbf{s}\beta_2 + \mathbf{w}\beta_3 + \mathbf{s} * \mathbf{w}\,\beta_4 + \boldsymbol\varepsilon.$

Say the underlying parameters of interest are $\partial\mathrm{E}[y_t]/\partial s_t = \beta_2 + \beta_4 w_t$, $\partial\mathrm{E}[y_t]/\partial w_t = \beta_3 + \beta_4 s_t$, and the second derivative $\partial^2\mathrm{E}[y_t]/\partial s_t \partial w_t = \beta_4$. Here the first two parameters of interest depend on the values of the explanatory variables, and one has to select a value; usually one takes the mean or some other central value, but for braking distance some extreme value may be more interesting.
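A sketch of how such a parameter of interest could be computed in R after estimating (45.2.11); the data and names are made up:

    set.seed(5)
    s <- runif(200)
    w <- runif(200)
    y <- 1 + 2 * s + 3 * w + 1.5 * s * w + rnorm(200, sd = 0.2)

    fit <- lm(y ~ s * w)  # expands to s + w + s:w, as in (45.2.11)
    b <- coef(fit)

    # Marginal effect of s, evaluated at the mean of w (a common central value)
    b["s"] + b["s:w"] * mean(w)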
[Gre97, example 8.4 on p. 396] is a maximum likelihood model that can also be estimated as an intrinsically linear regression model. I did not find the reference where he discussed this earlier; perhaps I have to look in the earlier edition. Here maximum likelihood is far better. Greene asks why and answers: least squares does not use one of the sufficient statistics.

[Gre97, example 8.5 on pp. 397/8] starts with a CES production function, then makes a Taylor development, and this Taylor development is an intrinsically linear regression of the 4 parameters involved. Greene computes the Jacobian matrix necessary to get the variances. He compares that with doing nonlinear least squares on the production function directly, and gets widely divergent parameter estimates.
45.2.5. Smoothing Splines. This seems the most promising approach. If one estimates a function by a polynomial of order $m$ or degree $m - 1$, then this means that one sets the $m$th derivative to zero. An approximation to a polynomial would be a function whose $m$th derivative is small. We will no longer assume that the fitting functions are themselves polynomials, but we will assume that $f \in W_m[a, b]$, which means $f$ itself and its derivatives up to and including the $m - 1$st derivative are absolutely continuous over a closed and bounded interval $[a, b]$, and the $m$th derivative is square integrable over $[a, b]$.

If we allow such a general $f$, then the estimation criterion can no longer be the minimization of the sum of squared errors, because in this case one could simply choose an interpolant of the data, i.e., an $f$ which satisfies $f(x_i) = y_i$ for all $i$. Instead, the estimation criterion must be a constrained or penalized least squares criterion (analogous to OLS with an exact or random linear constraint) which has a penalty for roughness: one minimizes

$\sum_{i=1}^n \bigl(y_i - f(x_i)\bigr)^2 + \lambda \int_a^b \bigl(f^{(m)}(x)\bigr)^2 \, dx,$

where the smoothing parameter $\lambda \ge 0$ governs the trade-off between fidelity to the data and smoothness of $f$.
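For the cubic case $m = 2$, base R's smooth.spline minimizes such a penalized criterion; here is a minimal sketch with made-up data, where the argument spar plays the role of the smoothing parameter:

    set.seed(4)
    x <- seq(0, 1, length.out = 100)
    y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

    fit_rough  <- smooth.spline(x, y, spar = 0.2)  # small penalty: wiggly,
                                                   # close to interpolation
    fit_smooth <- smooth.spline(x, y, spar = 1.2)  # large penalty: approaches
                                                   # a straight line

    plot(x, y)
    lines(fit_rough,  lty = 1)
    lines(fit_smooth, lty = 2)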