Sociedad de Estadística e Investigación Operativa
Test (1999) Vol. 8, No. 2, pp. 419-458
Integration and backfitting methods in additive models: finite sample properties and comparison
Key Words: Additive models, curse of dimensionality, dimensionality reduction, model choice, nonparametric regression
S. Sperlich, O.B. Linton and W. Härdle
structure is desirable from a purely statistical point of view because it circumvents the curse of dimensionality. There has been much theoretical and applied work in econometrics on semiparametric and nonparametric methods; see Härdle and Linton (1994), Newey (1990), and Powell (1994) for bibliography and discussion. Some recent work has shown that additivity has important implications for the rate at which certain components can be estimated. In this paper we consider the finite sample performance of two popular estimators for additive models: the backfitting estimators of Hastie and Tibshirani (1990) and the integration estimators of Linton and Nielsen (1995).
Let (X, Y) be a random variable with X of dimension d and Y a scalar. Consider the estimation of the regression function m(x) = E(Y | X = x) based on a random sample \{(X_i, Y_i)\}_{i=1}^{n} from this population. Stone (1980, 1982) and Ibragimov and Hasminskii (1980) showed that the optimal rate for estimating m is n^{\ell/(2\ell+d)} with \ell an index of smoothness of m. An additive structure for m is a regression function of the form

m(x) = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha),    (1)

where the m_\alpha are one-dimensional nonparametric functions operating on each element of the vector of predictor variables with E\{m_\alpha(X_\alpha)\} = 0. Stone (1985, 1986) showed that for such regression curves the optimal rate for estimating m is the one-dimensional rate of convergence n^{\ell/(2\ell+1)}. Thus one speaks of dimensionality reduction through additive modelling.
In practice, the backfitting procedures proposed in Breiman and Friedman (1985) and Buja, Hastie and Tibshirani (1989) are widely used to estimate the additive components. The latter (equation (18)) consider the problem of finding the projection of m onto the space of additive functions representing the right hand side of (1). Replacing population by sample, this leads to a system of normal equations of dimension nd x nd. To solve this in practice, the backfitting or Gauss-Seidel algorithm is usually used; see Venables and Ripley (1994). This technique is iterative and depends on the starting values and convergence criterion. It converges very fast but has, in comparison with the direct solution of the large linear system, the slight disadvantage of a more complicated "hat matrix"; see Härdle and Hall (1993). These methods have been evaluated on numerous datasets
and have been refined quite considerably since their introduction.
Recently, Linton and Nielsen (1995), Tjøstheim and Auestad (1994), and Newey (1994) have independently proposed an alternative procedure for estimating m_\alpha based on integration of a standard kernel estimator. It exploits the following idea. Suppose that m(x, z) is any bivariate function, and consider the quantities \mu_1(x) = \int m(x, z) \, dQ_2(z) and \mu_2(z) = \int m(x, z) \, dQ_1(x), where Q_\alpha is a probability measure. If m(x, z) = m_1(x) + m_2(z), then \mu_1(\cdot) and \mu_2(\cdot) are m_1(\cdot) and m_2(\cdot), respectively, up to a constant. In practice one replaces m by an estimate and integrates with respect to some known measure. The procedure is explicitly defined and its asymptotic distribution is easily derived: it converges at the one-dimensional rate and satisfies a central limit theorem. This estimation procedure has been extended to a number of other contexts: to estimating derivatives (Severance-Lossin and Sperlich, 1997), to the generalized additive model (Linton and Härdle, 1996), to dependent variable transformation models (Linton, Chen, Wang, and Härdle, 1997), to econometric time series models (Masry and Tjøstheim, 1995, 1997), to panel data models (Porter, 1996), and to hazard models with time varying covariates and right censoring (Nielsen, 1996). In this wide variety of sampling schemes and procedures the asymptotics have been derived because of the explicit form of the estimator. By contrast, backfitting and backfitting-like methods had until recently eluded theoretical analysis, until Opsomer and Ruppert (1997) provided conditional mean squared error expressions, albeit under rather strong conditions on the smoothing matrices and design. More recently, Linton, Mammen, and Nielsen (1998) have established a central limit theorem for a modified form of backfitting which uses a bivariate integration step as well as the iterative updating of the other methods.
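The integration idea is easy to illustrate numerically: pre-estimate the bivariate m(x, z) with a standard smoother and average it over the empirical distribution of the nuisance direction. The following is a minimal sketch, not the authors' implementation; the function names, the Gaussian Nadaraya-Watson pre-estimator, and the test design are our illustrative choices.

```python
import numpy as np

def nw2(x, z, X, Z, Y, h):
    """Bivariate Nadaraya-Watson pre-estimate of m(x, z) with a Gaussian kernel."""
    w = np.exp(-0.5 * (((x - X) ** 2 + (z - Z) ** 2) / h ** 2))
    return np.sum(w * Y) / np.sum(w)

def mu1(x, X, Z, Y, h):
    """Integrate the pre-estimator over the empirical measure of Z:
    recovers m1(x) up to a constant when m(x, z) = m1(x) + m2(z)."""
    return np.mean([nw2(x, zi, X, Z, Y, h) for zi in Z])

rng = np.random.default_rng(0)
n = 500
X, Z = rng.uniform(-3, 3, n), rng.uniform(-3, 3, n)
Y = np.sin(X) + 0.25 * Z ** 2 + rng.normal(0, 0.5, n)

# Differences of mu1 estimate differences of m1 = sin (the constant drops out).
d_hat = mu1(1.5, X, Z, Y, 0.5) - mu1(-1.5, X, Z, Y, 0.5)
```

Because the additive constant is absorbed into the integral, only differences (or a centering step) identify m_1; the estimators in Section 2 handle this with an explicit estimate of c.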
The purpose of this paper is to investigate the finite sample performance of the standard backfitting estimator and the integration estimator.
Denote the density of the d-dimensional explanatory variable by p(x) with marginals p_\alpha(x_\alpha), \alpha = 1, ..., d. We shall sometimes partition X_i = (X_{\alpha i}, X_{\underline{\alpha} i}^T)^T and x = (x_\alpha, x_{\underline{\alpha}}^T)^T into scalar and (d-1)-dimensional subvectors respectively, calling x_\alpha the direction of interest and x_{\underline{\alpha}} the directions not of interest; denote by p_{\underline{\alpha}}(x_{\underline{\alpha}}) the marginal density of the vector X_{\underline{\alpha} i}. In the following we assume the following additive form for the regression function.
The local polynomial estimator is defined through the minimization

(\hat\theta_0, \hat\theta_1) = \arg\min_{\theta_0, \theta_1} \sum_{i=1}^{n} \{Y_i - P_q(\theta_0, \theta_1; X_i - x)\}^2 \prod_{\alpha=1}^{d} K_\alpha\!\left(\frac{X_{\alpha i} - x_\alpha}{h_\alpha}\right),    (2.1)

where K_\alpha and h_\alpha, \alpha = 1, ..., d, are scalar kernels and bandwidths respectively, while P_q(\theta_0, \theta_1; t) is a (q-1)-th order polynomial in the vector t with coefficients \theta_0, \theta_1 for which P_q(\theta_0, \theta_1; 0) = \theta_0 and, e.g., P_2(\theta_0, \theta_1; t) = \theta_0 + \theta_1^T t. Let \hat m(x) = \hat\theta_0(x). Under regularity conditions, see Ruppert and Wand (1995) for example, the local polynomial estimator satisfies

\sqrt{n h^d} \left( \hat m(x) - m(x) - h^q \mu_q(K) b(x) \right) \stackrel{d}{\longrightarrow} N(0, \nu(K) v(x)),    (2.2)

where h = (\prod_{\alpha=1}^{d} h_\alpha)^{1/d} is the geometric average of the bandwidths, \mu_q(K) and \nu(K) are constants depending only on the kernels, while v(x) = \sigma^2(x)/p(x) and b(x) is the bias function depending on derivatives of m, and possibly p, up to and including order q. The (mean squared error) optimal bandwidth is of order n^{-1/(2q+d)}, for which the asymptotic mean squared error is of order n^{-2q/(2q+d)}; see Härdle and Linton (1994). This reflects the curse of dimensionality: as d increases, the rate of convergence decreases.
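For concreteness, the local linear case (q = 2 above) can be computed as the intercept of a kernel-weighted least squares regression of Y on (X - x). This is a hedged sketch with product Gaussian kernels; the helper names and the test design are ours, not from the paper.

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear estimate of m(x): intercept of a kernel-weighted
    least squares fit of Y on (X - x). X is (n, d); x and h are (d,)."""
    Z = X - x                                        # centered regressors
    w = np.exp(-0.5 * np.sum((Z / h) ** 2, axis=1))  # product Gaussian kernel
    D = np.hstack([np.ones((len(Y), 1)), Z])         # design: P_2(t) = theta0 + theta1' t
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
    return theta[0]                                  # m_hat(x) = theta0_hat

rng = np.random.default_rng(3)
n = 500
X = rng.uniform(-3, 3, (n, 2))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.5, n)
# true m(0, 1) = sin(0) + 0.5 * 1 = 0.5
est = local_linear(np.array([0.0, 1.0]), X, Y, h=np.array([0.6, 0.6]))
```

The weighted problem is solved via `lstsq` on the square-root-weighted design, which is numerically equivalent to the normal equations of (2.1).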
When m(\cdot) satisfies the additive model structure, we can estimate m(x) with a better rate of convergence by imposing these restrictions. Let

\tilde m_\alpha(x_\alpha) = \int \hat m_h(x_\alpha, x_{\underline\alpha}) \, dQ_{\underline\alpha}(x_{\underline\alpha}) - \hat c,    (2.3)
where \hat c is an estimate of c, while Q_{\underline\alpha}(\cdot) is some easy to compute probability measure. The most convenient choice of Q_{\underline\alpha}(\cdot) is the empirical measure of \{X_{\underline\alpha i}\}_{i=1}^{n}, which converges to the population distribution. It changes the integral in (2.3) to a sum over terms evaluated at X_{\underline\alpha i} and implies c = E(Y) for the constant. The latter can be estimated root-n consistently by the sample mean n^{-1} \sum_{i=1}^{n} Y_i; an alternative estimate, which is not necessarily root-n consistent, is provided by n^{-1} \sum_{i=1}^{n} \hat m_h(X_i). Whatever the estimates of c and \tilde m_\alpha(x_\alpha), we reestimate m(x) by

\tilde m(x) = \hat c + \sum_{\alpha=1}^{d} \tilde m_\alpha(x_\alpha).    (2.4)

Linton and Härdle (1996) derived the pointwise asymptotic properties of the empirical integration versions of \tilde m_\alpha(x_\alpha) and \tilde m(x). To simplify matters, we set h_\alpha = h_1 and K_\alpha = K, while \prod_{\beta \neq \alpha} K_\beta = L and h_\beta = h_2 for all \beta \neq \alpha. Under their regularity conditions,
\sqrt{n h_1} \left( \tilde m(x) - m(x) - h_1^q b_0(x) \right) \stackrel{d}{\longrightarrow} N(0, v_0(x)),    (2.5)

where b_0(x) = \sum_{\alpha=1}^{d} b_{\alpha 0}(x_\alpha) and v_0(x) = \sum_{\alpha=1}^{d} v_{\alpha 0}(x_\alpha), in which b_{\alpha 0}(x_\alpha) and v_{\alpha 0}(x_\alpha) involve constants depending only on the kernel K. By choosing h_1 \propto n^{-1/(2q+1)} one can achieve the optimal rate of convergence, i.e., mean squared error of order n^{-2q/(2q+1)}, which is independent of the dimension d. See also Linton and Nielsen (1995) and Severance-Lossin and Sperlich (1997).
Remark 2.1. The bandwidths h_2, ..., h_d should be chosen differently, as we discuss further in the simulation section. To achieve the optimal rate of convergence, we must impose some restrictions on the bandwidth sequences. This condition, which corresponds to (A7) in Linton and Härdle (1996), is needed for bias reduction of the nuisance components. In Section 3 we will examine some bandwidth selection methods.
which can be formulated inside a Hilbert space framework: let [\mathcal{H}_{YX}, \langle\cdot,\cdot\rangle] be the Hilbert space of random variables which are functions of Y and X with \langle a, b \rangle = E(ab); let also [\mathcal{H}_X, \langle\cdot,\cdot\rangle] and [\mathcal{H}_{X_\alpha}, \langle\cdot,\cdot\rangle], \alpha = 1, ..., d, be corresponding subspaces, where for example \mathcal{H}_{X_\alpha} contains only functions of X_\alpha. The above problem is equivalent to finding the element of the subspace \mathcal{H}_{X_1} \oplus \cdots \oplus \mathcal{H}_{X_d} closest to a point Y \in \mathcal{H}_{YX}, or equivalently the point m \in \mathcal{H}_X. By the projection theorem, there exists a unique solution which is characterized by the following first order conditions

E\left\{ Y - c - \sum_{\beta=1}^{d} m_\beta(X_\beta) \,\Big|\, X_\alpha \right\} = 0,

\alpha = 1, ..., d, which leads to the formal representation:

m_\alpha(x_\alpha) = E\left\{ Y - c - \sum_{\beta \neq \alpha} m_\beta(X_\beta) \,\Big|\, X_\alpha = x_\alpha \right\}.
until some prespecified tolerance is reached. The estimator is linear in Y, but the algorithm only converges under strong restrictions on the smoother matrices. Recent work by Opsomer and Ruppert (1997) discusses some improvements to this algorithm which are guaranteed to provide a unique solution. They also derive the conditional mean squared error of the resulting estimator under strong conditions; this has a similar expression to (2.5) in large samples.
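The Gauss-Seidel iteration can be sketched directly: cycle over the directions, smoothing the partial residuals against each regressor until the fits stabilize. This is a minimal illustration with Nadaraya-Watson smoothers; the function names, tolerances, and test design are our choices, not the authors'.

```python
import numpy as np

def nw_smooth(xg, xdata, ydata, h):
    """Nadaraya-Watson smooth of ydata on xdata, evaluated at the points xg."""
    w = np.exp(-0.5 * ((xg[:, None] - xdata[None, :]) / h) ** 2)
    return (w @ ydata) / w.sum(axis=1)

def backfit(X, Y, h, n_iter=50, tol=1e-6):
    """Gauss-Seidel backfitting for Y = c + sum_a m_a(X_a) + eps."""
    n, d = X.shape
    c = Y.mean()
    m = np.zeros((n, d))                  # component fits at the sample points
    for _ in range(n_iter):
        m_old = m.copy()
        for a in range(d):
            others = [b for b in range(d) if b != a]
            resid = Y - c - m[:, others].sum(axis=1)
            m[:, a] = nw_smooth(X[:, a], X[:, a], resid, h)
            m[:, a] -= m[:, a].mean()     # identification: E m_a(X_a) = 0
        if np.max(np.abs(m - m_old)) < tol:
            break
    return c, m

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(-3, 3, (n, 2))
comp1 = X[:, 0]                            # m_1(x) = x
comp2 = 0.25 * X[:, 1] ** 2 - 0.75         # m_2(x) = x^2/4, centered on U[-3,3]
Y = 1.0 + comp1 + comp2 + rng.normal(0, np.sqrt(0.5), n)
c_hat, m_hat = backfit(X, Y, h=0.4)
```

The centering step inside the loop imposes the identification condition E m_a(X_a) = 0 from Section 1; without it the components drift by arbitrary constants between iterations.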
3 Simulation Results

A.1 Introduction
In a number of different additive models, we determined the bias, variance and mean squared error for both estimation procedures. We considered designs with the following distributions: the uniform U[-3,3]^d and the normal with mean 0, variance 1 and varying covariance \rho = 0, 0.4, 0.8, denoted as N(\rho), for different numbers of observations and several dimensions. We drew all these designs once and kept them fixed for the investigation described in the following. The error term \varepsilon was always chosen as normally distributed with zero mean and variance \sigma_\varepsilon^2 = 0.5. Since both estimators are linear, i.e.,

\tilde m_\alpha(x_\alpha) = \sum_{i=1}^{n} w_{\alpha,i}(x) Y_i

for some weights \{w_{\alpha,i}(x)\}, we determined the conditional bias and variance as follows:

bias\{\tilde m_\alpha(x_\alpha) \mid X\} = \sum_{i=1}^{n} w_{\alpha,i}(x) m(X_i) - m_\alpha(x_\alpha),

var\{\tilde m_\alpha(x_\alpha) \mid X\} = \sigma_\varepsilon^2 \sum_{i=1}^{n} w_{\alpha,i}^2(x),

for the additive function estimators, and by analogy for the regression estimator. In the following notation the MSE denotes the mean squared error and the MASE the averaged MSE. We focused on the following questions:

a) What is a reasonable bandwidth choice for an optimal fit?
b) How sensitive are the estimators to the bandwidth?
c) What are the MASE, MSE, bias and variance, and boundary effects?

d) We considered degrees of freedom, eigen analysis, singular values and eigenvectors.

e) We plotted the equivalent kernel weights of the estimates, and

f) we investigated whether and when the asymptotics kick in.
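Because both estimators are linear in Y, the conditional bias and variance above can be computed exactly from the weight vector. A one-dimensional sketch with a Nadaraya-Watson smoother and the paper's error variance \sigma_\varepsilon^2 = 0.5 (helper names and the design are ours):

```python
import numpy as np

def nw_weights(x, X, h):
    """Nadaraya-Watson weight vector w_i(x) with a Gaussian kernel."""
    k = np.exp(-0.5 * ((x - X) / h) ** 2)
    return k / k.sum()

def conditional_bias_var(x, X, m_true, h, sigma2=0.5):
    """Exact conditional bias and variance of a linear smoother at x,
    following the formulas bias = sum_i w_i m(X_i) - m(x),
    var = sigma^2 sum_i w_i^2."""
    w = nw_weights(x, X, h)
    bias = w @ m_true(X) - m_true(x)
    var = sigma2 * np.sum(w ** 2)
    return bias, var

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 200)
bias, var = conditional_bias_var(0.0, X, np.sin, h=0.4)
mse = bias ** 2 + var
```

Averaging such pointwise MSE values over a grid or over the design points gives the MASE used throughout the tables.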
We examined how well the estimation procedures performed in estimating one additive function. The parameters are d = 2 dimensions and n = 100 observations. We considered all combinations of the following additive functions for a two dimensional additive model:

The advantages of using local polynomials are well known, especially with regard to the robustness against the choice of bandwidth and the improvement in bias, and consequently in mean squared error, if the requisite smoothness is present. In Severance-Lossin and Sperlich (1997) the consistency and asymptotic behavior of the integration estimator using local polynomials is shown. For these reasons we did the investigation for both the Nadaraya-Watson and the local linear estimator.
A.2 Bandwidth Choice
The choice of an appropriate smoothing parameter is always a critical point in nonparametric and semiparametric estimation. For the integration estimator we even need two bandwidths, h_1 and h_2; see Section 2. There exist at least two rules for choosing them: the rule of thumb of Linton and Nielsen (1995) and the plug-in method suggested in Severance-Lossin and Sperlich (1997). Both methods give the MASE minimizing bandwidth, the
Trang 9h~tegration and backfitting methods in additive models 427
first one approximately with the aid of parametric pre-estimators, the second one by using nonparametric pre-estimators. We give here the formulas for the case of local linear smoothers. The rule of thumb is
For a fair comparison of the optimal bandwidth and the corresponding MASE of both estimators we applied several procedures. We started by considering the minimal MASE of the overall regression function and the minimizing bandwidths. Then we looked for the bandwidths minimizing the MASE in each direction separately. To take into account the influence of boundary effects we also looked for the optimal bandwidths on trimmed data.

For small samples of 100 observations we could not extract any information by comparing the numerically MASE-minimizing bandwidths. They differed a lot depending on the particular drawn design. Therefore we focused on once-drawn, in that sense fixed, designs for the whole paper and considered only analytically determined bandwidths h_1. Thus we compared the results for bandwidths calculated with the rule of thumb proposed by Linton and Nielsen and the analytically optimal one.
Since the bandwidths that we found numerically for the particular designs in finite samples were not particularly illuminating, we do not report them in the tables. In Table 1 the bandwidths of the rule of thumb by Linton and Nielsen and the asymptotically optimal bandwidths for each estimation procedure are shown. Here we concentrated on bandwidths that minimize the MASE in each direction separately. They are displayed for the additive components m_3, m_4 versus the particular model and design. The behavior for m_1, m_2 is the same; the results can be requested from the authors. One can see very well the strong influence of the distribution and of the additive function that has to be estimated. Furthermore, not only do the bandwidths determined by theory-based rules differ a lot, we found them quite often far away from the MASE minimizing bandwidth value. This is also the case for the local linear smoothers. Mostly, the analytically chosen bandwidth was closer to the MASE minimizing one than the rule of thumb bandwidth, which, however, is much easier to calculate.

If the optimal value was infinity, we set it to 1, or in the case of a N(0.8) distributed design to 2. In formulas where we had to integrate over a density from -\infty to +\infty we did this, for numerical reasons, over the interval [-1.5, 1.5] for N(0.8) and over [-3, 3] otherwise.
Table 2 gives the optimal bandwidths for different distributions, models, estimation routines and criteria when using local linear smoothing.

Table 2: Asymptotically optimal bandwidths when using the local linear smoother
All findings from Table 1 are replicated here. Furthermore, note that for uncorrelated regressors the bandwidths are almost the same for the backfitting and integration methods, which is in accordance with the theoretically similar MASE. As mentioned above, we will now consider the choice of bandwidth for the local linear estimation procedure in more detail.
A.3 Robustness with respect to the Choice of Bandwidth
To find out how sensitive the estimators are with respect to the choice of bandwidth for the direction of interest, h_1, we plotted MASE and MSE_{x=0} against bandwidth for the two models m = m_2 + m_3 and m = m_2 + m_4. The parameters were kept unchanged or are mentioned in the captions of the respective figures. We present our results first for the uniform design on [-3,3]^2, then for designs with distribution N(0.0) and N(0.4); see Figures 1 to 6.

The results for MASE have been trimmed in the pictures, since otherwise they would have been dominated by boundary effects (compare with the tables in the next section). The results for the integration estimator are
Figure 1: Performance by bandwidth h_1 of MASE (top) and MSE_{x=0} (bottom) in model m = m_2 + m_3, separately for m_2 (left), m_3 (right). Design is X ~ U[-3,3]^2.
drawn throughout the paper as solid lines, those for the backfitting algorithm as dashed lines.

Obviously, the backfitting estimator is very sensitive to the choice of bandwidth. To get a small MASE it is crucially important for the backfitting method to choose a good smoothing parameter. For correlated designs oversmoothing seems slightly preferable; otherwise there is no particular advantage to either oversmoothing or undersmoothing. The behavior of the estimates for the highly correlated design is slightly strange and hard to interpret. This is true for both estimation procedures. Therefore we skipped the figures for the N(0.8) distributed design.
For the integration estimator the results differ depending on the model. In general this method is far less sensitive to the choice of bandwidth than the backfitting procedure. If we focus on the MSE_{x=0} we have similar results as for the MASE, but weaker with respect to the sensitivity. Here the results differ more depending on the data generating model.
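Sensitivity curves of this kind can be traced for any linear smoother by evaluating the averaged squared error on a grid of bandwidths. A minimal one-dimensional sketch; the design, model, and grid are our illustrative choices, not the paper's exact settings:

```python
import numpy as np

def nw_fit(xg, X, Y, h):
    """Nadaraya-Watson fit with a Gaussian kernel, evaluated at points xg."""
    w = np.exp(-0.5 * ((xg[:, None] - X[None, :]) / h) ** 2)
    return (w @ Y) / w.sum(axis=1)

rng = np.random.default_rng(2)
n = 100
X = rng.uniform(-3, 3, n)
m = np.sin(X)                                 # one additive component
Y = m + rng.normal(0, np.sqrt(0.5), n)

# averaged squared error against the known truth, over a bandwidth grid
grid = np.linspace(0.2, 1.0, 9)
mase = [np.mean((nw_fit(X, X, Y, h) - m) ** 2) for h in grid]
h_best = float(grid[int(np.argmin(mase))])
```

Plotting `mase` against `grid` reproduces, in outline, the bandwidth-sensitivity curves of Figures 1 to 6; in the simulations this is done conditionally on one fixed design.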
Since n = 100 observations are fairly sparse in a [-3,3]^2 rectangle, and thus the behavior of the MASE or MSE_{x=0} is perhaps not typical, we did
Figure 2: Performance by bandwidth h_1 of MASE (top) and MSE_{x=0} (bottom) in model m = m_2 + m_4, separately for m_2 (left), m_4 (right). Design is X ~ U[-3,3]^2.
Figure 3: Performance by bandwidth h_1 of MASE (top) and MSE_{x=0} (bottom) in model m = m_2 + m_3, separately for m_2 (left), m_3 (right). Design is X ~ N(0.0).
Figure 4: Performance by bandwidth h_1 of MASE (top) and MSE_{x=0} (bottom) in model m = m_2 + m_4, separately for m_2 (left), m_4 (right). Design is X ~ N(0.0).
Figure 5: Performance by bandwidth h_1 of MASE (top) and MSE_{x=0} (bottom) in model m = m_2 + m_3, separately for m_2 (left), m_3 (right). Design is X ~ N(0.4).
the same investigation with n = 100 observations for the uniform design on [-1.5, 1.5]^2. But, plotting the MASE and MSE_{x=0} functions on the same scale as we did for the U[-3,3]^2 design, we found that the general sensitivity is similar but certainly over a different range. Furthermore, the integration estimate improved a lot, since it had suffered more when data were sparse, as e.g. in [-3,3]^2. All in all, our observations above are confirmed when data are not too sparse.
0.205 0.049 0.028 0.619 0.252
U    N(0.0)    N(0.4)    N(0.8)
0.402 0.137 0.234 0.411 0.027 0.031 0.066 0.051 0.054 0.057 0.030 0.029 0.206 0.175 0.285 0.159 0.043 0.053
0.053 0.033 0.081 0.071 0.058 0.028 0.528 0.149 0.057 0.035 0.608 0.194
0.022 0.027 0.026 0.028 0.018 0.018 0.016 0.019 0.018 0.023 0.023 0.024 0.014 0.016 0.015 0.017 0.033 0.033 0.036 0.048 0.027 0.022 0.020 0.026 0.191 0.043 0.074 0.137 0.180 0.020 0.023 0.093 0.040 0.040 0.043 0.042 0.333 0.024 0.024 0.024 0.066 0.063 0.104 0.132 0.051 0.031 0.041 0.089
0.101 0.046 0.055 0.081 0.078 0.053 0.049 0.056 0.070 0.026 0.022 0.041 0.077 0.051 0.062 0.096 0.071 0.039 0.048 0.075 0.177 0.057 0.603 2.413 0.166 0.040 0.265 1.020 0.079 0.065 0.066 0.064 0.069 0.038 0.037 0.038 0.145 0.085 0.670 2.238 0.123 0.044 0.257 0.681
Table 3: MASE, using the local linear smoother, over all (upper) and over trimmed (lower) data. Here, "ba" stands for backfitting and "in" for marginal integration.
0.000 0.002 0.001 0.004 0.001 0.006 0.004 0.004 0.000 0.005 0.003 0.002 0.004 0.004 0.003 0.010 0.002 0.003 0.002 0.007 0.142 0.013 0.034 0.055 0.119 0.003 0.003 0.031 0.003 0.002 0.002 0.004
0.002 0.001 0.001 0.002
0.005 0.022 0.050 0.040 0.003 0.006 0.012 0.024
0.050 0.024 0.034 0.059 0.039 0.017 0.026 0.047 0.017 0.005 0.011 0.035 0.010 0.002 0.002 0.019 0.050 0.025 0.036 0.060 0.040 0.019 0.029 0.050 0.132 0.027 0.577 2.391 0.097 0.019 0.211 0.724 0.007 0.005 0.003 0.003 0.004 0.002 0.002 0.001 0.040 0.012 0.600 2.182 0.023 0.005 0.170 0.539
0.025 0.020 0.022 0.024
Distr
0.018 0.017 0.023 0.039 0.013 0.010 0.011 0.026
0.054 0.040 0.045 0.043 0.042 0.021 0.020 0.018 0.073 0.040 0.059 0.097 0.055 0.021 0.021 0.040 0.061 0.050 0.052 0.043 0.049 0.027 0.027 0.022 0.088 0.052 0.077 0.126 0.067 0.029 0.033 0.064
0.022 0.025 0.025 0.020 0.015 0.014 0.014 0.013 0.017 0.017 0.019 0.020 0.012 0.009 0.010 0.013 0.029 0.029 0.034 0.038 0.021 0.016 0.015 0.015 0.049 0.029 0.040 0.083 0.033 0.015 0.016 0.046 0.037 0.038 0.041 0.038 0.028 0.021 0.021 0.019 0.061 0.041 0.054 0.092 0.044 0.023 0.025 0.053
0.057 0.044 0.047 0.052 0.043 0.022 0.021 0.023 0.061 0.047 0.038 0.024 0.046 0.021 0.017 0.012 0.027 0.026 0.026 0.036 0.018 0.012 0.011 0.014 0.045 0.030 0.026 0.022 0.031 0.015 0.012 0.010 0.073 0.061 0.063 0.061 0.058 0.034 0.033 0.031 0.105 0.073 0.070 0.056 0.082 0.037 0.030 0.023
for the complete data set in the upper line, for the trimmed data in the lower line.
We found three main points:
The integration estimator suffers more from boundary effects.

For increasing correlation both estimators run into problems, but the integration estimator much more so. This is in line with the theory saying that the integration estimator is inefficient for correlated designs; see Linton (1997). He suggested an estimator for additive models constructed as a mixture of marginal integration and one-iteration backfitting, and proved that for correlated designs this procedure asymptotically dominates the integration method in its variance part.
We want to emphasize our statements by looking more closely at the behavior of squared bias, variance and MSE over the range.
The main difference from the results using the Nadaraya-Watson smoother is that the local linear smoother improves the integration estimator more than the backfitting estimator. The effects concerning the distribution of X and the model structure are, as we expected, quite similar. The following figures illustrate the behavior and the trade-off of variance and bias for both estimators in each additive direction. They are plotted on the range of the support. The boundaries of the data are cut off at a level of 5% on each side, since otherwise their effects would dominate the pictures.

Figures 7-11 clearly reinforce our observations and remarks concerning Tables 3-5. They show the variance, squared bias and MASE over the whole range of X_1 and X_2 respectively, i.e. approximately over (-3, 3) for U[-3,3]^2 and (-1.7, 1.7) for the normal design with covariance 0.8. The absolute values in the vertical direction are not of interest here; otherwise, see Tables 3-5.