Lecture Notes for Econometrics 2002 (first year PhD course in Stockholm)

Paul Söderlind

June 2002 (some typos corrected and some material added later)

University of St. Gallen. Address: s/bf-HSG, Rosenbergstrasse 52, CH-9000 St. Gallen, Switzerland. E-mail: Paul.Soderlind@unisg.ch. Document name: EcmAll.TeX.
Contents

1 Introduction
1.1 Means and Standard Deviation
1.2 Testing Sample Means
1.3 Covariance and Correlation
1.4 Least Squares
1.5 Maximum Likelihood
1.6 The Distribution of β̂
1.7 Diagnostic Tests
1.8 Testing Hypotheses about β̂
A Practical Matters
B A CLT in Action

2 Univariate Time Series Analysis
2.1 Theoretical Background to Time Series Processes
2.2 Estimation of Autocovariances
2.3 White Noise
2.4 Moving Average
2.5 Autoregression
2.6 ARMA Models
2.7 Non-stationary Processes

3 The Distribution of a Sample Average
3.1 Variance of a Sample Average
3.2 The Newey-West Estimator
3.3 Summary

4 Least Squares
4.1 Definition of the LS Estimator
4.2 LS and R²
4.3 Finite Sample Properties of LS
4.4 Consistency of LS
4.5 Asymptotic Normality of LS
4.6 Inference
4.7 Diagnostic Tests of Autocorrelation, Heteroskedasticity, and Normality

5 Instrumental Variable Method
5.1 Consistency of Least Squares or Not?
5.2 Reason 1 for IV: Measurement Errors
5.3 Reason 2 for IV: Simultaneous Equations Bias (and Inconsistency)
5.4 Definition of the IV Estimator—Consistency of IV
5.5 Hausman's Specification Test
5.6 Tests of Overidentifying Restrictions in 2SLS

6 Simulating the Finite Sample Properties
6.1 Monte Carlo Simulations in the Simplest Case
6.2 Monte Carlo Simulations in More Complicated Cases
6.3 Bootstrapping in the Simplest Case
6.4 Bootstrapping in More Complicated Cases

7 GMM
7.1 Method of Moments
7.2 Generalized Method of Moments
7.3 Moment Conditions in GMM
7.4 The Optimization Problem in GMM
7.5 Asymptotic Properties of GMM
7.6 Summary of GMM
7.7 Efficient GMM and Its Feasible Implementation
7.8 Testing in GMM
7.9 GMM with Sub-Optimal Weighting Matrix
7.10 GMM without a Loss Function
7.11 Simulated Moments Estimator

8 Examples and Applications of GMM
8.1 GMM and Classical Econometrics: Examples
8.2 Identification of Systems of Simultaneous Equations
8.3 Testing for Autocorrelation
8.4 Estimating and Testing a Normal Distribution
8.5 Testing the Implications of an RBC Model
8.6 IV on a System of Equations

12 Vector Autoregression (VAR)
12.1 Canonical Form
12.2 Moving Average Form and Stability
12.3 Estimation
12.4 Granger Causality
12.5 Forecasts and Forecast Error Variance
12.6 Forecast Error Variance Decompositions
12.7 Structural VARs
12.8 Cointegration and Identification via Long-Run Restrictions

12 Kalman Filter
12.1 Conditional Expectations in a Multivariate Normal Distribution
12.2 Kalman Recursions

13 Outliers and Robust Estimators
13.1 Influential Observations and Standardized Residuals
13.2 Recursive Residuals
13.3 Robust Estimation
13.4 Multicollinearity

14 Generalized Least Squares
14.1 Introduction
14.2 GLS as Maximum Likelihood
14.3 GLS as a Transformed LS
14.4 Feasible GLS

15 Nonparametric Regressions and Tests
15.1 Nonparametric Regressions
15.2 Estimating and Testing Distributions

21 Some Statistics
21.1 Distributions and Moment Generating Functions
21.2 Joint and Conditional Distributions and Moments
21.3 Convergence in Probability, Mean Square, and Distribution
21.4 Laws of Large Numbers and Central Limit Theorems
21.5 Stationarity
21.6 Martingales
21.7 Special Distributions
21.8 Inference

22 Some Facts about Matrices
22.1 Rank
22.2 Vector Norms
22.3 Systems of Linear Equations and Matrix Inverses
22.4 Complex Matrices
22.5 Eigenvalues and Eigenvectors
22.6 Special Forms of Matrices
22.7 Matrix Decompositions
22.8 Matrix Calculus
22.9 Miscellaneous
1 Introduction
1.1 Means and Standard Deviation

The mean and variance of a series are estimated as

x̄ = Σ_{t=1}^T x_t/T and σ̂² = Σ_{t=1}^T (x_t − x̄)²/T. (1.1)

(The estimates can also be calculated on a moving data window, a moving average, which is sometimes used in the analysis of financial prices.)
If x_t is iid (independently and identically distributed), then it is straightforward to find the variance of the sample average. Then, note that

Var(x̄) = Σ_{t=1}^T Var(x_t)/T² = Var(x_t)/T, (1.2)

where the first equality follows from independence (all covariance terms are zero) and the second from the assumption of identical distributions, which implies identical variances. The sample average is also unbiased,

E x̄ = Σ_{t=1}^T E x_t/T = E x_t, (1.3)

where the second equality follows from the assumption of identical distributions which implies identical expectations.

[Figure 1.1: Sampling distributions. This figure shows the distribution of the sample mean and of √T times the sample mean of the random variable z_t − 1, where z_t ∼ χ²(1), for T = 5, 25, 50, and 100.]
The law of large numbers (LLN) says that the sample mean converges to the true population mean as the sample size goes to infinity. This holds for a very large class of random variables, but there are exceptions. A sufficient (but not necessary) condition for this convergence is that the sample average is unbiased (as in (1.3)) and that the variance goes to zero as the sample size goes to infinity (as in (1.2)). (This is also called convergence in mean square.) To see the LLN in action, see Figure 1.1.

The central limit theorem (CLT) says that √T x̄ converges in distribution to a normal distribution as the sample size increases. See Figure 1.1 for an illustration. This also holds for a large class of random variables, and it is a very useful result since it allows us to test hypotheses. Most estimators (including LS and other methods) are effectively some kind of sample average, so the CLT can be applied.
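To see the LLN and CLT at work numerically, the following small simulation sketch (Python, not part of the original notes) mimics Figure 1.1: it draws many samples of z_t ∼ χ²(1) and summarizes the distribution of the sample mean of z_t − 1 and of √T times it. Note that Var(z_t) = 2 for a χ²(1) variable, so the theoretical variances are 2/T and 2, respectively.

import numpy as np

rng = np.random.default_rng(0)
n_sim = 10_000  # number of simulated samples (realizations)

for T in (5, 25, 50, 100):
    z = rng.chisquare(df=1, size=(n_sim, T))   # z_t ~ chi2(1), so E z_t = 1
    xbar = z.mean(axis=1) - 1                   # sample mean of z_t - 1
    scaled = np.sqrt(T) * xbar                  # sqrt(T) times the sample mean
    print(f"T={T:4d}  Var(xbar)={xbar.var():.4f} (theory {2/T:.4f})  "
          f"Var(sqrt(T)*xbar)={scaled.var():.4f} (theory 2.0)")

The variance of the sample mean shrinks towards zero (the LLN at work), while the variance of the scaled sample mean stays roughly constant and its distribution approaches a normal (the CLT at work).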
1.2 Testing Sample Means

The basic approach in testing a hypothesis (the "null hypothesis") is to compare the test statistic (the sample average, say) with how the distribution of that statistic (which is a random variable since the sample is finite) would look if the null hypothesis were true. For instance, suppose the null hypothesis is that the population mean is μ. Suppose also that we know that the distribution of the sample mean is normal with a known variance h² (which will typically be estimated and then treated as if it were known). Under the null hypothesis, the sample average should then be N(μ, h²). We would then reject the null hypothesis if the sample average is far out in one of the tails of the distribution. A traditional two-tailed test amounts to rejecting the null hypothesis at the 10% significance level if the test statistic is so far out that there is only 5% probability mass further out in that tail (and another 5% in the other tail). The interpretation is that if the null hypothesis is actually true, then there would only be a 10% chance of getting such an extreme (positive or negative) sample average—and these 10% are considered so low that we say that the null is probably wrong.
[Figure 1.2: Density function of a normal distribution, here N(0,2), with shaded 5% tails; for instance, Pr(y ≤ −2.33) = 0.05.]

See Figure 1.2 for some examples of normal distributions. Recall that in a normal distribution, the interval ±1 standard deviation around the mean contains 68% of the probability mass; ±1.65 standard deviations contain 90%; and ±2 standard deviations contain 95%.
In practice, the test of a sample mean is done by "standardizing" the sample mean so that it can be compared with a standard N(0,1) distribution. The logic of this is as follows. Under the null hypothesis, x̄ ∼ N(μ, h²), so (x̄ − μ)/h ∼ N(0,1), and a probability like Pr(x̄ ≤ C) = Pr[(x̄ − μ)/h ≤ (C − μ)/h] is then straightforward to calculate. To construct a two-tailed test, we also need the probability that x̄ is above some number. This number is chosen to make the two-tailed test symmetric, that is, so that there is as much probability mass below the lower number (lower tail) as above the upper number (upper tail). With a normal distribution (or, for that matter, any symmetric distribution) this is done as follows. Note that (x̄ − μ)/h ∼ N(0,1) is symmetric around 0. This means that the probability of being above some number, (C − μ)/h, must equal the probability of being below −1 times the same number, or

Pr[(x̄ − μ)/h ≥ (C − μ)/h] = Pr[(x̄ − μ)/h ≤ −(C − μ)/h].

The simplest way to get the probabilities needed is by looking at the normal cumulative distribution function—see Figure 1.2.
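As an illustration of the standardization, the sketch below (Python; the helper name test_sample_mean and the simulated data are made up for illustration) computes the standardized statistic and the two-sided p-value from the normal cdf.

import numpy as np
from scipy.stats import norm

def test_sample_mean(x, mu0=0.0):
    """Two-sided test of H0: population mean = mu0, using the normal approximation."""
    x = np.asarray(x, dtype=float)
    T = x.size
    xbar = x.mean()
    h = x.std(ddof=1) / np.sqrt(T)        # estimated Std of the sample mean
    t_stat = (xbar - mu0) / h             # approximately N(0,1) under H0
    p_value = 2 * norm.cdf(-abs(t_stat))  # two-sided p-value
    return t_stat, p_value

rng = np.random.default_rng(0)
t_stat, p = test_sample_mean(rng.normal(loc=0.2, scale=1.0, size=100), mu0=0.0)
print(f"t = {t_stat:.2f}, p = {p:.3f}")   # reject at the 10% level if |t| > 1.65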
1.3 Covariance and Correlation

The covariance of two variables (here x_t and z_t) is typically estimated as

Ĉov(x_t, z_t) = Σ_{t=1}^T (x_t − x̄)(z_t − z̄)/T.

Note that this is a kind of sample average, so a CLT can be used. The correlation of two variables is then estimated as

Ĉorr(x_t, z_t) = Ĉov(x_t, z_t) / [Ŝtd(x_t) Ŝtd(z_t)],

where Ŝtd(x_t) is an estimated standard deviation. A correlation must be between −1 and 1 (try to show it). Note that covariance and correlation measure the degree of linear relation only. This is illustrated in Figure 1.4.

[Figure 1.3: Power of two-sided test. Pdf of the t-statistic when the true β = 0.51; rejection regions t ≤ −1.65 and t > 1.65 (10% critical values), each with probability 0.05.]
The pth autocovariance of x is estimated by

Ĉov(x_t, x_{t−p}) = Σ_{t=1}^T (x_t − x̄)(x_{t−p} − x̄)/T, (1.9)

where we use the same estimated mean (using all data) in both places. Similarly, the pth autocorrelation is estimated as

Ĉorr(x_t, x_{t−p}) = Ĉov(x_t, x_{t−p}) / Ŝtd(x_t)². (1.10)
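A minimal sketch of (1.9)-(1.10) in Python (the helper autocorr is hypothetical, not from the notes); it uses the full-sample mean in both places and divides by T, as in the formulas above.

import numpy as np

def autocorr(x, p):
    """Estimate the p-th autocorrelation as in (1.9)-(1.10)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    if p == 0:
        return 1.0
    cov_p = ((x[p:] - xbar) * (x[:-p] - xbar)).sum() / x.size   # divide by T, not T - p
    return cov_p / x.var()

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print([round(autocorr(x, p), 3) for p in (1, 2, 3)])   # close to zero for white noise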
1.4 Least Squares

Consider the simple regression

y_t = x_t β₀ + u_t, (1.11)

where all variables are zero mean scalars and where β₀ is the true value of the parameter we want to estimate. The task is to use a sample {y_t, x_t}, t = 1, ..., T, to estimate β and to test hypotheses about its value, for instance that β = 0.

If there were no movements in the unobserved errors, u_t, in (1.11), then any sample would provide us with a perfect estimate of β. With errors, any estimate of β will still leave us with some uncertainty about what the true value is. The two perhaps most important issues in econometrics are how to construct a good estimator of β and how to assess the uncertainty about the true value.
For any possible estimate, β̂, we get a fitted residual

û_t = y_t − x_t β̂. (1.12)

One appealing method of choosing β̂ is to minimize the part of the movements in y_t that we cannot explain by x_t β̂, that is, to minimize the movements in û_t. There are several candidates for how to measure the "movements," but the most common is by the mean of squared errors, that is, Σ_{t=1}^T û_t²/T. We will later look at estimators where we instead use Σ_{t=1}^T |û_t|/T.

With the sum or mean of squared errors as the loss function, the optimization problem is

min_β (1/T) Σ_{t=1}^T (y_t − x_t β)², (1.13)

which has the first order condition that the derivative should be zero at the optimal estimate β̂,

(1/T) Σ_{t=1}^T x_t (y_t − x_t β̂) = 0,

which we can solve for β̂ as

β̂ = [(1/T) Σ_{t=1}^T x_t²]⁻¹ (1/T) Σ_{t=1}^T x_t y_t. (1.15)
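The LS formula (1.15) is easy to compute directly; a small Python sketch (the helper ols_simple and the simulated data are illustrative assumptions, not part of the notes):

import numpy as np

def ols_simple(y, x):
    """LS estimate of beta in y_t = x_t*beta + u_t (zero-mean scalar variables), as in (1.15)."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    beta_hat = (x @ y) / (x @ x)          # [sum x_t^2]^{-1} sum x_t y_t
    residuals = y - x * beta_hat
    return beta_hat, residuals

# small simulated check: true beta0 = 0.9
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.9 * x + rng.normal(size=500)
beta_hat, u_hat = ols_simple(y, x)
print(f"beta_hat = {beta_hat:.3f}")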
1.5 Maximum Likelihood

Suppose the errors u_t in (1.11) are iid N(0, σ²). Since the errors are independent, we get the joint pdf of u₁, u₂, ..., u_T by multiplying the marginal pdfs of each of the errors. Then substitute y_t − x_t β for u_t (the derivative of the transformation is unity) and take logs to get the log likelihood function of the sample,

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/2) Σ_{t=1}^T (y_t − x_t β)²/σ².

This likelihood function is maximized by minimizing the last term, which is proportional to the sum of squared errors, just like in (1.13): LS is ML when the errors are iid normally distributed.

Maximum likelihood estimators have very nice properties, provided the basic distributional assumptions are correct. If they are, then MLE are typically the most efficient/precise estimators, at least asymptotically. ML also provides a coherent framework for testing hypotheses (including the Wald, LM, and LR tests).
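To illustrate that LS is ML under iid normal errors, the sketch below (Python; a generic numerical optimizer and made-up data, not the notes' own implementation) maximizes the log likelihood above and compares the result with the LS formula (1.15).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.9 * x + rng.normal(size=200)        # true beta0 = 0.9, sigma = 1

def neg_loglik(params, y, x):
    beta, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)         # parameterize sigma > 0 via its log
    u = y - x * beta
    T = y.size
    return 0.5 * T * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(u**2) / sigma2

res = minimize(neg_loglik, x0=[0.0, 0.0], args=(y, x))
beta_ml = res.x[0]
beta_ls = (x @ y) / (x @ x)
print(f"ML beta = {beta_ml:.4f}, LS beta = {beta_ls:.4f}")   # the two should coincide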
1.6 The Distribution of β̂
Equation (1.15) will give different values of β̂ when we use different samples, that is, different draws of the random variables u_t, x_t, and y_t. Since the true value, β₀, is a fixed constant, this distribution describes the uncertainty we should have about the true value after having obtained a specific estimated value.

To understand the distribution of β̂, use (1.11) in (1.15) to substitute for y_t,

β̂ = [(1/T) Σ_{t=1}^T x_t²]⁻¹ (1/T) Σ_{t=1}^T x_t (x_t β₀ + u_t)
  = β₀ + [(1/T) Σ_{t=1}^T x_t²]⁻¹ (1/T) Σ_{t=1}^T x_t u_t, (1.19)

where β₀ is the true value.
The first conclusion from (1.19) is that, with u_t = 0, the estimate would always be perfect, and with large movements in u_t we will see large movements in β̂. The second conclusion is that not even a strong opinion about the distribution of u_t, for instance that u_t is iid N(0, σ²), is enough to tell us the whole story about the distribution of β̂. The reason is that deviations of β̂ from β₀ are a function of x_t u_t, not just of u_t. Of course, when x_t are a set of deterministic variables which will always be the same irrespective of which sample we use, then β̂ − β₀ is a time invariant linear function of u_t, so the distribution of u_t carries over to the distribution of β̂. This is probably an unrealistic case, which forces us to look elsewhere to understand the properties of β̂.
There are two main routes to learn more about the distribution of β̂: (i) set up a small "experiment" in the computer and simulate the distribution, or (ii) use the asymptotic distribution as an approximation. The asymptotic distribution can often be derived, in contrast to the exact distribution in a sample of a given size. If the actual sample is large, then the asymptotic distribution may be a good approximation.

A law of large numbers would (in most cases) say that both Σ_{t=1}^T x_t²/T and Σ_{t=1}^T x_t u_t/T in (1.19) converge to their expected values as T → ∞. The reason is that both are sample averages of random variables (clearly, both x_t² and x_t u_t are random variables). These expected values are Var(x_t) and Cov(x_t, u_t), respectively (recall both x_t and u_t have zero means). The key to showing that β̂ is consistent, that is, has a probability limit equal to β₀, is that Cov(x_t, u_t) = 0. This highlights the importance of using good theory to derive not only the systematic part of (1.11), but also in understanding the properties of the errors. For instance, when theory tells us that y_t and x_t affect each other (as prices and quantities typically do), then the errors are likely to be correlated with the regressors, and LS is inconsistent. One common way to get around that is to use an instrumental variables technique. More about that later. Consistency is a feature we want from most estimators, since it says that we would at least get it right if we had enough data.
Suppose that β̂ is consistent. Can we say anything more about the asymptotic distribution? Well, the distribution of β̂ converges to a spike with all the mass at β₀, but the distribution of √T β̂, or √T (β̂ − β₀), will typically converge to a non-trivial normal distribution. To see why, note from (1.19) that we can write

√T (β̂ − β₀) = [(1/T) Σ_{t=1}^T x_t²]⁻¹ (√T/T) Σ_{t=1}^T x_t u_t.

The first term on the right hand side will typically converge to the inverse of Var(x_t), as discussed earlier. The second term is √T times a sample average (of the random variable x_t u_t) with a zero expected value, since we assumed that β̂ is consistent. Under weak conditions, a central limit theorem applies so √T times a sample average converges to a normal distribution. This shows that √T β̂ has an asymptotic normal distribution. It turns out that this is a property of many estimators, basically because most estimators are some kind of sample average. For an example of a central limit theorem in action, see Appendix B.
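Route (i) above, a small computer "experiment," might look as follows (a Python sketch with made-up data; in this design the asymptotic variance is σ²/Var(x_t) = 1).

import numpy as np

def simulate_beta_hat(T, beta0=0.9, n_sim=5000, seed=0):
    """Simulate the sampling distribution of sqrt(T)*(beta_hat - beta0) in y_t = x_t*beta0 + u_t."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_sim, T))
    u = rng.normal(size=(n_sim, T))
    y = beta0 * x + u
    beta_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)   # LS estimate, as in (1.15)
    return np.sqrt(T) * (beta_hat - beta0)

for T in (25, 100, 400):
    z = simulate_beta_hat(T)
    print(f"T={T:4d}  mean={z.mean():+.3f}  var={z.var():.3f}")   # variance close to 1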
1.7 Diagnostic Tests

Exactly what the variance of √T (β̂ − β₀) is, and how it should be estimated, depends mostly on the properties of the errors. This is one of the main reasons for diagnostic tests. The most common tests are for homoskedastic errors (equal variances of u_t and u_{t−s}) and no autocorrelation (no correlation of u_t and u_{t−s}).

When ML is used, it is common to investigate if the fitted errors satisfy the basic assumptions, for instance, of normality.

1.8 Testing Hypotheses about β̂

Suppose we now assume that the asymptotic distribution of β̂ is such that

√T β̂ →d N(0, v²).

[Figure 1.5: Probability density functions.]
The natural interpretation of a really large test statistic, |√T β̂/v| = 3 say, is that it is very unlikely that this sample could have been drawn from a distribution where the hypothesis β₀ = 0 is true. We therefore choose to reject the hypothesis. We also hope that the decision rule we use will indeed make us reject false hypotheses more often than we reject true hypotheses. For instance, we want the decision rule discussed above to reject β₀ = 0 more often when β₀ = 1 than when β₀ = 0.

There is clearly nothing sacred about the 5% significance level. It is just a matter of convention that the 5% and 10% are the most widely used. However, it is not uncommon to use the 1% or the 20%. Clearly, the lower the significance level, the harder it is to reject a null hypothesis. At the 1% level it often turns out that almost no reasonable hypothesis can be rejected.

The t-test described above works only if the null hypothesis contains a single restriction. We have to use another approach whenever we want to test several restrictions jointly. The perhaps most common approach is a Wald test. To illustrate the idea, suppose β is an m × 1 vector and that √T β̂ →d N(0, V) under the null hypothesis, where V is a covariance matrix. We then know that

T β̂' V⁻¹ β̂ →d χ²(m).
A Practical Matters

A.0.1 Software

Gauss, MatLab, RATS, Eviews, Stata, PC-Give, Micro-Fit, TSP, SAS.

Software reviews in The Economic Journal and Journal of Applied Econometrics.

A.0.2 Useful Econometrics Literature
1 Greene (2000), Econometric Analysis (general)
2 Hayashi (2000), Econometrics (general)
3 Johnston and DiNardo (1997), Econometric Methods (general, fairly easy)
4 Pindyck and Rubinfeld (1998), Econometric Models and Economic Forecasts (general, easy)
5 Verbeek (2004), A Guide to Modern Econometrics (general, easy, good applications)
6 Davidson and MacKinnon (1993), Estimation and Inference in Econometrics (general, a bit advanced)
7 Ruud (2000), Introduction to Classical Econometric Theory (general, consistent projection approach, careful)
8 Davidson (2000), Econometric Theory (econometrics/time series, LSE approach)
9 Mittelhammer, Judge, and Miller (2000), Econometric Foundations (general, advanced)
10 Patterson (2000), An Introduction to Applied Econometrics (econometrics/time series, LSE approach with applications)
11 Judge et al. (1985), Theory and Practice of Econometrics (general, a bit old)
12 Hamilton (1994), Time Series Analysis
13 Spanos (1986), Statistical Foundations of Econometric Modelling, Cambridge University Press (general econometrics, LSE approach)
14 Harvey (1981), Time Series Models, Philip Allan
15 Harvey (1989), Forecasting, Structural Time Series (structural time series, Kalman filter)
16 Lütkepohl (1993), Introduction to Multiple Time Series Analysis (time series, VAR models)
17 Priestley (1981), Spectral Analysis and Time Series (advanced time series)
18 Amemiya (1985), Advanced Econometrics (asymptotic theory, non-linear econometrics)
19 Silverman (1986), Density Estimation for Statistics and Data Analysis (density estimation)
20 Härdle (1990), Applied Nonparametric Regression
B A CLT in Action

When z_t is iid χ²(1), then Σ_{t=1}^T z_t is distributed as a χ²(T) variable with pdf f_T(·). We now construct a new variable by transforming Σ_{t=1}^T z_t into a sample mean around one (the mean of z_t),

z̄₁ = Σ_{t=1}^T z_t/T − 1 = Σ_{t=1}^T (z_t − 1)/T.

Clearly, the inverse function is Σ_{t=1}^T z_t = T z̄₁ + T, so by the "change of variable" rule we get the pdf of z̄₁ as

g(z̄₁) = f_T(T z̄₁ + T) T.

Example B.3 Continuing the previous example, we now consider the random variable

z̄₂ = √T z̄₁,

with inverse function z̄₁ = z̄₂/√T. By applying the "change of variable" rule again, we get the pdf of z̄₂ as

h(z̄₂) = g(z̄₂/√T)/√T = f_T(√T z̄₂ + T) √T.
Example B.4 When z_t is iid χ²(1), then Σ_{t=1}^T z_t is χ²(T), which we denote f(Σ_{t=1}^T z_t). We now construct two new variables by transforming Σ_{t=1}^T z_t:

z̄₁ = Σ_{t=1}^T z_t/T − 1 = Σ_{t=1}^T (z_t − 1)/T, and
z̄₂ = √T z̄₁.

Example B.5 (Distribution of Σ_{t=1}^T (z_t − 1)/T and √T Σ_{t=1}^T (z_t − 1)/T.) We transform this distribution by first subtracting one from z_t (to remove the mean) and then by dividing by T or √T. This gives the distributions of the sample mean, z̄₁, and the scaled sample mean, z̄₂ = √T z̄₁, as

f(z̄₁) = [1/(2^{T/2} Γ(T/2))] y^{T/2−1} exp(−y/2) T, with y = T z̄₁ + T, and
f(z̄₂) = [1/(2^{T/2} Γ(T/2))] y^{T/2−1} exp(−y/2) √T, with y = √T z̄₂ + T.

These distributions are shown in Figure 1.1. It is clear that f(z̄₁) converges to a spike at zero as the sample size increases, while f(z̄₂) converges to a (non-trivial) normal distribution.
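A simulation check of the change-of-variable densities above (a Python sketch, not part of the notes): it compares the analytical density of z̄₂ with a crude density estimate from simulated draws.

import numpy as np
from scipy.stats import chi2

T, n_sim = 25, 200_000
rng = np.random.default_rng(0)
z = rng.chisquare(df=1, size=(n_sim, T))
zbar1 = z.mean(axis=1) - 1                    # sample mean around one (z-bar_1)
zbar2 = np.sqrt(T) * zbar1                    # scaled sample mean (z-bar_2)

# change-of-variable density of zbar2: f_T(sqrt(T)*x + T) * sqrt(T), with f_T the chi2(T) pdf
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    analytical = chi2.pdf(np.sqrt(T) * x + T, df=T) * np.sqrt(T)
    width = 0.2
    empirical = np.mean(np.abs(zbar2 - x) < width / 2) / width   # local density estimate
    print(f"x={x:+.1f}  analytical={analytical:.3f}  simulated={empirical:.3f}")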
Bibliography
Amemiya, T., 1985, Advanced econometrics, Harvard University Press, Cambridge, Massachusetts.

Davidson, J., 2000, Econometric theory, Blackwell Publishers, Oxford.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and inference in econometrics, Oxford University Press, Oxford.

Greene, W. H., 2000, Econometric analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.

Härdle, W., 1990, Applied nonparametric regression, Cambridge University Press, Cambridge.

Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter, Cambridge University Press.

Hayashi, F., 2000, Econometrics, Princeton University Press.

Johnston, J., and J. DiNardo, 1997, Econometric methods, McGraw-Hill, New York, 4th edn.

Lütkepohl, H., 1993, Introduction to multiple time series, Springer-Verlag, 2nd edn.

Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric foundations, Cambridge University Press, Cambridge.

Patterson, K., 2000, An introduction to applied econometrics: a time series approach, MacMillan Press, London.

Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Priestley, M. B., 1981, Spectral analysis and time series, Academic Press.

Ruud, P. A., 2000, An introduction to classical econometric theory, Oxford University Press.

Silverman, B. W., 1986, Density estimation for statistics and data analysis, Chapman and Hall, London.

Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.
2 Univariate Time Series Analysis
Reference: Greene (2000) 13.1-3 and 18.1-3
Additional references: Hayashi (2000) 6.2-4; Verbeek (2004) 8-9; Hamilton (1994); Johnston and DiNardo (1997) 7; and Pindyck and Rubinfeld (1998) 16-18.
2.1 Theoretical Background to Time Series Processes

Suppose we have a sample of T observations of a random variable

{y_t^i} for t = 1, ..., T, that is, {y_1^i, y_2^i, ..., y_T^i},

where subscripts indicate time periods. The superscripts indicate that this sample is from planet (realization) i. We could imagine a continuum of parallel planets where the same time series process has generated different samples with T different numbers (different realizations).

Consider period t. The distribution of y_t across the (infinite number of) planets has some density function, f_t(y_t). The mean of this distribution is

E y_t = ∫ y_t f_t(y_t) dy_t.

Now consider periods t and t−s jointly. On planet i we have the pair {y_{t−s}^i, y_t^i}. The bivariate distribution of these pairs, across the planets, has some density function g_{t−s,t}(y_{t−s}, y_t). (The relation between f_t(y_t) and g_{t−s,t}(y_{t−s}, y_t) is, as usual, f_t(y_t) = ∫ g_{t−s,t}(y_{t−s}, y_t) dy_{t−s}.) Calculate the covariance between y_{t−s} and y_t as usual,

Cov(y_{t−s}, y_t) = E (y_{t−s} − E y_{t−s})(y_t − E y_t).

This is the sth autocovariance of y_t. (Of course, s = 0 or s < 0 are allowed.)
A stochastic process is covariance stationary if

E y_t = μ is independent of t, (2.4)
Cov(y_{t−s}, y_t) = γ_s depends only on s, and (2.5)
both μ and γ_s are finite. (2.6)

The process is ergodic for the mean if the time average of a single realization converges to the unconditional mean, (1/T) Σ_{t=1}^T y_t →p E y_t, and a sufficient condition for this is

Σ_{s=0}^∞ |Cov(y_{t−s}, y_t)| < ∞. (2.8)

This means that the link between the values in t and t−s goes to zero sufficiently fast as s increases (you may think of this as getting independent observations before we reach the limit). If y_t is normally distributed, then (2.8) is also sufficient for the process to be ergodic for all moments, not just the mean. Figure 2.1 illustrates how a longer and longer sample (of one realization of the same time series process) gets closer and closer to the unconditional distribution as the sample gets longer.

[Figure 2.1: Sample of one realization of y_t = 0.85 y_{t−1} + ε_t with y₀ = 4 and Std(ε_t) = 1; the panels show the sample mean and standard deviation as functions of sample length.]

Let y_t be a vector of a covariance stationary and ergodic process. The sth covariance matrix is

R(s) = E y_t y_{t−s}' (assuming E y_t = 0).

Note that R(s) does not have to be symmetric unless s = 0. However, note that R(s) = R(−s)'. This follows from noting that

R(−s) = E y_t y_{t+s}' = E y_{t−s} y_t' = (E y_t y_{t−s}')' = R(s)',

where the second equality uses covariance stationarity.
Example 2.1 (Bivariate case.) Let y_t = [x_t, z_t]' with E x_t = E z_t = 0. Then

R(s) = [ Cov(x_{t−s}, x_t)   Cov(z_{t−s}, x_t)
         Cov(x_{t−s}, z_t)   Cov(z_{t−s}, z_t) ].

Note that R(−s) is

R(−s) = [ Cov(x_t, x_{t+s})   Cov(x_t, z_{t+s})
          Cov(z_t, x_{t+s})   Cov(z_t, z_{t+s}) ]
      = [ Cov(x_{t−s}, x_t)   Cov(x_{t−s}, z_t)
          Cov(z_{t−s}, x_t)   Cov(z_{t−s}, z_t) ],

which is indeed the transpose of R(s).
2.2 Estimation of Autocovariances

The autocovariances of the (vector) y_t process can be estimated as

R̂(s) = (1/T) Σ_{t=s+1}^T (y_t − ȳ)(y_{t−s} − ȳ)', with ȳ = (1/T) Σ_{t=1}^T y_t.
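A small Python sketch of this estimator (the helper autocov_matrix and the bivariate example are illustrative assumptions):

import numpy as np

def autocov_matrix(y, s):
    """Sample analogue of R(s) = E[(y_t - mean)(y_{t-s} - mean)'] for a T x n array y."""
    y = np.asarray(y, dtype=float)
    T = y.shape[0]
    dev = y - y.mean(axis=0)
    return dev[s:].T @ dev[:T - s] / T     # (1/T) sum_{t=s+1}^{T} dev_t dev_{t-s}'

# bivariate example: x_t is white noise, z_t = x_{t-1} + noise, so R(1) picks up the lead-lag link
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
z = np.r_[0.0, x[:-1]] + 0.5 * rng.normal(size=1000)
y = np.column_stack([x, z])
print(np.round(autocov_matrix(y, 1), 2))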
2.3 White Noise
A white noise time process has

E ε_t = 0,
Var(ε_t) = σ², and
Cov(ε_{t−s}, ε_t) = 0 if s ≠ 0.

If, in addition, ε_t is normally distributed, then it is said to be Gaussian white noise. The conditions in (2.4)-(2.6) are satisfied, so this process is covariance stationary. Moreover, (2.8) is also satisfied, so the process is ergodic for the mean (and for all moments if ε_t is normally distributed).
2.4 Moving Average

A qth-order moving average process is

y_t = ε_t + θ₁ ε_{t−1} + ... + θ_q ε_{t−q}, (2.16)

where the innovation ε_t is white noise (usually Gaussian). We could also allow both y_t and ε_t to be vectors; such a process is called a vector MA (VMA).

Example 2.2 The mean of an MA(1), y_t = ε_t + θ₁ ε_{t−1}, is zero since the mean of ε_t (and ε_{t−1}) is zero. The first three autocovariances are

Var(y_t) = E(ε_t + θ₁ ε_{t−1})(ε_t + θ₁ ε_{t−1}) = σ²(1 + θ₁²),
Cov(y_{t−1}, y_t) = E(ε_{t−1} + θ₁ ε_{t−2})(ε_t + θ₁ ε_{t−1}) = σ² θ₁,
Cov(y_{t−2}, y_t) = E(ε_{t−2} + θ₁ ε_{t−3})(ε_t + θ₁ ε_{t−1}) = 0, (2.18)
and Cov(y_{t−s}, y_t) = 0 for |s| ≥ 2. Since both the mean and the covariances are finite and constant across t, the MA(1) is covariance stationary. Since the absolute values of the covariances sum to a finite number, the MA(1) is also ergodic for the mean. The first autocorrelation of an MA(1) is

Corr(y_{t−1}, y_t) = θ₁/(1 + θ₁²).

Since the white noise process is covariance stationary, and since an MA(q) with q < ∞ is a finite order linear function of ε_t, it must be the case that the MA(q) is covariance stationary. It is ergodic for the mean since Cov(y_{t−s}, y_t) = 0 for s > q, so (2.8) is satisfied. As usual, Gaussian innovations are then sufficient for the MA(q) to be ergodic for all moments.
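The MA(1) autocorrelation formula can be verified by simulation; a short Python sketch (parameter values are made up for illustration):

import numpy as np

theta1, sigma, T = 0.8, 1.0, 100_000
rng = np.random.default_rng(0)
eps = sigma * rng.normal(size=T + 1)
y = eps[1:] + theta1 * eps[:-1]               # MA(1): y_t = eps_t + theta1*eps_{t-1}

rho1_sample = np.corrcoef(y[1:], y[:-1])[0, 1]
rho1_theory = theta1 / (1 + theta1**2)
print(f"sample rho(1) = {rho1_sample:.3f}, theory = {rho1_theory:.3f}")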
The effect of ε_t on y_t, y_{t+1}, ..., that is, the impulse response function, is the same as the MA coefficients

y_t       = ε_t + θ₁ ε_{t−1} + ... + θ_q ε_{t−q}
y_{t+1}   = ε_{t+1} + θ₁ ε_t + ... + θ_q ε_{t−q+1}
...
y_{t+q}   = ε_{t+q} + θ₁ ε_{t+q−1} + ... + θ_q ε_t
y_{t+q+1} = ε_{t+q+1} + θ₁ ε_{t+q} + ... + θ_q ε_{t+1}.

The expected value of y_t, conditional on {ε_w}_{w=−∞}^{t−s}, is

E(y_t | ε_{t−s}, ε_{t−s−1}, ...) = θ_s ε_{t−s} + θ_{s+1} ε_{t−s−1} + ... + θ_q ε_{t−q},

since the innovations dated after t−s have zero conditional expectation.
The forecasts made in t = 2 then have the following expressions—with an example using θ₁ = 2, ε₁ = 3/4 and ε₂ = 1/2 in the second column:

E₂ y₃ = θ₁ ε₂ = 1,
E₂ y₄ = 0.

More generally, the conditional distribution of an MA(q) is

y_t | {ε_{t−s}, ε_{t−s−1}, ...} ∼ N[E_{t−s} y_t, Var(y_t − E_{t−s} y_t)] (2.22)
                                = N[θ_s ε_{t−s} + ... + θ_q ε_{t−q}, σ²(1 + θ₁² + ... + θ_{s−1}²)]. (2.23)

The conditional mean is the point forecast and the variance is the variance of the forecast error. Note that if s > q, then the conditional distribution coincides with the unconditional distribution, since ε_{t−s} for s > q is of no help in forecasting y_t.

Example 2.5 (MA(1) and convergence from conditional to unconditional distribution.) From Examples 2.3 and 2.4 we see that the conditional distributions change according to (where Ω₂ indicates the information set in t = 2)

y₃ | Ω₂ ∼ N(E₂ y₃, Var(y₃ − E₂ y₃)) = N(1, 1),
y₄ | Ω₂ ∼ N(E₂ y₄, Var(y₄ − E₂ y₄)) = N(0, 5).

Note that the distribution of y₄ | Ω₂ coincides with the asymptotic distribution.
Estimation of MA processes is typically done by setting up the likelihood function and then using some numerical method to maximize it.

2.5 Autoregression

A pth-order autoregressive process is

y_t = a₁ y_{t−1} + a₂ y_{t−2} + ... + a_p y_{t−p} + ε_t,

where the innovation ε_t is white noise; in the vector (VAR) case the innovation is a vector, for instance [ε_{1t}, ε_{2t}]' in the bivariate case.

All stationary AR(p) processes can be written on MA(∞) form by repeated substitution. To do so we rewrite the AR(p) as a first order vector autoregression, VAR(1). For instance, an AR(2) x_t = a₁ x_{t−1} + a₂ x_{t−2} + ε_t can be written as

[ x_t     ]   [ a₁  a₂ ] [ x_{t−1} ]   [ ε_t ]
[ x_{t−1} ] = [ 1   0  ] [ x_{t−2} ] + [ 0   ],   or   y_t = A y_{t−1} + ε_t. (2.26)
Iterate backwards on (2.26)

y_t = A(A y_{t−2} + ε_{t−1}) + ε_t
    = A² y_{t−2} + A ε_{t−1} + ε_t
    ...
    = A^{K+1} y_{t−K−1} + Σ_{s=0}^K A^s ε_{t−s}.

Whether the first term vanishes as K → ∞ depends on the eigenvalues of A. Write the spectral decomposition as A = Z Λ Z⁻¹, where Λ is a diagonal matrix with the eigenvalues λ₁, λ₂, ..., λ_n along the diagonal and Z = [z₁ z₂ ... z_n] contains the corresponding eigenvectors. Then A^s = Z Λ^s Z⁻¹, which goes to zero as s → ∞ if all eigenvalues are less than one in modulus, so the process is stable in that case. Note that we therefore get

y_t = Σ_{s=0}^∞ A^s ε_{t−s}.
[Figure 2.2: Conditional moments and distributions for different forecast horizons for the AR(1) process y_t = 0.85 y_{t−1} + ε_t with y₀ = 4 and Std(ε_t) = 1.]
Example 2.9 (AR(1).) For the univariate AR(1) y_t = a y_{t−1} + ε_t, the characteristic equation is (a − λ) z = 0, which is only satisfied if the eigenvalue is λ = a. The AR(1) is therefore stable (and stationary) if −1 < a < 1. This can also be seen directly by noting that a^{K+1} y_{t−K−1} declines to zero as K increases if |a| < 1.

Similarly, most finite order MA processes can be written ("inverted") as AR(∞). It is therefore common to approximate MA processes with AR processes, especially since the latter are much easier to estimate.

Example 2.10 (Variance of AR(1).) From the MA-representation y_t = Σ_{s=0}^∞ a^s ε_{t−s} and the fact that ε_t is white noise we get Var(y_t) = σ² Σ_{s=0}^∞ a^{2s} = σ²/(1 − a²). Note that this is minimized at a = 0. The autocorrelations are obviously a^{|s|}. The covariance matrix of {y_t}, t = 1, ..., T, is therefore (standard deviation × standard deviation × autocorrelation)

[σ²/(1 − a²)] ×
[ 1        a        a²      ...  a^{T−1}
  a        1        a       ...  a^{T−2}
  a²       a        1       ...  a^{T−3}
  ...      ...      ...     ...  ...
  a^{T−1}  a^{T−2}  a^{T−3} ...  1      ].

Example 2.11 (Covariance stationarity of an AR(1) with |a| < 1.) From the MA-representation y_t = Σ_{s=0}^∞ a^s ε_{t−s}, the expected value of y_t is zero, since E ε_{t−s} = 0. We know that Cov(y_t, y_{t−s}) = a^{|s|} σ²/(1 − a²), which is constant and finite.
Example 2.12 (Ergodicity of a stationary AR(1).) We know that Cov(y_t, y_{t−s}) = a^{|s|} σ²/(1 − a²), so the absolute value is

|Cov(y_t, y_{t−s})| = |a|^{|s|} σ²/(1 − a²).

Using this in (2.8) gives

Σ_{s=0}^∞ |Cov(y_{t−s}, y_t)| = [σ²/(1 − a²)] Σ_{s=0}^∞ |a|^s = [σ²/(1 − a²)] / (1 − |a|),

which is finite if |a| < 1, so the process is ergodic for the mean.

For forecasting, repeated substitution in the AR(1) gives y_{t+s} = a^s y_t + Σ_{j=0}^{s−1} a^j ε_{t+s−j}, so the conditional mean and variance are E_t y_{t+s} = a^s y_t and Var_t(y_{t+s}) = σ²(1 + a² + ... + a^{2(s−1)}). The distribution of y_{t+s} conditional on y_t is normal with these parameters. See Figure 2.2 for an example.
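A sketch of the conditional moments behind Figure 2.2 (Python; the parameters match the figure caption, everything else is an illustrative assumption):

import numpy as np

a, sigma, y0 = 0.85, 1.0, 4.0
rng = np.random.default_rng(0)

# simulate many paths starting from y0 and compare with the analytical conditional moments
n_sim, horizon = 20_000, 10
y = np.full(n_sim, y0)
for s in range(1, horizon + 1):
    y = a * y + sigma * rng.normal(size=n_sim)
    mean_theory = a**s * y0
    var_theory = sigma**2 * (1 - a**(2 * s)) / (1 - a**2)   # = sigma^2*(1 + a^2 + ... + a^{2(s-1)})
    print(f"s={s:2d}  sim mean={y.mean():+.2f} (theory {mean_theory:+.2f})  "
          f"sim var={y.var():.2f} (theory {var_theory:.2f})")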
2.5.1 Estimation of an AR(1) Process
Suppose we have a sample {y_t}, t = 0, ..., T, of a process which we know is an AR(1), y_t = a y_{t−1} + ε_t, with normally distributed innovations with unknown variance σ².

Recall that the joint and conditional pdfs of some variables z and x are related as

pdf(x, z) = pdf(z | x) pdf(x).

Applying this repeatedly, and using y_t | y_{t−1} ∼ N(a y_{t−1}, σ²), the likelihood function conditional on y₀ is

pdf(y₁, ..., y_T | y₀) = (2πσ²)^{−T/2} exp{−[(y_T − a y_{T−1})² + ... + (y₂ − a y₁)² + (y₁ − a y₀)²]/(2σ²)}
                       = (2πσ²)^{−T/2} exp[−Σ_{t=1}^T (y_t − a y_{t−1})²/(2σ²)]. (2.33)

Taking logs, and evaluating the first order conditions for σ² and a, gives the usual OLS estimator. Note that this is MLE conditional on y₀. There is a corresponding exact MLE, but the difference is usually small (the asymptotic distributions of the two estimators are the same under stationarity; under non-stationarity OLS still gives consistent estimates). The MLE of Var(ε_t) is given by Σ_{t=1}^T v̂_t²/T, where v̂_t is the OLS residual.

These results carry over to any finite-order VAR. The MLE, conditional on the initial observations, of the VAR is the same as OLS estimates of each equation. The MLE of the ijth element in Cov(ε_t) is given by Σ_{t=1}^T v̂_{it} v̂_{jt}/T, where v̂_{it} and v̂_{jt} are the OLS residuals.

To get the exact MLE, we need to multiply (2.33) with the unconditional pdf of y₀ (since we have no information to condition on),

pdf(y₀) = [2πσ²/(1 − a²)]^{−1/2} exp{−y₀²/[2σ²/(1 − a²)]},

since y₀ ∼ N(0, σ²/(1 − a²)). The optimization problem is then non-linear and must be solved by a numerical optimization routine.
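A minimal sketch of the conditional MLE/OLS of an AR(1) (Python; the helper name ar1_ols and the simulated data are assumptions, not part of the notes):

import numpy as np

def ar1_ols(y):
    """Estimate y_t = a*y_{t-1} + eps_t by OLS (the MLE conditional on y_0)."""
    y = np.asarray(y, dtype=float)
    y_lag, y_cur = y[:-1], y[1:]
    a_hat = (y_lag @ y_cur) / (y_lag @ y_lag)
    resid = y_cur - a_hat * y_lag
    sigma2_hat = resid @ resid / resid.size        # MLE of Var(eps_t)
    return a_hat, sigma2_hat

# simulate an AR(1) with a = 0.85 and check the estimates
rng = np.random.default_rng(0)
T, a = 1000, 0.85
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = a * y[t - 1] + rng.normal()
print(ar1_ols(y))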
2.5.2 Lag Operators

A common and convenient way of dealing with leads and lags is the lag operator, L. It is such that

L^s y_t = y_{t−s} for all (integer) s.

For instance, the ARMA(2,1) model

y_t − a₁ y_{t−1} − a₂ y_{t−2} = ε_t + θ₁ ε_{t−1} (2.35)

can be written as

(1 − a₁L − a₂L²) y_t = (1 + θ₁L) ε_t, (2.36)

which is usually denoted

a(L) y_t = θ(L) ε_t.

2.5.3 Properties of LS Estimates of an AR(p) Process

Write the AR(p) as a regression, y_t = x_t'β + ε_t, where x_t = [y_{t−1} y_{t−2} ... y_{t−p}]' and β = [a₁ a₂ ... a_p]'. The LS estimator then satisfies

β̂_LS − β = [(1/T) Σ_{t=1}^T x_t x_t']⁻¹ (1/T) Σ_{t=1}^T x_t ε_t. (2.38)
i:The first term in (2.38) is the inverse of the sample estimate of covariance matrix of
xt (since Eyt D 0), which converges in probability to ˙xx1 (yt is stationary and ergodicfor all moments if "t is Gaussian) The last term, T1 PTt D1xt"t, is serially uncorrelated,
so we can apply a CLT Note that Ext"t"0tx0t DE"t"0tExtx0t D 2˙xx since ut and xt areindependent We therefore have
1pT
TX
t D1
Trang 35Combining these facts, we get the asymptotic distribution
p
Consistency follows from taking plim of (2.38)
plim OˇLS ˇD ˙xx1plim 1
T
TX
t D1
xt"t
D 0;
since xt and "t are uncorrelated
2.5.4 Autoregressions versus Autocorrelations
It is straightforward to see the relation between autocorrelations and the AR model when the AR model is the true process. This relation is given by the Yule-Walker equations. For an AR(1), the autoregression coefficient is simply the first autocorrelation coefficient. For an AR(2), y_t = a₁ y_{t−1} + a₂ y_{t−2} + ε_t, we have

[ Cov(y_t, y_t)     ]   [ Cov(y_t,     a₁y_{t−1} + a₂y_{t−2} + ε_t) ]
[ Cov(y_{t−1}, y_t) ] = [ Cov(y_{t−1}, a₁y_{t−1} + a₂y_{t−2} + ε_t) ]
[ Cov(y_{t−2}, y_t) ]   [ Cov(y_{t−2}, a₁y_{t−1} + a₂y_{t−2} + ε_t) ]

                        [ a₁Cov(y_t, y_{t−1}) + a₂Cov(y_t, y_{t−2}) + Cov(y_t, ε_t) ]
                      = [ a₁Cov(y_{t−1}, y_{t−1}) + a₂Cov(y_{t−1}, y_{t−2})          ]
                        [ a₁Cov(y_{t−2}, y_{t−1}) + a₂Cov(y_{t−2}, y_{t−2})          ], or

[ γ₀ ]   [ a₁γ₁ + a₂γ₂ + Var(ε_t) ]
[ γ₁ ] = [ a₁γ₀ + a₂γ₁            ]
[ γ₂ ]   [ a₁γ₁ + a₂γ₀            ].

Dividing the last two equations by γ₀ gives the autocorrelations,

[ ρ₁ ]   [ a₁ + a₂ρ₁ ]
[ ρ₂ ] = [ a₁ρ₁ + a₂ ],

from which we can solve for the autoregression coefficients. This demonstrates that testing that all the autocorrelations are zero is essentially the same as testing if all the autoregressive coefficients are zero. Note, however, that the transformation is non-linear, which may make a difference in small samples.
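A small Python sketch of the Yule-Walker mapping for the AR(2) (the helper functions are hypothetical, not from the notes): it computes the autocorrelations from the autoregressive coefficients and then inverts the mapping.

import numpy as np

def ar2_autocorr(a1, a2):
    """Autocorrelations rho1, rho2 of an AR(2) from the Yule-Walker equations."""
    rho1 = a1 / (1 - a2)              # from rho1 = a1 + a2*rho1
    rho2 = a1 * rho1 + a2
    return rho1, rho2

def ar2_from_autocorr(rho1, rho2):
    """Invert the mapping: solve [[1, rho1], [rho1, 1]] [a1, a2]' = [rho1, rho2]'."""
    A = np.array([[1.0, rho1], [rho1, 1.0]])
    return np.linalg.solve(A, np.array([rho1, rho2]))

rho1, rho2 = ar2_autocorr(0.5, 0.3)
print(rho1, rho2, ar2_from_autocorr(rho1, rho2))   # recovers (0.5, 0.3)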
2.6 ARMA Models

An ARMA model has both AR and MA components. For instance, an ARMA(p,q) is

y_t = a₁ y_{t−1} + a₂ y_{t−2} + ... + a_p y_{t−p} + ε_t + θ₁ ε_{t−1} + ... + θ_q ε_{t−q}. (2.43)

Estimation of ARMA processes is typically done by setting up the likelihood function and then using some numerical method to maximize it.

Even low-order ARMA models can be fairly flexible. For instance, the ARMA(1,1) model is

y_t = a y_{t−1} + ε_t + θ ε_{t−1}, where ε_t is white noise. (2.44)

The model can be written on MA(∞) form as

y_t = ε_t + Σ_{s=1}^∞ a^{s−1}(a + θ) ε_{t−s}.

2.7 Non-stationary Processes

2.7.1 Introduction

A trend-stationary process can be made stationary by subtracting a linear trend. The simplest example is

y_t = α + βt + ε_t, (2.48)

where ε_t is white noise.
A unit root process can be made stationary only by taking a difference. The simplest example is the random walk with drift

y_t = μ + y_{t−1} + ε_t, (2.49)

where ε_t is white noise. The name "unit root process" comes from the fact that the largest eigenvalue of the canonical form (the VAR(1) form of the AR(p)) is one. Such a process is said to be integrated of order one (often denoted I(1)) and can be made stationary by taking first differences.

Example 2.14 (Non-stationary AR(2).) The process y_t = 1.5 y_{t−1} − 0.5 y_{t−2} + ε_t can be written

[ y_t     ]   [ 1.5  −0.5 ] [ y_{t−1} ]   [ ε_t ]
[ y_{t−1} ] = [ 1     0   ] [ y_{t−2} ] + [ 0   ],

where the matrix has the eigenvalues 1 and 0.5 and is therefore non-stationary. Note that subtracting y_{t−1} from both sides gives y_t − y_{t−1} = 0.5 (y_{t−1} − y_{t−2}) + ε_t, so the variable x_t = y_t − y_{t−1} is stationary.

The distinguishing feature of unit root processes is that the effect of a shock never vanishes. This is most easily seen for the random walk. Substitute repeatedly in (2.49) to get

y_t = μ + (μ + y_{t−2} + ε_{t−1}) + ε_t
    ...
    = tμ + y₀ + Σ_{s=1}^t ε_s. (2.50)

The effect of ε_t never dies out: a non-zero value of ε_t gives a permanent shift of the level of y_t. This process is clearly non-stationary. A consequence of the permanent effect of a shock is that the variance of the conditional distribution grows without bound as the forecasting horizon is extended. For instance, for the random walk with drift, (2.50), the distribution conditional on the information in t = 0 is N(y₀ + tμ, tσ²) if the innovations are Gaussian. This means that the expected change is tμ and that the conditional variance grows linearly with the forecasting horizon. The unconditional variance is therefore infinite and the standard results on inference are not applicable.

In contrast, the conditional distribution from the trend stationary model, (2.48), is N(α + βt, σ²).

A process could have two unit roots (integrated of order 2: I(2)). In this case, we need to difference twice to make it stationary. Alternatively, a process can also be explosive, that is, have eigenvalues outside the unit circle. In this case, the impulse response function diverges.
Example 2.15 (Two unit roots.) Suppose y_t in Example 2.14 is actually the first difference of some other series, y_t = z_t − z_{t−1}. We then have

z_t − z_{t−1} = 1.5 (z_{t−1} − z_{t−2}) − 0.5 (z_{t−2} − z_{t−3}) + ε_t
z_t = 2.5 z_{t−1} − 2 z_{t−2} + 0.5 z_{t−3} + ε_t,

which is an AR(3) with the following canonical form

[ z_t     ]   [ 2.5  −2  0.5 ] [ z_{t−1} ]   [ ε_t ]
[ z_{t−1} ] = [ 1     0  0   ] [ z_{t−2} ] + [ 0   ]
[ z_{t−2} ]   [ 0     1  0   ] [ z_{t−3} ]   [ 0   ].

The matrix has the eigenvalues 1, 1, and 0.5, so z_t has two unit roots (as it should if it is I(2)).
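The eigenvalue calculations in Examples 2.14 and 2.15 can be checked numerically; a short Python sketch:

import numpy as np

# companion (canonical form) matrices from Examples 2.14 and 2.15
A_ar2 = np.array([[1.5, -0.5],
                  [1.0,  0.0]])
A_ar3 = np.array([[2.5, -2.0, 0.5],
                  [1.0,  0.0, 0.0],
                  [0.0,  1.0, 0.0]])

print(np.sort(np.abs(np.linalg.eigvals(A_ar2)))[::-1])   # [1.0, 0.5]      -> one unit root
print(np.sort(np.abs(np.linalg.eigvals(A_ar3)))[::-1])   # [1.0, 1.0, 0.5] -> two unit roots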
2.7.2 Spurious Regressions
Strong trends often cause problems in econometric models where y_t is regressed on x_t. In essence, if no trend is included in the regression, then x_t will appear to be significant, just because it is a proxy for a trend. The same holds for unit root processes, even if they have no deterministic trends. However, the innovations accumulate and the series therefore tend to be trending in small samples. A warning sign of a spurious regression is when R² > DW statistic.

For trend-stationary data, this problem is easily solved by detrending with a linear trend (before estimating or just adding a trend to the regression).

However, this is usually a poor method for unit root processes. What is needed is a first difference. For instance, a first difference of the random walk is

Δy_t = y_t − y_{t−1} = μ + ε_t,

which is white noise (any finite difference, like y_t − y_{t−s}, will give a stationary series), so we could proceed by applying standard econometric tools to Δy_t.

One may then be tempted to try first-differencing all non-stationary series, since it may be hard to tell if they are unit root processes or just trend-stationary. For instance, a first difference of the trend stationary process, (2.48), gives

Δy_t = β + ε_t − ε_{t−1}.

It is unclear if this is an improvement: the trend is gone, but the errors are now of MA(1) type (in fact, non-invertible, and therefore tricky, in particular for estimation).
2.7.3 Testing for a Unit Root I
Suppose we run an OLS regression of

y_t = a y_{t−1} + ε_t, (2.53)

where the true value of |a| < 1. The asymptotic distribution of the LS estimator is

√T (â − a) →d N(0, 1 − a²). (2.54)

(The variance follows from the standard OLS formula where the variance of the estimator is σ²(X'X/T)⁻¹. Here plim X'X/T = Var(y_t), which we know is σ²/(1 − a²).)

It is well known (but not easy to show) that when a = 1, then â is biased towards zero in small samples. In addition, the asymptotic distribution is no longer (2.54). In fact, there is a discontinuity in the limiting distribution as we move from a stationary to a non-stationary variable. This, together with the small sample bias, means that we have to use simulated critical values for testing the null hypothesis of a = 1 based on the OLS estimate from (2.53).

The approach is to calculate the test statistic

t = (â − 1)/Std(â),

and reject the null of non-stationarity if t is less than the critical values published by Dickey and Fuller (typically more negative than the standard values to compensate for the small sample bias) or from your own simulations.
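A sketch of this simulation approach in Python (the helper df_tstat, the sample size, and the number of simulations are illustrative assumptions; the critical value quoted in the comment is approximate):

import numpy as np

def df_tstat(y):
    """Dickey-Fuller t-statistic for H0: a = 1 in y_t = a*y_{t-1} + eps_t (no constant)."""
    y = np.asarray(y, dtype=float)
    y_lag, y_cur = y[:-1], y[1:]
    a_hat = (y_lag @ y_cur) / (y_lag @ y_lag)
    resid = y_cur - a_hat * y_lag
    s2 = resid @ resid / (resid.size - 1)
    std_a = np.sqrt(s2 / (y_lag @ y_lag))
    return (a_hat - 1.0) / std_a

# simulate critical values under the null of a random walk (a = 1)
rng = np.random.default_rng(0)
T, n_sim = 200, 5000
stats = np.array([df_tstat(np.cumsum(rng.normal(size=T + 1))) for _ in range(n_sim)])
print("simulated 5% critical value:", np.quantile(stats, 0.05))   # around -1.95, below the usual -1.65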
In principle, distinguishing between a stationary and a non-stationary series is very