CHAPTER 63
Independent Observations from the Same
Multivariate Population
This chapter discusses a model that is a special case of the model in Section 62.2, but it goes into more depth towards the end.
63.1 Notation and Basic Statistics

Notational conventions are not uniform among the different books about multivariate statistics. Johnson and Wichern arrange the data in an r × n matrix X. Each column is a separate independent observation of an r-vector with mean µ and dispersion matrix Σ. There are n observations.
We will choose an alternative notation, which is also found in the literature, and write the matrix as an n × r matrix Y. As before, each column represents a variable, and each row a (usually independent) observation.
Decompose Y into its row vectors as follows:
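With y_i⊤ denoting the ith row (the notation used in the remainder of the chapter), this presumably reads

Y = \begin{pmatrix} y_1^\top \\ y_2^\top \\ \vdots \\ y_n^\top \end{pmatrix},

so that each y_i is an r-vector with E[y_i] = µ and V[y_i] = Σ.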
Notation: the ith sample variance is called s_ii (not s_i², as one might perhaps expect).
The sample means indicate location, the sample standard deviations dispersion, and the sample correlation coefficients linear relationship.
How do we get these descriptive statistics from the data Y through a matrix manipulation? ȳ⊤ = (1/n)ι⊤Y; now Y − ιȳ⊤ = (I − ιι⊤/n)Y is the matrix of observations with the appropriate sample mean taken out of each element, therefore

(63.1.2)   S(n) = (1/n)(Y − ιȳ⊤)⊤(Y − ιȳ⊤) = (1/n)Y⊤(I − ιι⊤/n)Y = (1/n)W,

where W is the matrix of (corrected) sums of squares and cross products, and the sample correlation matrix R is obtained by rescaling S(n) with the sample standard deviations.
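Explicitly, writing D for the diagonal matrix containing the sample variances s_11, …, s_rr (the symbol D is our own notation, not defined in the surrounding text):

R = D^{-1/2} S_{(n)} D^{-1/2}, \qquad \text{i.e.}\qquad r_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}\, s_{jj}}}.

The factor 1/n cancels in this rescaling, so R is the same whether it is computed from S(n) or from W.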
In analogy to the formulas for variances and covariances of linear transformations of a vector, one has the following formula for sample variances and covariances of linear combinations Ya and Yb: est.cov[Ya, Yb] = a⊤S(n)b.
Problem 517. Show that E[ȳ] = µ and V[ȳ] = (1/n)Σ. (The latter identity can be shown in two ways: once using the Kronecker product of matrices, and once by partitioning Y into its rows.)
Answer. E[ȳ] = E[(1/n)Y⊤ι] = (1/n)(E[Y])⊤ι = (1/n)µι⊤ι = µ. Using Kronecker products, one obtains V[ȳ] from ȳ⊤ = (1/n)ι⊤Y.
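A sketch of the Kronecker-product argument, under the assumption that one stacks the rows of Y into vec(Y⊤), whose dispersion matrix is I_n ⊗ Σ:

\bar y = \tfrac{1}{n} Y^\top\iota = \tfrac{1}{n}(\iota^\top \otimes I_r)\operatorname{vec}(Y^\top), \qquad
V[\bar y] = \tfrac{1}{n^2}(\iota^\top \otimes I_r)(I_n \otimes \Sigma)(\iota \otimes I_r) = \tfrac{1}{n^2}(\iota^\top\iota)\otimes\Sigma = \tfrac{1}{n}\Sigma.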
The alternative way to do it is

(63.1.6)   V[ȳ] = E[(ȳ − µ)(ȳ − µ)⊤]
              = E[(1/n)∑_i (y_i − µ) · (1/n)∑_j (y_j − µ)⊤]
              = (1/n²)∑_i ∑_j E[(y_i − µ)(y_j − µ)⊤]
              = (1/n²)∑_i E[(y_i − µ)(y_i − µ)⊤] = (1/n²)·nΣ = (1/n)Σ,

since by independence E[(y_i − µ)(y_j − µ)⊤] = O for i ≠ j.
63.2 Two Geometries

One can view either the rows of Y as the points or the columns of Y as the points. Rows as points gives n points in r-dimensional space, the "scatterplot geometry." If r = 2, this is the scatter plot of the two variables against each other.
In this geometry, the sample mean is the center of balance or center of gravity. The dispersion of the observations around their mean defines a distance measure in this geometry.
The book introduces this distance by suggesting with its illustrations that the data are clustered in hyperellipsoids. The right way to introduce this distance would be to say: we are not only interested in the r coordinates separately but also in any linear combinations; then use our treatment of the Mahalanobis distance for a given population, and then transfer it to the empirical distribution given by the sample.
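To make this concrete, one possible formulation of the resulting sample distance (our own wording; S is the sample dispersion matrix used below) is

d^2(y) = (y - \bar y)^\top S^{-1} (y - \bar y),

the sample analogue of the population Mahalanobis distance (y − µ)⊤Σ⁻¹(y − µ); the hyperellipsoids mentioned above are the sets of points with constant d².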
In the other geometry, all observations of a given random variable form one point, here called "vector." I.e., the basic entities are the columns of Y. In this so-called "vector geometry," the projection of a column on the diagonal vector ι is its sample mean times ι, and the correlation coefficient is the cosine of the angle between the deviation vectors.
Generalized sample variance is defined as the determinant of S. Its geometric intuition: in the scatterplot geometry it is proportional to the square of the volume of the hyperellipsoids (see J&W, p. 103), and in the geometry in which the observations of each variable form a vector it is

(63.2.1)   det S = (n − 1)^{−r} (volume)²

where the volume is that spanned by the deviation vectors.
63.3 Assumption of Normality
A more general version of this section is 62.2.3.
Assume that the y_i, the row vectors of Y, are independent, and each is ∼ N(µ, Σ) with Σ positive definite. Then the density function of Y is

(63.3.1)   f_Y(Y) = ∏_{j=1}^{n} (2π)^{−r/2} (det Σ)^{−1/2} exp(−(1/2)(y_j − µ)⊤Σ^{−1}(y_j − µ))

(63.3.2)          = (2π)^{−nr/2} (det Σ)^{−n/2} exp(−(1/2)∑_j (y_j − µ)⊤Σ^{−1}(y_j − µ)).
The quadratic form in the exponent can be rewritten, and the first term of the result simplified.
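Presumably the two steps are the following (a sketch, using S(n) = (1/n)∑_j (y_j − ȳ)(y_j − ȳ)⊤):

\sum_j (y_j - \mu)^\top\Sigma^{-1}(y_j - \mu)
  = \sum_j (y_j - \bar y)^\top\Sigma^{-1}(y_j - \bar y) + n(\bar y - \mu)^\top\Sigma^{-1}(\bar y - \mu),

and the first term equals

\operatorname{tr}\Bigl(\Sigma^{-1}\sum_j (y_j - \bar y)(y_j - \bar y)^\top\Bigr) = n\operatorname{tr}(\Sigma^{-1} S_{(n)}).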
Now we compute the maximum likelihood estimators: the maximum over µ is attained at µ̂ = ȳ. This leaves the concentrated likelihood function

max_µ f_Y(Y) = (2π)^{−nr/2} (det Σ)^{−n/2} exp(−(n/2) tr(Σ^{−1} S(n))).
To obtain the maximum likelihood estimate of Σ one needs equation (A.8.21) in Theorem A.8.3 in the Appendix and (62.2.15).
If one sets A = S(n)^{1/2} Σ^{−1} S(n)^{1/2}, then tr A = tr(Σ^{−1}S(n)) and det A = (det Σ)^{−1} det S(n) in (62.2.15); therefore the concentrated likelihood function is maximized by Σ̂ = S(n).
63.4 EM-Algorithm for Missing Observations

Let's follow Johnson and Wichern's example on their p. 199, a small data matrix some of whose entries are missing.
It is not so important how one gets the initial estimates of µ and Σ: say µ̃⊤ = (6  1  4), and to get Σ̃ take deviations from the mean, putting zeros in for the missing values (which will of course underestimate the variances), and divide by the number of observations. (Since we are talking maximum likelihood, there is no adjustment for degrees of freedom.)
This gives

Σ̃ = [ 1/2  1/4  1
       1/4  1/2  3/4
       1    3/4  5/2 ].
The complete sufficient statistics are Y⊤ι and Y⊤Y; to update those we need predictions of the missing elements of Y, of their squares, and of their products with each other and with the observed elements of Y. Our method of predicting is to take conditional expectations, assuming µ̃ and Σ̃ are the true mean and dispersion matrix.
For the prediction of the upper lefthand corner element of Y, only the first row of Y is relevant. Partitioning this row into the unobserved element y_11 and the observed elements (y_12, y_13) = (0, 3), and partitioning µ̃ and Σ̃ conformably, gives

µ̃_1 = 6,   µ̃_2 = [1  4]⊤,   Σ̃_11 = 1/2,   Σ̃_12 = [1/4  1],   Σ̃_22 = [ 1/2  3/4
                                                                         3/4  5/2 ].
The conditional mean of y_1 is the best linear predictor
(63.4.4)   E[y_1 | y_2; µ̃, Σ̃] = y*_1 = µ̃_1 + Σ̃_12 Σ̃_22^{−1}(y_2 − µ̃_2)
or in our numerical example

y*_11 = 6 + [1/4  1] [ 1/2  3/4
                       3/4  5/2 ]^{−1} ([0  3]⊤ − [1  4]⊤) = 5.73.

The MSE-matrix of this prediction is

(63.4.6)   E[(y_1 − y*_1)(y_1 − y*_1)⊤ | y_2; µ̃, Σ̃] = E[(y_1 − y*_1)(y_1 − y*_1)⊤] = MSE[y*_1; y_1] = Σ̃_11 − Σ̃_12 Σ̃_22^{−1} Σ̃_21.

These two data are sufficient to compute E[y_1 y_1⊤ | y_2; µ̃, Σ̃]: from y_1 = (y_1 − y*_1) + y*_1, and since conditionally on y_2 the predictor y*_1 is a constant and the prediction error has zero expectation, it follows that E[y_1 y_1⊤ | y_2; µ̃, Σ̃] = MSE[y*_1; y_1] + y*_1 y*_1⊤.
In our numerical example this gives

(63.4.11)   E[y_11² | ···] = 1/2 − [1/4  1] [ 1/2  3/4
                                              3/4  5/2 ]^{−1} [1/4  1]⊤ + (5.73)² = 32.99
Answer. This is in Johnson and Wichern, p. 200.
Now switch back to the more usual notation, in which y_i is the ith row vector of Y and ȳ the vector of column means. Since S(n) = (1/n)Y⊤Y − ȳȳ⊤, the maximum likelihood estimates can be computed from the complete sufficient statistics Y⊤ι and Y⊤Y, and the conditional expectation of Y⊤Y = ∑_i y_i y_i⊤ is obtained by summing the conditional expectations of the y_i y_i⊤ computed above.
Of course, in a similar, much simpler fashion one obtains

(63.4.17)   E[ȳ | all observed values in Y; µ̃, Σ̃],

the vector of column means with the missing values replaced by their predictions.
In our numerical example, therefore, we obtain

(63.4.18)   E[Y⊤Y | all observed values in Y; µ̃, Σ̃] = [ 148.05   27.27  101.18
                                                           27.27    6.97   20.50
                                                          101.18   20.50   74.00 ].
The next step is to plug those estimated values of Y⊤ι and Y⊤Y into the likelihood function and get the maximum likelihood estimates of µ and Σ, in other words, set mean and dispersion matrix equal to the sample mean vector and sample dispersion matrix computed from these complete sufficient statistics:

(63.4.20)   µ̂ = (1/n) E[Y⊤ι | all observed values in Y; µ̃, Σ̃]

(63.4.21)   Σ̂ = (1/n) E[Y⊤Y | all observed values in Y; µ̃, Σ̃] − µ̂µ̂⊤,

then predict the missing observations anew.
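As an illustration of the iteration just described, here is a minimal numerical sketch in Python/NumPy (not from the text; the function name, array layout, and the fixed iteration count are our own choices). It alternates between predicting the missing entries by conditional expectations and recomputing µ̂ and Σ̂ from the completed sufficient statistics:

```python
import numpy as np

def em_mvn_missing(Y, n_iter=50):
    """EM for a multivariate normal sample with missing entries (NaN).

    Y is n x r; rows are independent observations.  Returns (mu, Sigma),
    the maximum likelihood estimates (dividing by n, no df adjustment).
    """
    Y = np.asarray(Y, dtype=float)
    n, r = Y.shape
    miss = np.isnan(Y)
    # initial estimates: column means of the observed values, and the dispersion
    # matrix computed with zeros substituted for the missing deviations
    mu = np.nanmean(Y, axis=0)
    dev = np.where(miss, 0.0, Y - mu)
    Sigma = dev.T @ dev / n
    for _ in range(n_iter):
        s1 = np.zeros(r)           # running sum of E[y_i | observed]
        s2 = np.zeros((r, r))      # running sum of E[y_i y_i' | observed]
        for i in range(n):
            m, o = miss[i], ~miss[i]
            yi = Y[i].copy()
            cond_var = np.zeros((r, r))
            if m.any():
                S22inv = np.linalg.inv(Sigma[np.ix_(o, o)])
                # conditional mean of the missing part given the observed part
                yi[m] = mu[m] + Sigma[np.ix_(m, o)] @ S22inv @ (Y[i, o] - mu[o])
                # conditional dispersion (the MSE-matrix of the prediction)
                cond_var[np.ix_(m, m)] = (Sigma[np.ix_(m, m)]
                    - Sigma[np.ix_(m, o)] @ S22inv @ Sigma[np.ix_(o, m)])
            s1 += yi
            s2 += np.outer(yi, yi) + cond_var
        # plug the completed sufficient statistics into the MLE formulas
        mu = s1 / n
        Sigma = s2 / n - np.outer(mu, mu)
    return mu, Sigma

# example data with missing entries marked NaN; the column means of the
# observed values are 6, 1, 4, matching the initial mu-tilde in the text
Y = np.array([[np.nan, 0.0, 3.0],
              [7.0, 2.0, 6.0],
              [5.0, 1.0, 2.0],
              [np.nan, np.nan, 5.0]])
print(em_mvn_missing(Y))
```

The two update lines at the end of the loop are exactly the M-step (63.4.20)–(63.4.21); the inner loop is the E-step based on (63.4.4) and (63.4.6).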
63.5 Wishart Distribution

The Wishart distribution is a multivariate generalization of the σ²χ². The noncentral Wishart is the distribution of Y⊤Y if Y is normally distributed as above. But we will be mainly interested in the central Wishart distribution W(n, Σ), the distribution of Z⊤Z when the n rows of Z are independent and each ∼ N(o, Σ); n is the number of degrees of freedom. The following theorem is exactly parallel to theorem 10.4.3.
Theorem: Let the rows of Z be independent, each ∼ N(o, Σ), and let P be symmetric. Then Z⊤PZ ∼ W(r, Σ), where r is the rank of P, if and only if P is idempotent.

Proof of sufficiency: If P² = P with rank r, an r × n matrix T exists with P = T⊤T and TT⊤ = I. Therefore Z⊤PZ = Z⊤T⊤TZ. Define X = TZ. Writing x_i for the column vectors of X, we know C[x_i, x_j] = σ_ij TT⊤ = σ_ij I. For the rows of X this means they are independent of each other and each of them ∼ N(o, Σ). Since there are r rows, the result follows.
Necessity: Take a vector c with c⊤Σc = 1. Then c⊤z_j ∼ N(0, 1) for each j, and c⊤z_j is independent of c⊤z_k for j ≠ k. Therefore Zc ∼ N(o, I). It follows also that TZc = Xc ∼ N(o, I) (the first vector having n and the second r components). Therefore c⊤Z⊤PZc is distributed as a χ², therefore we can use the necessity condition in theorem 10.4.3 to show that P is idempotent.
As an application it follows from (63.1.2) that nS(n) ∼ W(n − 1, Σ).
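To spell the application out (a brief sketch; the symbol Z below is our notation):

With Z = Y - \iota\mu^\top, whose rows are independent \sim N(o,\Sigma), and P = I - \iota\iota^\top/n, which is symmetric, idempotent, of rank n-1, and satisfies P\iota = o,

Z^\top P Z = Y^\top P Y = n S_{(n)},

so the theorem yields nS_{(n)} \sim W(n-1, \Sigma).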
One can also show the following generalization of Craig's theorem: if Z is as above, then Z⊤PZ is independent of Z⊤QZ if and only if PQ = O.
63.6 Sample Correlation Coefficients

What is the distribution of the sample correlation coefficients, and also of the various multiple and partial correlation coefficients in the above model? Suffice it to remark at this point that this is a notoriously difficult question. We will only look at one special case, which also illustrates the use of random orthogonal transformations. Look at the following scenario: our matrix Y has two columns only; write it as Y = [u  v]. The task is to compute the distribution of the sample correlation coefficient
(63.6.2)   r = ∑(u_i − ū)(v_i − v̄) / ( √(∑(u_i − ū)²) · √(∑(v_i − v̄)²) )

if the true ρ is zero.
We know that u ∼ N(o, σ_uu I). Under the null hypothesis, u is independent of v, therefore its distribution conditionally on v is the same as its unconditional distribution. Furthermore look at the matrix consisting of random elements
whose first row is (1/√n, …, 1/√n) and whose second row is ((v_1 − v̄), …, (v_n − v̄)) / √(∑(v_i − v̄)²); it can be supplemented by n − 2 further rows to form an orthogonal n × n matrix. In other words, conditionally on v, the following three variables are mutually independent and have
the following distributions:

(63.6.4)   w_1 = √n ū ∼ N(0, σ_uu)

(63.6.5)   w_2 = ∑(u_i − ū)(v_i − v̄) / √(∑(v_i − v̄)²) = r√s_uu ∼ N(0, σ_uu)

(63.6.6)   q = ∑u_i² − nū² − w_2² = (1 − r²)s_uu ∼ σ_uu χ²_{n−2}
Since the values of v do not enter any of these distributions, these are also the unconditional distributions. Therefore we can form a simple function of r which has a known distribution.
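Presumably the function meant is the usual t-statistic (a sketch of how its distribution follows from (63.6.4)–(63.6.6)):

t = \frac{w_2/\sqrt{\sigma_{uu}}}{\sqrt{q/(\sigma_{uu}(n-2))}}
  = \frac{r\sqrt{s_{uu}}}{\sqrt{(1-r^2)s_{uu}/(n-2)}}
  = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad\text{if }\rho = 0,

since w_2/\sqrt{\sigma_{uu}} \sim N(0,1) is independent of q/\sigma_{uu} \sim \chi^2_{n-2}.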
CHAPTER 64
Pooling of Cross Section and Time Series Data
We are given m cross-sectional units, each of which has been observed for t time periods. The dependent variable for cross-sectional unit i at time s is y_si. There are also k independent variables, and the value of the jth independent variable for cross-sectional unit i at time s is x_sij. I.e., instead of a vector, the dependent variable is a matrix, and instead of a matrix, the independent variables form a 3-way array. We will discuss three different models here which assign equal slope parameters to the different cross-sectional units but which differ in their treatment of the intercept.
64.1 OLS Model

The most restrictive model of the three assumes that all cross-sectional units have the same intercept µ. I.e.,

(64.1.1)   Y = µ ιι⊤ + [X_1β  ···  X_mβ] + E

where Y = [y_1 ··· y_m] is t × m, each of the X_i is t × k, the first ι is the t-vector of ones and the second ι the m-vector of ones, µ is the intercept and β the k-vector of slope coefficients, and E = [ε_1 ··· ε_m] is the matrix of disturbances. The notation [X_1β ··· X_mβ] represents a matrix obtained by the multiplication of a 3-way array with a vector. We assume vec E ∼ (o, σ²I).
If one vectorizes this one gets

vec(Y) = [ι; ι; …; ι]µ + [X_1; X_2; …; X_m]β + vec(E),

where the semicolons indicate vertical stacking; the stacked regressor matrix [X_1; …; X_m] is called Z.
Problem 520. 1 point. Show that vec([X_1β ··· X_mβ]) = Zβ with Z as just defined.
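A one-line sketch of the verification (not an answer given in the text): the jth column of [X_1β ··· X_mβ] is X_jβ, and vec stacks these columns, so

\operatorname{vec}\bigl([X_1\beta \;\cdots\; X_m\beta]\bigr)
 = \begin{pmatrix} X_1\beta \\ \vdots \\ X_m\beta \end{pmatrix}
 = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix}\beta = Z\beta.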
In tiles, the between model is obtained from (64.1.2) by attaching ι⊤/t:
64.3 Dummy Variable Model (Fixed Effects)

While maintaining the assumption that the cross sectional units have the same slope parameters, we are now allowing a different intercept for each unit. I.e., the model is

(64.3.3)   Y = ια⊤ + [X_1β ··· X_mβ] + E
where Y = [y_1 ··· y_m] is t × m, each of the X_i is t × k, ι is the t-vector of ones, α is the m-vector collecting all the intercept terms, β the k-vector of slope coefficients, and E = [ε_1 ··· ε_m] the matrix of disturbances. We assume vec E ∼ (o, σ²I).
Trang 27For estimation, it is convenient to vectorize (64.3.3) to get
(64.3.4)
y1
.
.. .
Trang 28[JHG+88] give a good example how such a model can arise: s is years, i is firms,
ysiis costs, and there is only one xsifor every firm (i.e k = 1), which is sales Thesefirms would have equal marginal costs but different fixed overhead charges
In principle (64.3.4) presents no estimation problems; it is OLS with lots of dummy variables (if there are lots of cross-sectional units). But often it is advantageous to use the following sequential procedure: (1) In order to get β̂, regress

(64.3.9)   [Dy_1; …; Dy_m] = [DX_1; …; DX_m]β + residuals

without a constant term (but if you leave the constant term in, this does not matter either; its coefficient will be exactly zero). Here D is the matrix which takes the mean out. I.e., take the mean out of every y individually and out of every X before running the regression. (2) Then you get each α̂_i by equation (64.3.12) below.
Trang 29Answer Equation ( 64.3.4 ) has the form of ( 30.0.1 ) Define D = I − ιι>/t and W = I − K(K>K) −1 K>= I ⊗ D According to ( 30.0.3 ) and ( 30.0.4 ), βˆand the vector of residuals can be obtained by regressing W vec( Y ) on W Z, and if one plugs this estimate βˆback into the formula, then one obtains an estimate of α.
Without using the Kronecker product, this procedure can be described as follows: one gets the right βˆif one estimates ( 64.3.3 ) premultiplied by D Since Dι = o, this premultiplication removes the first parameter vector α from the regression, so that only
Plugging β̂ back in, each α̂_i is obtained by regressing y_i − X_iβ̂ on ι. Since the regressor is the column of ones, one can write down the result immediately:
(64.3.12)   α̂_i = ȳ_i − x̄_i⊤β̂

where ȳ_i is the mean of y_i, and x̄_i⊤ is the row vector consisting of the column means of X_i.
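A minimal sketch of the two-step procedure in Python/NumPy (array names and shapes are our own assumptions: y has shape (t, m) and X has shape (t, m, k)):

```python
import numpy as np

def fixed_effects(y, X):
    """Dummy-variable (fixed effects) estimates by the sequential procedure.

    y: (t, m) array, column i is the time series of unit i.
    X: (t, m, k) array, X[:, i, :] are the regressors of unit i.
    Returns (alpha_hat, beta_hat).
    """
    t, m = y.shape
    k = X.shape[2]
    # step (1): take the mean out of every y_i and every X_i, then run one
    # big OLS regression without a constant term, cf. (64.3.9)
    Dy = y - y.mean(axis=0)                              # demeaned dependent variable
    DX = X - X.mean(axis=0)                              # demeaned regressors
    Dy_stacked = Dy.T.reshape(-1)                        # stack the m demeaned y_i
    DX_stacked = DX.transpose(1, 0, 2).reshape(-1, k)    # stack the m demeaned X_i
    beta_hat, *_ = np.linalg.lstsq(DX_stacked, Dy_stacked, rcond=None)
    # step (2): alpha_i_hat = ybar_i - xbar_i' beta_hat, cf. (64.3.12)
    alpha_hat = y.mean(axis=0) - X.mean(axis=0) @ beta_hat
    return alpha_hat, beta_hat
```

The demeaning implements the premultiplication by D = I − ιι⊤/t described above, and the last line is (64.3.12).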
To get the unbiased estimate of σ², one can almost take the s² from the regression (64.3.9); one only has to adjust it for the number of degrees of freedom.
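Concretely (a sketch of the adjustment, assuming regression (64.3.9) is run on all tm demeaned observations without an intercept, so that its s² divides by tm − k while the demeaning has already used up m degrees of freedom):

\hat\sigma^2 = \frac{SSE}{tm - m - k} = s^2_{(64.3.9)} \cdot \frac{tm - k}{tm - m - k},

the same denominator as in the F statistic (64.3.14) below.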
Problem 523. We are working in the dummy-variable model for pooled data, which can be written as

(64.3.13)   Y = ια⊤ + [X_1β ··· X_mβ] + E

where Y = [y_1 ··· y_m] is t × m, each of the X_i is t × k, ι is the t-vector of ones, E is a t × m matrix of identically distributed independent error terms with zero mean, and α is an m-vector and β a k-vector of unknown nonrandom parameters.
• a. 3 points. Describe in words the characteristics of this model and how it can come about.

Answer. Each of the m units has a different intercept; the slopes are the same. For example, equal marginal costs but different fixed costs.
• b. 4 points. Describe the issues in estimating this model and how it should be estimated.

Answer. After vectorization OLS is fine, but the design matrix is very big. One can derive formulas that are easier to evaluate numerically, because they involve smaller matrices, by exploiting the structure of the overall design matrix: first estimate the slope parameters by sweeping out the means with D, then recover the intercepts from (64.3.12).
• c. 3 points. Set up an F-test testing whether the individual intercept parameters are indeed different, by running two separate regressions on the restricted and the unrestricted model and using the generic formula (42.1.4) for the F-test. Describe how you would run the restricted and how the unrestricted model. Give the number of constraints, the number of observations, and the number of coefficients in the unrestricted model in terms of m, t, and k.
Answer. The unrestricted regression is the dummy variables regression which was described here: first form DY and all the DX_i, then run regression (64.3.9) without intercept, which is already enough to get the SSE_u. The restricted regression is the pooled regression (64.1.1) with one common intercept, which gives SSE_r.

Number of constraints is m − 1, number of observations is tm, and number of coefficients in the unrestricted model is k + m. The test statistic is given in [JHG+88, (11.4.25) on p. 475]:
(64.3.14)   F = [(SSE_r − SSE_u)/(m − 1)] / [SSE_u/(mt − m − k)]
• d. 3 points. An alternative model specification is the variance components model. Describe it as well as you can, and discuss situations when it would be more appropriate than the model above.

Answer. If one believes that the variances are similar, and if one is not interested in those particular firms in the sample, but in all firms.
Problem 524. 3 points. Enumerate as many commonalities and differences as you can between the dummy variable model for pooling cross-sectional and time series data, and the seemingly unrelated regression model.
Answer. Both models involve different cross-sectional units in overlapping time intervals. In the SUR model, the different equations are related through the disturbances only, while in the dummy variable model, no relationship at all goes through the disturbances: all the errors are independent! But in the dummy variable model, the equations are strongly related since all slope coefficients are equal in the different equations; only the intercepts may differ. In the SUR model, there is no relationship between the parameters in the different equations; the parameter vectors may even be of different lengths. Unlike [JHG+88], I would not call the dummy variable model a special case of the SUR model, since I would no longer call it a SUR model if there are cross-equation restrictions.