CHAPTER 17
Causality and Inference
This chapter establishes the connection between critical realism and Holland and Rubin's modelling of causality in statistics as explained in [Hol86] and [WM83, pp. 3–25] (and the related paper [LN81], which comes from a Bayesian point of view). A different approach to causality and inference, [Roy97], is discussed in chapter/section 2.8. Regarding critical realism and econometrics, [Dow99] should also be mentioned: it is written by a Post Keynesian econometrician working in an explicitly realist framework.
Everyone knows that correlation does not mean causality. Nevertheless, experience shows that statisticians can on occasion make valid inferences about causality. It is therefore legitimate to ask: how and under which conditions can causal conclusions be drawn from a statistical experiment or a statistical investigation of nonexperimental data?
Holland starts his discussion with a description of the "logic of association" (= a flat empirical realism) as opposed to causality (= depth realism). His model for the "logic of association" is essentially the conventional mathematical model of probability by a set U of "all possible outcomes," which we described and criticized on p. 12 above.
After this, Rubin describes his own model (developed together with Holland). Rubin introduces "counterfactual" (or, as Bhaskar would say, "transfactual") elements, since he is not only talking about the value a variable takes for a given individual, but also the value this variable would have taken for the same individual if the causing variables (which Rubin also calls "treatments") had been different. For simplicity, Holland assumes here that the treatment variable has only two levels: either the individual receives the treatment, or he/she does not (in which case he/she belongs to the "control" group). The correlational view would simply measure the average response of those individuals who receive the treatment, and of those who don't. Rubin recognizes in his model that the same individual may or may not be subject to the treatment; therefore the response variable has two values, one being the individual's response if he or she receives the treatment, the other the response if he or she does not.
A third variable indicates who receives the treatment. I.e., he has the "causal indicator" s, which can take two values, t (treatment) and c (control), and two variables y_t and y_c, which, evaluated at individual ω, indicate the responses this individual would give in case he was subject to the treatment, and in case he was not.
Rubin defines y_t − y_c to be the causal effect of treatment t versus the control c. But this causal effect cannot be observed. We cannot observe how those individuals who received the treatment would have responded if they had not received the treatment, despite the fact that this non-actualized response is just as real as the response which they indeed gave. This is what Holland calls the Fundamental Problem of Causal Inference.
Problem 225. Rubin excludes race as a cause because the individual cannot do anything about his or her race. Is this argument justified?
Does this Fundamental Problem mean that causal inference is impossible? Here are several scenarios in which causal inference is possible after all:
• Temporal stability of the response, and transience of the causal effect
• Unit homogeneity
• Constant effect, i.e., y_t(ω) − y_c(ω) is the same for all ω
• Independence of the response with respect to the selection process regarding who gets the treatment
For an example of this last case, say

Problem 226. Our universal set U consists of patients who have a certain disease. We will explore the causal effect of a given treatment with the help of three events, T, C, and S, the first two of which are counterfactual; compare [Hol86]. These events are defined as follows: T consists of all patients who would recover if given treatment; C consists of all patients who would recover if not given treatment (i.e., if included in the control group). The event S consists of all patients actually receiving treatment. The average causal effect of the treatment is defined as Pr[T] − Pr[C].
• a. 2 points. Show that

(17.0.6)    Pr[T] = Pr[T|S] Pr[S] + Pr[T|S′](1 − Pr[S])

and that

(17.0.7)    Pr[C] = Pr[C|S] Pr[S] + Pr[C|S′](1 − Pr[S])

Which of these probabilities can be estimated as the frequencies of observable outcomes and which cannot?
Answer. This is a direct application of (2.7.9). The problem here is that for all ω ∈ S′, i.e., for those patients who do not receive treatment, we do not know whether they would have recovered if given treatment, and for all ω ∈ S, i.e., for those patients who do receive treatment, we do not know whether they would have recovered if not given treatment. In other words, neither Pr[T|S′] nor Pr[C|S] can be estimated as the frequencies of observable outcomes.
• b. 2 points. Assume now that S is independent of T and C, because the subjects are assigned randomly to treatment or control. How can this be used to estimate those elements in the equations (17.0.6) and (17.0.7) which could not be estimated before?

Answer. In this case, Pr[T|S] = Pr[T|S′] and Pr[C|S′] = Pr[C|S]. Therefore, the average causal effect can be simplified as follows:
Pr[T] − Pr[C] = Pr[T|S] Pr[S] + Pr[T|S′](1 − Pr[S]) − (Pr[C|S] Pr[S] + Pr[C|S′](1 − Pr[S]))
            = Pr[T|S] Pr[S] + Pr[T|S](1 − Pr[S]) − (Pr[C|S′] Pr[S] + Pr[C|S′](1 − Pr[S]))
(17.0.8)    = Pr[T|S] − Pr[C|S′]
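To see (17.0.8) at work, here is a minimal simulation sketch (not from [Hol86]; the data-generating process and names such as recover_if_treated are invented for illustration). It generates both potential outcomes for every patient, so the unobservable Pr[T] − Pr[C] is known, and checks that the observable difference Pr[T|S] − Pr[C|S′] recovers it under random assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes for every patient (known only to the simulation):
# T = would recover if treated, C = would recover if untreated.
recover_if_treated = rng.random(n) < 0.7    # Pr[T] = 0.7
recover_if_untreated = rng.random(n) < 0.4  # Pr[C] = 0.4

# Random assignment makes S independent of T and C.
treated = rng.random(n) < 0.5               # the event S

# Each patient reveals only one of the two potential outcomes.
observed_recovery = np.where(treated, recover_if_treated, recover_if_untreated)

true_effect = recover_if_treated.mean() - recover_if_untreated.mean()
estimate = (observed_recovery[treated].mean()      # estimates Pr[T|S]
            - observed_recovery[~treated].mean())  # estimates Pr[C|S']

print(f"Pr[T] - Pr[C]      = {true_effect:.3f}")
print(f"Pr[T|S] - Pr[C|S'] = {estimate:.3f}")  # close, by randomization
```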
The main message of the paper is therefore: before drawing causal conclusions, one should ascertain whether one of the conditions that make causal conclusions possible applies.
In the rest of the paper, Holland compares his approach with other approaches. Suppes's definitions of causality are interesting:

• If r < s denote two time values, event C_r is a prima facie cause of E_s iff Pr[E_s|C_r] > Pr[E_s].

• C_r is a spurious cause of E_s iff it is a prima facie cause of E_s and for some q < r < s there is an event D_q so that Pr[E_s|C_r, D_q] = Pr[E_s|D_q] and Pr[E_s|C_r, D_q] ≥ Pr[E_s|C_r].

• Event C_r is a genuine cause of E_s iff it is a prima facie but not a spurious cause.
This is quite different from Rubin's analysis. Suppes concentrates on the causes of a given effect, not the effects of a given cause. Suppes has a Popperian falsificationist view: a hypothesis is good if one cannot falsify it, while Holland has the depth-realist view which says that the empirical is only a small part of reality, and which looks at the underlying mechanisms.
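To make these definitions concrete, here is a small numerical sketch (my own toy example, in the spirit of Problem 227 below, not taken from Holland or Suppes). An earlier event D_q is a common cause of both C_r and E_s, which are conditionally independent given D_q; the code enumerates the eight atoms of the probability field and verifies that C_r is a prima facie but spurious cause of E_s:

```python
from itertools import product

# Toy probability field: D (the earlier event) is a common cause of C and E,
# which are conditionally independent given D or not-D.
p_D = 0.5
p_C_given = {True: 0.8, False: 0.2}   # Pr[C | D], Pr[C | not D]
p_E_given = {True: 0.8, False: 0.2}   # Pr[E | D], Pr[E | not D]

def p(d, c, e):
    """Probability of the atom (d, c, e) of the field."""
    pd = p_D if d else 1 - p_D
    pc = p_C_given[d] if c else 1 - p_C_given[d]
    pe = p_E_given[d] if e else 1 - p_E_given[d]
    return pd * pc * pe

def pr(event):
    """Pr of an event, given as a predicate on atoms."""
    return sum(p(d, c, e) for d, c, e in product([True, False], repeat=3)
               if event(d, c, e))

pE = pr(lambda d, c, e: e)
pE_C = pr(lambda d, c, e: e and c) / pr(lambda d, c, e: c)
pE_CD = pr(lambda d, c, e: e and c and d) / pr(lambda d, c, e: c and d)
pE_D = pr(lambda d, c, e: e and d) / pr(lambda d, c, e: d)

print(pE_C > pE)                        # True: C is a prima facie cause of E
print(abs(pE_CD - pE_D) < 1e-12
      and pE_CD >= pE_C)                # True: ... and it is spurious
```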
Problem 227. Construct an example of a probability field with a spurious cause.
Granger causality (see chapter/section 67.2.1) is based on the idea: knowing a cause ought to improve our ability to predict. It is more appropriate to speak here of "noncausality" instead of causality: a variable does not cause another if knowing that variable does not improve our ability to predict the other variable. Granger formulates his theory in terms of a specific predictor, the BLUP, while Holland extends it to all predictors. Granger works in a time series framework, while Holland gives a more general formulation. Holland's formulation strips off the unnecessary detail in order to get at the essence of things. Holland defines: x is not a Granger cause of y relative to the information in z (which in the time series context contains the past values of y) if and only if x and y are conditionally independent given z. Problem 40 explains why this can be tested by testing predictive power.
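As a sketch of how predictive power is used here (an illustration of the idea only, not the specific test of Problem 40; the data-generating process is invented), one can compare the mean squared prediction error of a linear predictor of y from its own past with that of a predictor that also uses the past of x:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):              # here x genuinely helps predict y
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

def prediction_mse(target, regressors):
    """MSE of the best linear predictor of target from the regressors."""
    X = np.column_stack([np.ones(len(target))] + regressors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.mean((target - X @ beta) ** 2)

# z = past of y; test whether adding the past of x improves the prediction.
restricted = prediction_mse(y[1:], [y[:-1]])
unrestricted = prediction_mse(y[1:], [y[:-1], x[:-1]])
print(restricted, unrestricted)    # unrestricted is markedly smaller, so x
                                   # is NOT Granger-noncausal for y given z
```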
CHAPTER 18
Mean-Variance Analysis in the Linear Model
In the present chapter, the only distributional assumptions are that means and variances exist. (From this it follows that the covariances also exist.)
18.1 Three Versions of the Linear Model
As background reading please read [CD97, Chapter 1].
Following [JHG+88, Chapter 5], we will start with three different linear statistical models. Model 1 is the simplest estimation problem, already familiar from chapter 12, with n independent observations from the same distribution, call them y_1, …, y_n. The only thing known about the distribution is that mean and variance exist; call them µ and σ². In order to write this as a special case of the "linear model," define ε_i = y_i − µ, and define the vectors y = [y_1 y_2 ··· y_n]⊤, ε = [ε_1 ε_2 ··· ε_n]⊤, and ι = [1 1 ··· 1]⊤. Then one can write the model in the form

(18.1.1)    y = ιµ + ε,    ε ∼ (o, σ²I)

The notation ε ∼ (o, σ²I) is shorthand for E[ε] = o (the null vector) and V[ε] = σ²I (σ² times the identity matrix, which has 1's in the diagonal and 0's elsewhere). µ is the deterministic part of all the y_i, and ε_i is the random part.
Model 2 is "simple regression," in which the deterministic part µ is not constant but is a function of the nonrandom variable x. The assumption here is that this function is differentiable and can, in the range of the variation of the data, be approximated by a linear function [Tin51, pp. 19–20]. I.e., each element of y is a constant α plus a constant multiple of the corresponding element of the nonrandom vector x plus a random error term: y_t = α + x_tβ + ε_t, t = 1, …, n. This can be written as

(18.1.2)    ⎡y_1⎤   ⎡1  x_1⎤       ⎡ε_1⎤
            ⎢ ⋮ ⎥ = ⎢⋮    ⋮⎥ ⎡α⎤ + ⎢ ⋮ ⎥
            ⎣y_n⎦   ⎣1  x_n⎦ ⎣β⎦   ⎣ε_n⎦

or

(18.1.3)    y = [ι  x][α β]⊤ + ε

Problem 228. 1 point. Compute the matrix product

Model 3 is multiple regression, y = Xβ + ε, where the matrix X of explanatory variables is nonrandom (and it is assumed that its columns are linearly independent). Model 3 has Models 1 and 2 as special cases.
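As a concrete illustration (a hypothetical sketch; the data values are made up), the three models differ only in their design matrices: Model 1 uses X = ι, Model 2 uses X = [ι x], and Model 3 allows an arbitrary X:

```python
import numpy as np

n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # nonrandom explanatory variable
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # a second, made-up regressor
iota = np.ones(n)                         # the vector of ones

X1 = iota.reshape(-1, 1)             # Model 1: y = iota*mu + eps
X2 = np.column_stack([iota, x])      # Model 2: y = [iota x][alpha beta]' + eps
X3 = np.column_stack([iota, x, x2])  # Model 3: y = X beta + eps, general X

# Models 1 and 2 are the special cases of Model 3 in which X consists of
# iota alone, or of iota plus a single regressor.
print(X1.shape, X2.shape, X3.shape)  # (5, 1) (5, 2) (5, 3)
```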
Multiple regression is also used to "correct for" disturbing influences. Let me explain. A functional relationship which makes the systematic part of y dependent on some other variable x will usually only hold if other relevant influences are kept constant. If those other influences vary, then they may affect the form of this functional relation. For instance, the marginal propensity to consume may be affected by the interest rate, or the unemployment rate. This is why some econometricians (Hendry) advocate that one should start with an "encompassing" model with many explanatory variables and then narrow the specification down by hypothesis tests. Milton Friedman, by contrast, is very suspicious about multiple regressions, and argues in [FS91, pp. 48/9] against the encompassing approach.

Friedman does not give a theoretical argument but argues by an example from Chemistry. Perhaps one can say that the variations in the other influences may have more serious implications than just modifying the form of the functional relation: they may destroy this functional relation altogether, i.e., prevent any systematic or predictable behavior.
18.2 Ordinary Least Squares
In the model y = Xβ + ε, where ε ∼ (o, σ²I), the OLS-estimate β̂ is defined to be that value β = β̂ which minimizes

(18.2.1)    SSE = (y − Xβ)⊤(y − Xβ) = y⊤y − 2y⊤Xβ + β⊤X⊤Xβ

Problem 184 shows that in model 1, this principle yields the arithmetic mean.
Problem 229. 2 points. Prove that, if one predicts a random variable y by a constant a, the constant which gives the best MSE is a = E[y], and the best MSE one can get is var[y].

Answer. E[(y − a)²] = E[y²] − 2a E[y] + a². Differentiate with respect to a and set zero to get a = E[y]. One can also differentiate first and then take the expected value: E[2(y − a)] = 0. The minimal MSE is then E[(y − E[y])²] = var[y].
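A quick numerical check of this result (a sketch; the exponential distribution is an arbitrary choice, any distribution with a variance works) scans candidate constants a and confirms that the minimizing constant is near E[y], with minimal MSE near var[y]:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(size=100_000)  # E[y] = 1, var[y] = 1 in theory

candidates = np.linspace(0.0, 3.0, 301)
mse = [np.mean((y - a) ** 2) for a in candidates]
best = candidates[int(np.argmin(mse))]

print(best, y.mean())   # the minimizing constant is (close to) E[y]
print(min(mse), y.var())  # and the minimal MSE is (close to) var[y]
```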
We will solve this minimization problem using the first-order conditions in vector notation. As a preparation, you should read the beginning of Appendix C about matrix differentiation and the connection between matrix differentiation and the Jacobian matrix of a vector function. All you need at this point are the two equations ∂w⊤x/∂x⊤ = w⊤ (C.1.6) and ∂x⊤Mx/∂x⊤ = 2x⊤M for symmetric M (C.1.7) for the present derivation.

The matrix differentiation rules (C.1.6) and (C.1.7) allow us to differentiate (18.2.1):

(18.2.2)    ∂SSE/∂β⊤ = −2y⊤X + 2β⊤X⊤X

Transpose it (because it is notationally simpler to have a relationship between column vectors), set it zero while at the same time replacing β by β̂, and divide by 2, to get the "normal equation"

(18.2.3)    X⊤y = X⊤Xβ̂
Due to our assumption that all columns of X are linearly independent, X⊤X has an inverse and one can premultiply both sides of (18.2.3) by (X⊤X)⁻¹:

(18.2.4)    β̂ = (X⊤X)⁻¹X⊤y

If the columns of X are not linearly independent, then (18.2.3) has more than one solution, and the normal equation is also in this case a necessary and sufficient condition for β̂ to minimize the SSE (proof in Problem 232).
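As a numerical sketch (hypothetical data; the β values are invented), one can solve the normal equation (18.2.3) directly and verify that it agrees with a library least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.standard_normal(n)

# Solve the normal equation X'X beta_hat = X'y of (18.2.3) directly ...
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# ... and compare with a library least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # True
```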
Problem 230. 4 points. Using the matrix differentiation rules

(18.2.5)    ∂w⊤x/∂x⊤ = w⊤
(18.2.6)    ∂x⊤Mx/∂x⊤ = 2x⊤M    for symmetric M,

compute the least-squares estimate β̂ which minimizes

(18.2.7)    SSE = (y − Xβ)⊤(y − Xβ)

You are allowed to assume that X⊤X has an inverse.

Answer. First you have to multiply out

(18.2.8)    (y − Xβ)⊤(y − Xβ) = y⊤y − 2y⊤Xβ + β⊤X⊤Xβ.

The matrix differentiation rules (18.2.5) and (18.2.6) allow us to differentiate (18.2.8) to get

(18.2.9)    ∂SSE/∂β⊤ = −2y⊤X + 2β⊤X⊤X.

Transpose it (because it is notationally simpler to have a relationship between column vectors), set it zero while at the same time replacing β by β̂, and divide by 2, to get the "normal equation"

(18.2.10)    X⊤y = X⊤Xβ̂.

Since X⊤X has an inverse, one can premultiply both sides of (18.2.10) by (X⊤X)⁻¹:

(18.2.11)    β̂ = (X⊤X)⁻¹X⊤y.
Problem 231. 2 points. Show the following: if the columns of X are linearly independent, then X⊤X has an inverse. (X itself is not necessarily square.) In your proof you may use the following criteria: the columns of X are linearly independent (this is also called: X has full column rank) if and only if Xa = o implies a = o. And a square matrix has an inverse if and only if its columns are linearly independent.

Answer. We have to show that any a which satisfies X⊤Xa = o is itself the null vector. From X⊤Xa = o follows a⊤X⊤Xa = 0, which can also be written ‖Xa‖² = 0. Therefore Xa = o, and since the columns of X are linearly independent, this implies a = o.
Problem 232. 3 points. In this Problem we do not assume that X has full column rank; it may be arbitrary.

• a. The normal equation (18.2.3) always has at least one solution. Hint: you are allowed to use, without proof, equation (A.3.3) in the mathematical appendix.

Answer. With this hint it is easy: β̂ = (X⊤X)⁻X⊤y is a solution.

• b. If β̂ satisfies the normal equation and β is an arbitrary vector, then

(18.2.12)    (y − Xβ)⊤(y − Xβ) = (y − Xβ̂)⊤(y − Xβ̂) + (β − β̂)⊤X⊤X(β − β̂).

Answer. This is true even if X has deficient rank, and it will be shown here in this general case. To prove (18.2.12), write (18.2.1) as SSE = ((y − Xβ̂) − X(β − β̂))⊤((y − Xβ̂) − X(β − β̂)); since β̂ satisfies (18.2.3), the cross product terms disappear.

• c. Conclude from this that the normal equation is a necessary and sufficient condition characterizing the values β̂ minimizing the sum of squared errors (18.2.1).

Answer. (18.2.12) shows that the normal equations are sufficient. For necessity of the normal equations, let β̂ be an arbitrary solution of the normal equation; we have seen that there is always at least one. Given β̂, it follows from (18.2.12) that for any solution β∗ of the minimization, X⊤X(β∗ − β̂) = o. Use (18.2.3) to replace (X⊤X)β̂ by X⊤y to get X⊤Xβ∗ = X⊤y.
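The rank-deficient case can be illustrated numerically (a sketch with invented data; the Moore–Penrose pseudoinverse serves as one possible generalized inverse (X⊤X)⁻): different solutions of the normal equation all produce the same fitted values Xβ̂:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x1 = rng.standard_normal(n)
X = np.column_stack([x1, 2 * x1])  # rank-deficient: second column = 2 * first
y = rng.standard_normal(n)

# One solution via a generalized inverse of X'X:
b1 = np.linalg.pinv(X.T @ X) @ X.T @ y
# Another solution: add any vector in the null space of X.
null = np.array([2.0, -1.0])       # X @ null = o here
b2 = b1 + 3.7 * null

for b in (b1, b2):
    print(np.allclose(X.T @ X @ b, X.T @ y),  # both solve the normal equation
          np.allclose(X @ b, X @ b1))         # and give the same fitted values
```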
It is customary to use the notation Xβ̂ = ŷ for the so-called fitted values, which are the estimates of the vector of means η = Xβ. Geometrically, ŷ is the orthogonal projection of y on the space spanned by the columns of X. See Theorem A.6.1 about projection matrices.

The vector of differences between the actual and the fitted values is called the vector of "residuals" ε̂ = y − ŷ. The residuals are "predictors" of the actual (but unobserved) values of the disturbance vector ε. An estimator of a random magnitude is usually called a "predictor," but in the linear model estimation and prediction are treated on the same footing, therefore it is not necessary to distinguish between the two.

You should understand the difference between disturbances and residuals, and between the two decompositions

(18.2.13)    y = Xβ + ε = Xβ̂ + ε̂
Problem 233. 2 points. Assume that X has full column rank. Show that ε̂ = My where M = I − X(X⊤X)⁻¹X⊤. Show that M is symmetric and idempotent.

Answer. By definition, ε̂ = y − Xβ̂ = y − X(X⊤X)⁻¹X⊤y = (I − X(X⊤X)⁻¹X⊤)y = My. M is symmetric because (X(X⊤X)⁻¹X⊤)⊤ = X(X⊤X)⁻¹X⊤. Idempotent, i.e., MM = M:

(18.2.14)    MM = (I − X(X⊤X)⁻¹X⊤)(I − X(X⊤X)⁻¹X⊤)
                = I − 2X(X⊤X)⁻¹X⊤ + X(X⊤X)⁻¹X⊤X(X⊤X)⁻¹X⊤
                = I − X(X⊤X)⁻¹X⊤ = M
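A short numerical check of these properties (invented data), which also anticipates the fact MX = O used in Problem 234 below:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = rng.standard_normal(n)

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # M = I - X(X'X)^{-1}X'
residuals = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(M, M.T))             # symmetric
print(np.allclose(M @ M, M))           # idempotent
print(np.allclose(M @ y, residuals))   # eps_hat = M y
print(np.allclose(M @ X, 0))           # M annihilates the columns of X
```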
Problem 234. Assume X has full column rank. Define M = I − X(X⊤X)⁻¹X⊤.

• a. 1 point. Show that the space M projects on is the space orthogonal to all columns in X, i.e., Mq = q if and only if X⊤q = o.

Answer. X⊤q = o clearly implies Mq = q. Conversely, Mq = q implies X(X⊤X)⁻¹X⊤q = o. Premultiply this by X⊤ to get X⊤q = o.

• b. 1 point. Show that a vector q lies in the range space of X, i.e., the space spanned by the columns of X, if and only if Mq = o. In other words, {q : q = Xa for some a} = {q : Mq = o}.

Answer. First assume Mq = o. This means q = X(X⊤X)⁻¹X⊤q = Xa with a = (X⊤X)⁻¹X⊤q. Conversely, if q = Xa then Mq = MXa = Oa = o.
Problem 235. In 2-dimensional space, write down the projection matrix on the diagonal line y = x (call it E), and compute Ez for the three vectors a = [2], b = [2], and c = [3]. Draw these vectors and their projections.
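For the line y = x, the projection matrix is E = vv⊤/(v⊤v) with v = [1 1]⊤. Here is a sketch computing E and applying it (the example vectors below are placeholders chosen for illustration, not necessarily those intended in Problem 235):

```python
import numpy as np

# Projection onto the line y = x, i.e., onto span{[1, 1]'}.
v = np.array([1.0, 1.0])
E = np.outer(v, v) / (v @ v)   # [[0.5, 0.5], [0.5, 0.5]]

print(np.allclose(E @ E, E), np.allclose(E, E.T))  # projection matrix checks

for z in (np.array([2.0, 1.0]), np.array([2.0, 2.0]), np.array([3.0, -1.0])):
    print(z, "->", E @ z)      # each projection has equal coordinates
```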
Assume we have a dependent variable y and two regressors x_1 and x_2, each with 15 observations. Then one can visualize the data either as 15 points in 3-dimensional space (a 3-dimensional scatter plot), or 3 points in 15-dimensional space. In the first case, each point corresponds to an observation; in the second case, each point corresponds to a variable. In this latter case the points are usually represented as vectors. You only have 3 vectors, but each of these vectors is a vector in 15-dimensional space. But you do not have to draw a 15-dimensional space to draw these vectors; these 3 vectors span a 3-dimensional subspace, and ŷ is the projection of the vector y on the space spanned by the two regressors not only in the original 15-dimensional space, but already in this 3-dimensional subspace. In other words, [DM93, Figure 1.3] is valid in all dimensions! In the 15-dimensional space, each dimension represents one observation. In the 3-dimensional subspace, this is no longer true.
Problem 236. "Simple regression" is regression with an intercept and one explanatory variable only, i.e.,