

CHAPTER 17

Causality and Inference

This chapter establishes the connection between critical realism and Holland and Rubin's modelling of causality in statistics as explained in [Hol86] and [WM83, pp. 3–25] (and the related paper [LN81], which comes from a Bayesian point of view). A different approach to causality and inference, [Roy97], is discussed in chapter/section 2.8. Regarding critical realism and econometrics, also [Dow99] should be mentioned: this is written by a Post Keynesian econometrician working in an explicitly realist framework.

Everyone knows that correlation does not mean causality. Nevertheless, experience shows that statisticians can on occasion make valid inferences about causality. It is therefore legitimate to ask: how and under which conditions can causal conclusions be drawn from a statistical experiment or a statistical investigation of nonexperimental data?

Holland starts his discussion with a description of the "logic of association" (= a flat empirical realism) as opposed to causality (= depth realism). His model for the "logic of association" is essentially the conventional mathematical model of probability by a set U of "all possible outcomes," which we described and criticized on p. 12 above.

After this, Rubin describes his own model (developed together with Holland). Rubin introduces "counterfactual" (or, as Bhaskar would say, "transfactual") elements since he is not only talking about the value a variable takes for a given individual, but also the value this variable would have taken for the same individual if the causing variables (which Rubin also calls "treatments") had been different. For simplicity, Holland assumes here that the treatment variable has only two levels: either the individual receives the treatment, or he/she does not (in which case he/she belongs to the "control" group). The correlational view would simply measure the average response of those individuals who receive the treatment, and of those who don't. Rubin recognizes in his model that the same individual may or may not be subject to the treatment, therefore the response variable has two values, one being the individual's response if he or she receives the treatment, the other the response if he or she does not.


A third variable indicates who receives the treatment. I.e., he has the "causal indicator" s, which can take two values, t (treatment) and c (control), and two variables y_t and y_c, which, evaluated at individual ω, indicate the responses this individual would give in case he was subject to the treatment, and in case he was not.

Rubin defines y_t − y_c to be the causal effect of treatment t versus the control c. But this causal effect cannot be observed. We cannot observe how those individuals who received the treatment would have responded if they had not received the treatment, despite the fact that this non-actualized response is just as real as the response which they indeed gave. This is what Holland calls the Fundamental Problem of Causal Inference.

Problem 225. Rubin excludes race as a cause because the individual cannot do anything about his or her race. Is this argument justified?

Does this Fundamental Problem mean that causal inference is impossible? Here are several scenarios in which causal inference is possible after all:

• Temporal stability of the response, and transience of the causal effect

• Unit homogeneity

• Constant effect, i.e., y_t(ω) − y_c(ω) is the same for all ω

• Independence of the response with respect to the selection process regarding who gets the treatment


For an example of this last case, consider the following problem.

Problem 226. Our universal set U consists of patients who have a certain disease. We will explore the causal effect of a given treatment with the help of three events, T, C, and S, the first two of which are counterfactual, compare [Hol86]. These events are defined as follows: T consists of all patients who would recover if given treatment; C consists of all patients who would recover if not given treatment (i.e., if included in the control group). The event S consists of all patients actually receiving treatment. The average causal effect of the treatment is defined as Pr[T] − Pr[C].

• a. 2 points. Show that

(17.0.6) Pr[T] = Pr[T|S] Pr[S] + Pr[T|S′](1 − Pr[S])

and that

(17.0.7) Pr[C] = Pr[C|S] Pr[S] + Pr[C|S′](1 − Pr[S])

Which of these probabilities can be estimated as the frequencies of observable outcomes and which cannot?

Answer. This is a direct application of (2.7.9). The problem here is that for those patients who do not receive treatment (ω ∉ S), we do not know whether they would have recovered if given treatment, and for those patients who do receive treatment (ω ∈ S), we do not know whether they would have recovered if not given treatment. In other words, neither Pr[T|S′] nor Pr[C|S] can be estimated as the frequencies of observable outcomes.

• b. 2 points. Assume now that S is independent of T and C, because the subjects are assigned randomly to treatment or control. How can this be used to estimate those elements in the equations (17.0.6) and (17.0.7) which could not be estimated before?

Answer. In this case, Pr[T|S] = Pr[T|S′] and Pr[C|S′] = Pr[C|S]. Therefore, the average causal effect can be simplified as follows:

Pr[T] − Pr[C] = Pr[T|S] Pr[S] + Pr[T|S′](1 − Pr[S]) − (Pr[C|S] Pr[S] + Pr[C|S′](1 − Pr[S]))

= Pr[T|S] Pr[S] + Pr[T|S](1 − Pr[S]) − (Pr[C|S′] Pr[S] + Pr[C|S′](1 − Pr[S]))

= Pr[T|S] − Pr[C|S′]   (17.0.8)
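To see the randomization argument at work, here is a small simulation sketch (not from the notes; the recovery probabilities 0.6 and 0.4 and the assignment probability 0.5 are invented for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes: each patient recovers with probability
# 0.6 if treated (event T) and 0.4 if untreated (event C).
recovers_if_treated = rng.random(n) < 0.6    # indicator of T
recovers_if_untreated = rng.random(n) < 0.4  # indicator of C

# Random assignment to treatment: S is independent of T and C.
treated = rng.random(n) < 0.5                # indicator of S

# Observable frequencies: recovery among the treated estimates Pr[T|S],
# recovery among the controls estimates Pr[C|S'].
pr_T_given_S = recovers_if_treated[treated].mean()
pr_C_given_Sc = recovers_if_untreated[~treated].mean()

# Under independence, their difference estimates the average causal effect
# Pr[T] - Pr[C] of equation (17.0.8); here it should be close to 0.6 - 0.4 = 0.2.
print(pr_T_given_S - pr_C_given_Sc)
```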


The main message of the paper is therefore: before drawing causal conclusions, one should ascertain whether one of these conditions which make causal conclusions possible applies.

In the rest of the paper, Holland compares his approach with other approaches. Suppes's definitions of causality are interesting:

• If r < s denote two time values, event C_r is a prima facie cause of E_s iff Pr[E_s|C_r] > Pr[E_s].

• C_r is a spurious cause of E_s iff it is a prima facie cause of E_s and for some q < r < s there is an event D_q so that Pr[E_s|C_r, D_q] = Pr[E_s|D_q] and Pr[E_s|C_r, D_q] ≥ Pr[E_s|C_r].

• Event C_r is a genuine cause of E_s iff it is a prima facie but not a spurious cause.

This is quite different from Rubin's analysis. Suppes concentrates on the causes of a given effect, not the effects of a given cause. Suppes has a Popperian falsificationist view: a hypothesis is good if one cannot falsify it, while Holland has the depth-realist view which says that the empirical is only a small part of reality, and which looks at the underlying mechanisms.
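To illustrate these definitions, here is a toy probability field (my own example, with invented numbers) in which an earlier common cause D_q makes C_r a prima facie but spurious cause of E_s:

```python
from itertools import product

# Toy probability field: an early event D causes both C (at time r) and
# E (at time s); C has no effect on E once D is known.
p_D = 0.5
p_C_given = {True: 0.8, False: 0.2}   # Pr[C | D], Pr[C | not D]
p_E_given = {True: 0.9, False: 0.1}   # Pr[E | D], Pr[E | not D]; independent of C given D

def pr(event):
    """Sum the probabilities of all elementary outcomes (d, c, e) in `event`."""
    total = 0.0
    for d, c, e in product([True, False], repeat=3):
        p = ((p_D if d else 1 - p_D)
             * (p_C_given[d] if c else 1 - p_C_given[d])
             * (p_E_given[d] if e else 1 - p_E_given[d]))
        if event(d, c, e):
            total += p
    return total

pr_E          = pr(lambda d, c, e: e)
pr_E_given_C  = pr(lambda d, c, e: e and c) / pr(lambda d, c, e: c)
pr_E_given_CD = pr(lambda d, c, e: e and c and d) / pr(lambda d, c, e: c and d)
pr_E_given_D  = pr(lambda d, c, e: e and d) / pr(lambda d, c, e: d)

print(pr_E_given_C, ">", pr_E)           # 0.74 > 0.5: C is a prima facie cause of E
print(pr_E_given_CD, "=", pr_E_given_D)  # 0.9 = 0.9 and 0.9 >= 0.74: D screens C off,
                                         # so C is a spurious, not a genuine, cause
```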

Problem 227. Construct an example of a probability field with a spurious cause.


Granger causality (see chapter/section 67.2.1) is based on the idea: knowing a cause ought to improve our ability to predict. It is more appropriate to speak here of "noncausality" instead of causality: a variable does not cause another if knowing that variable does not improve our ability to predict the other variable. Granger formulates his theory in terms of a specific predictor, the BLUP, while Holland extends it to all predictors. Granger works on it in a time series framework, while Holland gives a more general formulation. Holland's formulation strips off the unnecessary detail in order to get at the essence of things. Holland defines: x is not a Granger cause of y relative to the information in z (which in the time series context contains the past values of y) if and only if x and y are conditionally independent given z. Problem 40 explains why this can be tested by testing predictive power.
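As a rough sketch of how predictive power is used in practice (this is my own illustration, not Granger's or Holland's exact procedure; the helper name, lag length, and data are assumptions), one can compare a regression of y on its own past with a regression that also includes the past of x, and form an F statistic for the added lags:

```python
import numpy as np

def granger_f_test(y, x, lags=2):
    """Sketch of a Granger (non)causality check: does the past of x improve
    the prediction of y beyond y's own past?  Returns the F statistic."""
    n = len(y)
    Y = y[lags:]
    # Restricted model: constant plus lags of y; full model adds lags of x.
    Z_restricted = np.column_stack(
        [np.ones(n - lags)] + [y[lags - j:n - j] for j in range(1, lags + 1)])
    Z_full = np.column_stack(
        [Z_restricted] + [x[lags - j:n - j] for j in range(1, lags + 1)])

    def ssr(Z):
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        resid = Y - Z @ beta
        return resid @ resid

    ssr_r, ssr_f = ssr(Z_restricted), ssr(Z_full)
    df_num, df_den = lags, len(Y) - Z_full.shape[1]
    return ((ssr_r - ssr_f) / df_num) / (ssr_f / df_den)

# Made-up data in which the past of x really does help predict y:
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

print(granger_f_test(y, x))  # a large F statistic rejects Granger noncausality
```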


CHAPTER 18

Mean-Variance Analysis in the Linear Model

In the present chapter, the only distributional assumptions are that means and variances exist. (From this it follows that the covariances also exist.)

18.1 Three Versions of the Linear Model

As background reading please read [CD97, Chapter 1].

Following [JHG+88, Chapter 5], we will start with three different linear statistical models. Model 1 is the simplest estimation problem, already familiar from chapter 12, with n independent observations from the same distribution, call them y_1, ..., y_n. The only thing known about the distribution is that mean and variance exist, call them µ and σ². In order to write this as a special case of the "linear model," define


ε_i = y_i − µ, and define the vectors y = [y_1 y_2 ··· y_n]⊤, ε = [ε_1 ε_2 ··· ε_n]⊤, and ι = [1 1 ··· 1]⊤. Then one can write the model in the form

(18.1.1) y = ιµ + ε,   ε ∼ (o, σ²I).

The notation ε ∼ (o, σ²I) is shorthand for E[ε] = o (the null vector) and V[ε] = σ²I (σ² times the identity matrix, which has 1's in the diagonal and 0's elsewhere). µ is the deterministic part of all the y_i, and ε_i is the random part.

Model 2 is "simple regression" in which the deterministic part µ is not constant but is a function of the nonrandom variable x. The assumption here is that this function is differentiable and can, in the range of the variation of the data, be approximated by a linear function [Tin51, pp. 19–20]. I.e., each element of y is a constant α plus a constant multiple of the corresponding element of the nonrandom vector x plus a random error term: y_t = α + x_t β + ε_t, t = 1, ..., n. This can be written as

(18.1.2)

⎡ y_1 ⎤   ⎡ 1  x_1 ⎤            ⎡ ε_1 ⎤
⎢  ⋮  ⎥ = ⎢ ⋮   ⋮  ⎥ ⎡ α ⎤   +  ⎢  ⋮  ⎥
⎣ y_n ⎦   ⎣ 1  x_n ⎦ ⎣ β ⎦      ⎣ ε_n ⎦

or, written as a single vector equation, y = ια + xβ + ε.


Problem 228. 1 point. Compute the matrix product

Model 3 is the general linear model y = Xβ + ε with ε ∼ (o, σ²I), where X is a nonrandom matrix of explanatory variables (whose columns, we assume for now, are linearly independent). Model 3 has Models 1 and 2 as special cases.

Multiple regression is also used to "correct for" disturbing influences. Let me explain. A functional relationship, which makes the systematic part of y dependent on some other variable x, will usually only hold if other relevant influences are kept constant. If those other influences vary, then they may affect the form of this functional relation. For instance, the marginal propensity to consume may be affected by the interest rate, or the unemployment rate. This is why some econometricians


(Hendry) advocate that one should start with an "encompassing" model with many explanatory variables and then narrow the specification down by hypothesis tests. Milton Friedman, by contrast, is very suspicious about multiple regressions, and argues in [FS91, pp. 48/9] against the encompassing approach.

Friedman does not give a theoretical argument but argues by an example from chemistry. Perhaps one can say that the variations in the other influences may have more serious implications than just modifying the form of the functional relation: they may destroy this functional relation altogether, i.e., prevent any systematic or predictable behavior.


18.2 Ordinary Least Squares

In the model y = Xβ + ε, where ε ∼ (o, σ²I), the OLS-estimate β̂ is defined to be that value β = β̂ which minimizes

(18.2.1) SSE = (y − Xβ)⊤(y − Xβ) = y⊤y − 2y⊤Xβ + β⊤X⊤Xβ.

Problem 184 shows that in model 1, this principle yields the arithmetic mean.
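As a quick numerical sanity check of this claim (an added sketch; the data vector is made up), regressing y on the vector of ones ι indeed returns the sample mean:

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # made-up observations
iota = np.ones((len(y), 1))

# OLS of y on iota: (iota' iota)^{-1} iota' y = (1/n) * sum(y), the sample mean.
mu_hat, *_ = np.linalg.lstsq(iota, y, rcond=None)
print(mu_hat, y.mean())   # both equal 2.8
```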


Problem 229. 2 points. Prove that, if one predicts a random variable y by a constant a, the constant which gives the best MSE is a = E[y], and the best MSE one can get is var[y].

Answer. E[(y − a)²] = E[y²] − 2a E[y] + a². Differentiate with respect to a and set to zero to get a = E[y]. One can also differentiate first and then take the expected value: E[2(y − a)] = 0.
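Equivalently, one can complete the square: E[(y − a)²] = E[((y − E[y]) + (E[y] − a))²] = var[y] + (E[y] − a)², since the cross term 2(E[y] − a) E[y − E[y]] vanishes; the right-hand side is minimized exactly at a = E[y], with minimal value var[y].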

We will solve this minimization problem using the first-order conditions in vector notation. As a preparation, you should read the beginning of Appendix C about matrix differentiation and the connection between matrix differentiation and the Jacobian matrix of a vector function. All you need at this point are the two matrix differentiation rules (C.1.6) and (C.1.7) for the present derivation.

The matrix differentiation rules (C.1.6) and (C.1.7) allow us to differentiate:

(18.2.2) ∂SSE/∂β⊤ = −2y⊤X + 2β⊤X⊤X

Transpose it (because it is notationally simpler to have a relationship between column vectors), set it to zero while at the same time replacing β by β̂, and divide by 2, to get the "normal equation"

(18.2.3) X⊤y = X⊤Xβ̂.


Due to our assumption that all columns of X are linearly independent, X⊤X has an inverse and one can premultiply both sides of (18.2.3) by (X⊤X)⁻¹:

(18.2.4) β̂ = (X⊤X)⁻¹X⊤y

If the columns of X are not linearly independent, then (18.2.3) has more than one solution, and the normal equation is also in this case a necessary and sufficient condition for β̂ to minimize the SSE (proof in Problem 232).
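Here is a minimal numerical sketch of (18.2.3)–(18.2.4) (an added illustration; the data are simulated and the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: n = 50 observations, k = 3 regressors including a constant.
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.standard_normal(n)

# OLS estimate via the normal equation (18.2.3): solve X'X beta = X'y.
# Solving the linear system is preferable to forming (X'X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with numpy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)   # the two should agree
```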

Problem 230. 4 points. Using the matrix differentiation rules

(18.2.5) ∂w⊤x/∂x⊤ = w⊤

(18.2.6) ∂x⊤Mx/∂x⊤ = 2x⊤M

for symmetric M, compute the least-squares estimate β̂ which minimizes

(18.2.7) SSE = (y − Xβ)⊤(y − Xβ)

You are allowed to assume that X⊤X has an inverse.

Answer. First you have to multiply out

(18.2.8) (y − Xβ)⊤(y − Xβ) = y⊤y − 2y⊤Xβ + β⊤X⊤Xβ.

The matrix differentiation rules (18.2.5) and (18.2.6) allow us to differentiate (18.2.8) to get

(18.2.9) ∂SSE/∂β⊤ = −2y⊤X + 2β⊤X⊤X.


Transpose it (because it is notationally simpler to have a relationship between column vectors), set it to zero while at the same time replacing β by β̂, and divide by 2, to get the "normal equation"

(18.2.10) X⊤y = X⊤Xβ̂.

Since X⊤X has an inverse, one can premultiply both sides of (18.2.10) by (X⊤X)⁻¹:

(18.2.11) β̂ = (X⊤X)⁻¹X⊤y.



Problem 231. 2 points. Show the following: if the columns of X are linearly independent, then X⊤X has an inverse. (X itself is not necessarily square.) In your proof you may use the following criteria: the columns of X are linearly independent (this is also called: X has full column rank) if and only if Xa = o implies a = o. And a square matrix has an inverse if and only if its columns are linearly independent.

Answer. We have to show that any a which satisfies X⊤Xa = o is itself the null vector. From X⊤Xa = o follows a⊤X⊤Xa = 0, which can also be written ‖Xa‖² = 0. Therefore Xa = o, and since the columns of X are linearly independent, this implies a = o.

Problem 232. 3 points. In this Problem we do not assume that X has full column rank; it may be arbitrary.

• a. The normal equation (18.2.3) always has at least one solution. Hint: you are allowed to use, without proof, equation (A.3.3) in the mathematical appendix.


Answer. With this hint it is easy: β̂ = (X⊤X)⁻X⊤y is a solution.

• b. If β̂ satisfies the normal equation and β is an arbitrary vector, then

(18.2.12) (y − Xβ)⊤(y − Xβ) = (y − Xβ̂)⊤(y − Xβ̂) + (β − β̂)⊤X⊤X(β − β̂).

Answer. This is true even if X has deficient rank, and it will be shown here in this general case. To prove (18.2.12), write (18.2.1) as SSE = ((y − Xβ̂) − X(β − β̂))⊤((y − Xβ̂) − X(β − β̂)); since β̂ satisfies (18.2.3), the cross product terms disappear.

• c. Conclude from this that the normal equation is a necessary and sufficient condition characterizing the values β̂ minimizing the sum of squared errors (18.2.12).

Answer. (18.2.12) shows that the normal equations are sufficient. For necessity of the normal equations, let β̂ be an arbitrary solution of the normal equation; we have seen that there is always at least one. Given β̂, it follows from (18.2.12) that for any solution β* of the minimization, X⊤X(β* − β̂) = o. Use (18.2.3) to replace (X⊤X)β̂ by X⊤y to get X⊤Xβ* = X⊤y.

It is customary to use the notation Xβ̂ = ŷ for the so-called fitted values, which are the estimates of the vector of means η = Xβ. Geometrically, ŷ is the orthogonal projection of y on the space spanned by the columns of X. See Theorem A.6.1 about projection matrices.

The vector of differences between the actual and the fitted values is called the vector of "residuals" ε̂ = y − ŷ. The residuals are "predictors" of the actual (but unobserved) values of the disturbance vector ε. An estimator of a random magnitude is usually called a "predictor," but in the linear model estimation and prediction are treated on the same footing, therefore it is not necessary to distinguish between the two.

You should understand the difference between disturbances and residuals, and between the two decompositions

(18.2.13) y = Xβ + ε = Xβ̂ + ε̂.

Problem 233. 2 points. Assume that X has full column rank. Show that ε̂ = My where M = I − X(X⊤X)⁻¹X⊤. Show that M is symmetric and idempotent.

Answer. By definition, ε̂ = y − Xβ̂ = y − X(X⊤X)⁻¹X⊤y = (I − X(X⊤X)⁻¹X⊤)y = My. Idempotent, i.e. MM = M:

(18.2.14) MM = (I − X(X⊤X)⁻¹X⊤)(I − X(X⊤X)⁻¹X⊤) = I − X(X⊤X)⁻¹X⊤ − X(X⊤X)⁻¹X⊤ + X(X⊤X)⁻¹X⊤X(X⊤X)⁻¹X⊤ = I − 2X(X⊤X)⁻¹X⊤ + X(X⊤X)⁻¹X⊤ = I − X(X⊤X)⁻¹X⊤ = M.
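A quick numerical check of these facts (an added sketch with simulated data, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.standard_normal((20, 2))])
y = rng.standard_normal(20)

# The "residual maker" M = I - X (X'X)^{-1} X'.
M = np.eye(20) - X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

print(np.allclose(M, M.T))            # symmetric
print(np.allclose(M @ M, M))          # idempotent
print(np.allclose(M @ y, residuals))  # eps_hat = M y
```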



Problem 234. Assume X has full column rank. Define M = I − X(X⊤X)⁻¹X⊤.

• a. 1 point. Show that the space M projects on is the space orthogonal to all columns in X, i.e., Mq = q if and only if X⊤q = o.


Answer. X⊤q = o clearly implies Mq = q. Conversely, Mq = q implies X(X⊤X)⁻¹X⊤q = o. Premultiply this by X⊤ to get X⊤q = o.

• b. 1 point. Show that a vector q lies in the range space of X, i.e., the space spanned by the columns of X, if and only if Mq = o. In other words, {q : q = Xa for some a} = {q : Mq = o}.

Answer. First assume Mq = o. This means q = X(X⊤X)⁻¹X⊤q = Xa with a = (X⊤X)⁻¹X⊤q. Conversely, if q = Xa then Mq = MXa = Oa = o.

Problem 235. In 2-dimensional space, write down the projection matrix on the diagonal line y = x (call it E), and compute Ez for the three vectors a = [2], b = [2], and c = [3]. Draw these vectors and their projections.
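For reference, a small computational sketch of this projection; the example vectors below are hypothetical stand-ins, not the ones from the problem statement:

```python
import numpy as np

# Projection matrix onto the diagonal line y = x, i.e. onto span{(1, 1)}:
# E = v v' / (v'v) with v = (1, 1), which gives the matrix of all 1/2's.
v = np.array([1.0, 1.0])
E = np.outer(v, v) / (v @ v)
print(E)                   # [[0.5, 0.5], [0.5, 0.5]]

# Hypothetical example vectors and their projections onto the line y = x:
for z in [np.array([2.0, 1.0]), np.array([-2.0, 2.0]), np.array([3.0, 3.0])]:
    print(z, "->", E @ z)  # each projection has equal coordinates
```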

Assume we have a dependent variable y and two regressors x_1 and x_2, each with 15 observations. Then one can visualize the data either as 15 points in 3-dimensional space (a 3-dimensional scatter plot), or 3 points in 15-dimensional space. In the first case, each point corresponds to an observation, in the second case, each point corresponds to a variable. In this latter case the points are usually represented as vectors. You only have 3 vectors, but each of these vectors is a vector in 15-dimensional space. But you do not have to draw a 15-dimensional space to draw these vectors; these 3 vectors span a 3-dimensional subspace, and ŷ is the projection of the vector y on the space spanned by the two regressors not only in the original 15-dimensional space, but already in this 3-dimensional subspace. In other words, [DM93, Figure 1.3] is valid in all dimensions! In the 15-dimensional space, each dimension represents one observation. In the 3-dimensional subspace, this is no longer true.

Problem 236. "Simple regression" is regression with an intercept and one explanatory variable only, i.e.,
