
Statistical Models: Theory and Practice, Revised Edition




This lively and engaging textbook explains the things you have to know in order to read empirical papers in the social and health sciences, as well as the techniques you need to build statistical models of your own. The author, David A. Freedman, explains the basic ideas of association and regression, and takes you through the current models that link these ideas to causality. The focus is on applications of linear models, including generalized least squares and two-stage least squares, with probits and logits for binary variables. The bootstrap is developed as a technique for estimating bias and computing standard errors. Careful attention is paid to the principles of statistical inference. There is background material on study design, bivariate regression, and matrix algebra. To develop technique, there are computer labs with sample computer programs. The book is rich in exercises, most with answers.

Target audiences include advanced undergraduates and beginning graduate students in statistics, as well as students and professionals in the social and health sciences. The discussion in the book is organized around published studies, as are many of the exercises. Relevant journal articles are reprinted at the back of the book. Freedman makes a thorough appraisal of the statistical methods in these papers and in a variety of other examples. He illustrates the principles of modeling, and the pitfalls. The discussion shows you how to think about the critical issues—including the connection (or lack of it) between the statistical models and the real phenomena.

Features of the book

• Authoritative guide by a well-known author with wide experience in teaching, research, and consulting

• Will be of interest to anyone who deals with applied statistics

• No-nonsense, direct style

• Careful analysis of statistical issues that come up in substantive applications, mainly in the social and health sciences

• Can be used as a text in a course or read on its own

• Developed over many years at Berkeley, thoroughly class tested

• Background material on regression and matrix algebra

• Plenty of exercises

• Extra material for instructors, including data sets and MATLAB code for lab projects (send email to solutions@cambridge.org)


David A. Freedman (1938–2008) was Professor of Statistics at the University of California, Berkeley. He was a distinguished mathematical statistician whose theoretical research ranged from the analysis of martingale inequalities, Markov processes, de Finetti’s theorem, consistency of Bayes estimators, sampling, the bootstrap, and procedures for testing and evaluating models to methods for causal inference.

Freedman published widely on the application—and misapplication—of statistics in the social sciences, including epidemiology, demography, public policy, and law. He emphasized exposing and checking the assumptions that underlie standard methods, as well as understanding how those methods behave when the assumptions are false—for example, how regression models behave when fitted to data from randomized experiments. He had a remarkable talent for integrating carefully honed statistical arguments with compelling empirical applications and illustrations, as this book exemplifies. Freedman was a member of the American Academy of Arts and Sciences, and in 2003 received the National Academy of Sciences’ John J. Carty Award for his “profound contributions to the theory and practice of statistics.”

Cover illustration

The ellipse on the cover shows the region in the plane where a bivariate normal probability density exceeds a threshold level. The correlation coefficient is 0.50. The means of x and y are equal. So are the standard deviations. The dashed line is both the major axis of the ellipse and the SD line. The solid line gives the regression of y on x. The normal density (with suitable means and standard deviations) serves as a mathematical idealization of the Pearson-Lee data on heights, discussed in chapter 2. Normal densities are reviewed in chapter 3.


David A. Freedman
University of California, Berkeley


Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521112437

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.


Foreword to the Revised Edition xi


6.4 Inferring causation by regression 91

Why not regression? 123

The latent-variable formulation 123

The second equation 134

Mechanics: bivariate probit 136

Why a model rather than a cross-tab? 138

Interactions 138

More on table 3 in Evans and Schwab 139

More on the second equation 139


8.2 Bootstrapping a model for energy demand 167


The Computer Labs 294

Appendix: Sample MATLAB Code 310

Reprints

Gibson on McCarthy 315

Evans and Schwab on Catholic Schools 343

Rindfuss et al. on Education and Fertility 377

Schneider et al. on Social Capital 402


Some books are correct. Some are clear. Some are useful. Some are entertaining. Few are even two of these. This book is all four. Statistical Models: Theory and Practice is lucid, candid, and insightful, a joy to read.

We are fortunate that David Freedman finished this new edition before his death in late 2008. We are deeply saddened by his passing, and we greatly admire the energy and cheer he brought to this volume—and many other projects—during his final months.

This book focuses on half a dozen of the most common tools in applied statistics, presenting them crisply, without jargon or hyperbole. It dissects real applications: a quarter of the book reprints articles from the social and life sciences that hinge on statistical models. It articulates the assumptions necessary for the tools to behave well and identifies the work that the assumptions do. This clarity makes it easier for students and practitioners to see where the methods will be reliable; where they are likely to fail, and how badly; where a different method might work; and where no inference is possible—no matter what tool somebody tries to sell them.

Many texts at this level are little more than bestiaries of methods, presenting dozens of tools with scant explication or insight, a cookbook, numbers-are-numbers approach. “If the left hand side is continuous, use a linear model; fit by least-squares. If the left hand side is discrete, use a logit or probit model; fit by maximum likelihood.” Presenting statistics this way invites students to believe that the resulting parameter estimates, standard errors, and tests of significance are meaningful—perhaps even untangling complex causal relationships. It teaches students to think scientific inference is purely algorithmic. Plug in the numbers; out comes science. This undervalues both substantive and statistical knowledge.

To select an appropriate statistical method actually requires careful thought about how the data were collected and what they measure. Data are not “just numbers.” Using statistical methods in situations where the underlying assumptions are false can yield gold or dross—but more often dross.

Statistical Models brings this message home by showing both good and questionable applications of statistical tools in landmark research: a study of political intolerance during the McCarthy period, the effect of Catholic schooling on completion of high school and entry into college, the relationship between fertility and education, and the role of government institutions in shaping social capital. Other examples are drawn from medicine and epidemiology, including John Snow’s classic work on the cause of cholera—a shining example of the success of simple statistical tools when paired with substantive knowledge and plenty of shoe leather. These real applications bring the theory to life and motivate the exercises.

The text is accessible to upper-division undergraduates and beginning graduate students. Advanced graduate students and established researchers will also find new insights. Indeed, the three of us have learned much by reading it and teaching from it.

And those who read this textbook have not exhausted Freedman’s approachable work on these topics. Many of his related research articles are collected in Statistical Models and Causal Inference: A Dialogue with the Social Sciences (Cambridge University Press, 2009), a useful companion to this text. The collection goes further into some applications mentioned in the textbook, such as the etiology of cholera and the health effects of Hormone Replacement Therapy. Other applications range from adjusting the census for undercount to quantifying earthquake risk. Several articles address theoretical issues raised in the textbook. For instance, randomized assignment in an experiment is not enough to justify regression: without further assumptions, multiple regression estimates of treatment effects are biased. The collection also covers the philosophical foundations of statistics and methods the textbook does not, such as survival analysis.

Statistical Models: Theory and Practice presents serious applications and the underlying theory without sacrificing clarity or accessibility. Freedman shows with wit and clarity how statistical analysis can inform and how it can deceive. This book is unlike any other, a treasure: an introductory book that conveys some of the wisdom required to make reliable statistical inferences. It is an important part of Freedman’s legacy.

David Collier, Jasjeet Singh Sekhon, and Philip B. Stark
University of California, Berkeley


This book is primarily intended for advanced undergraduates or beginning graduate students in statistics. It should also be of interest to many students and professionals in the social and health sciences. Although written as a textbook, it can be read on its own. The focus is on applications of linear models, including generalized least squares, two-stage least squares, probits and logits. The bootstrap is explained as a technique for estimating bias and computing standard errors.

The contents of the book can fairly be described as what you have to know in order to start reading empirical papers that use statistical models. The emphasis throughout is on the connection—or lack of connection—between the models and the real phenomena. Much of the discussion is organized around published studies; the key papers are reprinted for ease of reference. Some observers may find the tone of the discussion too skeptical. If you are among them, I would make an unusual request: suspend belief until you finish reading the book. (Suspension of disbelief is all too easily obtained, but that is a topic for another day.)

The first chapter contrasts observational studies with experiments, and introduces regression as a technique that may help to adjust for confounding in observational studies. There is a chapter that explains the regression line, and another chapter with a quick review of matrix algebra. (At Berkeley, half the statistics majors need these chapters.) The going would be much easier with students who know such material. Another big plus would be a solid upper-division course introducing the basics of probability and statistics.

Technique is developed by practice. At Berkeley, we have lab sessions where students use the computer to analyze data. There is a baker’s dozen of these labs at the back of the book, with outlines for several more, and there are sample computer programs. Data are available to instructors from the publisher, along with source files for the labs and computer code: send email to solutions@cambridge.org.

A textbook is only as good as its exercises, and there are plenty of them in the pages that follow. Some are mathematical and some are hypothetical, providing the analogs of lemmas and counter-examples in a more conventional treatment. On the other hand, many of the exercises are based on actual studies. Here is a summary of the data and the analysis; here is a specific issue: where do you come down? Answers to most of the exercises are at the back of the book. Beyond exercises and labs, students at Berkeley write papers during the semester. Instructions for projects are also available from the publisher.

A text is defined in part by what it chooses to discuss, and in part by what it chooses to ignore; the topics of interest are not to be covered in one book, no matter how thick. My objective was to explain how practitioners infer causation from association, with the bootstrap as a counterpoint to the usual asymptotics. Examining the logic of the enterprise is crucial, and that takes time. If a favorite technique has been slighted, perhaps this reasoning will make amends.

There is enough material in the book for 15–20 weeks of lectures and discussion at the undergraduate level, or 10–15 weeks at the graduate level. With undergraduates on the semester system, I cover chapters 1–7, and introduce simultaneity (sections 9.1–4). This usually takes 13 weeks. If things go quickly, I do the bootstrap (chapter 8), and the examples in chapter 9. On a quarter system with ten-week terms, I would skip the student presentations and chapters 8–9; the bivariate probit model in chapter 7 could also be dispensed with.

During the last two weeks of a semester, students present their projects, or discuss them with me in office hours. I often have a review period on the last day of class. For a graduate course, I supplement the material with additional case studies and discussion of technique.

The revised text organizes the chapters somewhat differently, which makes the teaching much easier. The exposition has been improved in a number of other ways, without (I hope) introducing new difficulties. There are many new examples and exercises.


(i) to summarize data,

(ii) to predict the future,

(iii) to predict the results of interventions

The third—causal inference—is the most interesting and the most slippery. It will be our focus. For background, this section covers some basic principles of study design.

Causal inferences are made from observational studies, natural experiments, and randomized controlled experiments. When using observational (non-experimental) data to make causal inferences, the key problem is confounding. Sometimes this problem is handled by subdividing the study population (stratification, also called cross-tabulation), and sometimes by modeling. These strategies have various strengths and weaknesses, which need to be explored.


In medicine and social science, causal inferences are most solid when based on randomized controlled experiments, where investigators assign subjects at random—by the toss of a coin—to a treatment group or to a control group. Up to random error, the coin balances the two groups with respect to all relevant factors other than treatment. Differences between the treatment group and the control group are therefore due to treatment. That is why causation is relatively easy to infer from experimental data. However, experiments tend to be expensive, and may be impossible for ethical or practical reasons. Then statisticians turn to observational studies.
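The claim that the coin balances the two groups, up to random error, can be checked with a small simulation. Everything below is an invented sketch, not data from the text: a population of 10,000 hypothetical subjects, 30% of whom carry some background trait, randomized by coin toss.

```python
import random

random.seed(0)

# Hypothetical population: indicator for a background trait (prevalence 30%).
# The trait plays the role of "a relevant factor other than treatment".
subjects = [1 if random.random() < 0.30 else 0 for _ in range(10_000)]

# Assign each subject to treatment or control by an independent coin toss.
# Note the trait is never consulted during assignment.
treatment, control = [], []
for trait in subjects:
    (treatment if random.random() < 0.5 else control).append(trait)

prevalence_t = sum(treatment) / len(treatment)
prevalence_c = sum(control) / len(control)

# Up to random error, the two groups have the same prevalence of the trait.
print(round(prevalence_t, 3), round(prevalence_c, 3))
```

With groups of about 5,000 each, the chance difference in prevalence is on the order of one percentage point; that is the "random error" the passage refers to.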

In an observational study, it is the subjects who assign themselves to the different groups. The investigators just watch what happens. Studies on the effects of smoking, for instance, are necessarily observational. However, the treatment-control terminology is still used. The investigators compare smokers (the treatment group, also called the exposed group) with nonsmokers (the control group) to determine the effect of smoking. The jargon is a little confusing, because the word “control” has two senses:

(i) a control is a subject who did not get the treatment;

(ii) a controlled experiment is a study where the investigators decide who will be in the treatment group.

Smokers come off badly in comparison with nonsmokers. Heart attacks, lung cancer, and many other diseases are more common among smokers. There is a strong association between smoking and disease. If cigarettes cause disease, that explains the association: death rates are higher for smokers because cigarettes kill. Generally, association is circumstantial evidence for causation. However, the proof is incomplete. There may be some hidden confounding factor which makes people smoke and also makes them sick. If so, there is no point in quitting: that will not change the hidden factor. Association is not the same as causation.

Confounding means a difference between the treatment and control groups—other than the treatment—which affects the response being studied.

Typically, a confounder is a third variable which is associated with exposure and influences the risk of disease.

Statisticians like Joseph Berkson and R. A. Fisher did not believe the evidence against cigarettes, and suggested possible confounding variables. Epidemiologists (including Richard Doll and Bradford Hill in England, as well as Wynder, Graham, Hammond, Horn, and Kahn in the United States) ran careful observational studies to show these alternative explanations were not plausible. Taken together, the studies make a powerful case that smoking causes heart attacks, lung cancer, and other diseases. If you give up smoking, you will live longer.

Epidemiological studies often make comparisons separately for smaller and more homogeneous groups, assuming that within these groups, subjects have been assigned to treatment or control as if by randomization. For example, a crude comparison of death rates among smokers and nonsmokers could be misleading if smokers are disproportionately male, because men are more likely than women to have heart disease and cancer. Gender is therefore a confounder. To control for this confounder—a third use of the word “control”—epidemiologists compared male smokers to male nonsmokers, and females to females.

Age is another confounder. Older people have different smoking habits, and are more at risk for heart disease and cancer. So the comparison between smokers and nonsmokers was made separately by gender and age: for example, male smokers age 55–59 were compared to male nonsmokers in the same age group. This controls for gender and age. Air pollution would be a confounder, if air pollution causes lung cancer and smokers live in more polluted environments. To control for this confounder, epidemiologists made comparisons separately in urban, suburban, and rural areas. In the end, explanations for health effects of smoking in terms of confounders became very, very implausible.

Of course, as we control for more and more variables this way, study groups get smaller and smaller, leaving more and more room for chance effects. This is a problem with cross-tabulation as a method for dealing with confounders, and a reason for using statistical models. Furthermore, most observational studies are less compelling than the ones on smoking. The following (slightly artificial) example illustrates the problem.

Example 1. In cross-national comparisons, there is a striking correlation between the number of telephone lines per capita in a country and the death rate from breast cancer in that country. This is not because talking on the telephone causes cancer. Richer countries have more phones and higher cancer rates. The probable explanation for the excess cancer risk is that women in richer countries have fewer children. Pregnancy—especially early first pregnancy—is protective. Differences in diet and other lifestyle factors across countries may also play some role.

Randomized controlled experiments minimize the problem of confounding. That is why causal inferences from randomized controlled experiments are stronger than those from observational studies. With observational studies of causation, you always have to worry about confounding. What were the treatment and control groups? How were they different, apart from treatment? What adjustments were made to take care of the differences? Are these adjustments sensible?

The rest of this chapter will discuss examples: the HIP trial of mammography, Snow on cholera, and the causes of poverty.

1.2 The HIP trial

Breast cancer is one of the most common malignancies among women in Canada and the United States. If the cancer is detected early enough—before it spreads—chances of successful treatment are better. “Mammography” means screening women for breast cancer by X-rays. Does mammography speed up detection by enough to matter? The first large-scale randomized controlled experiment was HIP (Health Insurance Plan) in New York, followed by the Two-County study in Sweden. There were about half a dozen other trials as well. Some were negative (screening doesn’t help) but most were positive. By the late 1980s, mammography had gained general acceptance.

The HIP study was done in the early 1960s. HIP was a group medical practice which had at the time some 700,000 members. Subjects in the experiment were 62,000 women age 40–64, members of HIP, who were randomized to treatment or control. “Treatment” consisted of invitation to 4 rounds of annual screening—a clinical exam and mammography. The control group continued to receive usual health care. Results from the first 5 years of followup are shown in table 1. In the treatment group, about 2/3 of the women accepted the invitation to be screened, and 1/3 refused. Death rates (per 1000 women) are shown, so groups of different sizes can be compared.

Table 1. HIP data. Group sizes (rounded), deaths in 5 years of followup, and death rates per 1000 women randomized.


Which rates show the efficacy of treatment? It seems natural to compare those who accepted screening to those who refused. However, this is an observational comparison, even though it occurs in the middle of an experiment. The investigators decided which subjects would be invited to screening, but it is the subjects themselves who decided whether or not to accept the invitation. Richer and better-educated subjects were more likely to participate than those who were poorer and less well educated. Furthermore, breast cancer (unlike most other diseases) hits the rich harder than the poor. Social status is therefore a confounder—a factor associated with the outcome and with the decision to accept screening.

The tip-off is the death rate from other causes (not breast cancer) in the last column of table 1. There is a big difference between those who accept screening and those who refuse. The refusers have almost double the risk of those who accept. There must be other differences between those who accept screening and those who refuse, in order to account for the doubling in the risk of death from other causes—because screening has no effect on that risk. One major difference is social status. It is the richer women who come in for screening. Richer women are less vulnerable to other diseases but more vulnerable to breast cancer. So the comparison of those who accept screening with those who refuse is biased, and the bias is against screening.

Comparing the death rate from breast cancer among those who accept screening and those who refuse is analysis by treatment received. This analysis is seriously biased, as we have just seen. The experimental comparison is between the whole treatment group—all those invited to be screened, whether or not they accepted screening—and the whole control group. This is the intention-to-treat analysis.

Intention-to-treat is the recommended analysis.

HIP, which was a very well-run study, made the intention-to-treat analysis. The investigators compared the breast cancer death rate in the total treatment group to the rate in the control group, and showed that screening works. The effect of the invitation is small in absolute terms: 63 − 39 = 24 lives saved (table 1). Since the absolute risk from breast cancer is small, no intervention can have a large effect in absolute terms. On the other hand, in relative terms, the 5-year death rates from breast cancer are in the ratio 39/63 = 62%. Followup continued for 18 years, and the savings in lives persisted over that period. The Two-County study—a huge randomized controlled experiment in Sweden—confirmed the results of HIP. So did other studies in Finland, Scotland, and Sweden. That is why mammography became so widely accepted.


1.3 Snow on cholera

A natural experiment is an observational study where assignment to treatment or control is as if randomized by nature. In 1855, some twenty years before Koch and Pasteur laid the foundations of modern microbiology, John Snow used a natural experiment to show that cholera is a waterborne infectious disease. At the time, the germ theory of disease was only one of many theories. Miasmas (foul odors, especially from decaying organic material) were often said to cause epidemics. Imbalance in the humors of the body—black bile, yellow bile, blood, phlegm—was an older theory. Poison in the ground was an explanation that came into vogue slightly later.

Snow was a physician in London. By observing the course of the disease, he concluded that cholera was caused by a living organism which entered the body with water or food, multiplied in the body, and made the body expel water containing copies of the organism. The dejecta then contaminated food or reentered the water supply, and the organism proceeded to infect other victims. Snow explained the lag between infection and disease—a matter of hours or days—as the time needed for the infectious agent to multiply in the body of the victim. This multiplication is characteristic of life: inanimate poisons do not reproduce themselves. (Of course, poisons may take some time to do their work: the lag is not compelling evidence.)

Snow developed a series of arguments in support of the germ theory. For instance, cholera spread along the tracks of human commerce. Furthermore, when a ship entered a port where cholera was prevalent, sailors contracted the disease only when they came into contact with residents of the port. These facts were easily explained if cholera was an infectious disease, but were hard to explain by the miasma theory.

There was a cholera epidemic in London in 1848. Snow identified the first or “index” case in this epidemic:

“a seaman named John Harnold, who had newly arrived by the Elbe steamer from Hamburgh, where the disease was prevailing.” [p. 3]

He also identified the second case: a man named Blenkinsopp who took Harnold’s room after the latter died, and became infected by contact with the bedding. Next, Snow was able to find adjacent apartment buildings, one hard hit by cholera and one not. In each case, the affected building had a water supply contaminated by sewage, the other had relatively pure water. Again, these facts are easy to understand if cholera is an infectious disease—but not if miasmas are the cause.

There was an outbreak of the disease in August and September of 1854. Snow made a “spot map,” showing the locations of the victims. These clustered near the Broad Street pump. (Broad Street is in Soho, London; at the time, public pumps were used as a source of drinking water.) By contrast, there were a number of institutions in the area with few or no fatalities. One was a brewery. The workers seemed to have preferred ale to water; if any wanted water, there was a private pump on the premises. Another institution almost free of cholera was a poor-house, which too had its own private pump. (Poor-houses will be discussed again, in section 4.)

People in other areas of London did contract the disease. In most cases, Snow was able to show they drank water from the Broad Street pump. For instance, one lady in Hampstead so much liked the taste that she had water from the Broad Street pump delivered to her house by carter.

So far, we have persuasive anecdotal evidence that cholera is an infectious disease, spread by contact or through the water supply. Snow also used statistical ideas. There were a number of water companies in the London of his time. Some took their water from heavily contaminated stretches of the Thames river. For others, the intake was relatively uncontaminated.

Snow made “ecological” studies, correlating death rates from cholera in various areas of London with the quality of the water. Generally speaking, areas with contaminated water had higher death rates. The Chelsea water company was exceptional. This company started with contaminated water, but had quite modern methods of purification—with settling ponds and careful filtration. Its service area had a low death rate from cholera.

In 1852, the Lambeth water company moved its intake pipe upstream to get purer water. The Southwark and Vauxhall company left its intake pipe where it was, in a heavily contaminated stretch of the Thames. Snow made an ecological analysis comparing the areas serviced by the two companies in the epidemics of 1853–54 and in earlier years. Let him now continue in his own words.

“Although the facts shown in the above table [the ecological analysis] afford very strong evidence of the powerful influence which the drinking of water containing the sewage of a town exerts over the spread of cholera, when that disease is present, yet the question does not end here; for the intermixing of the water supply of the Southwark and Vauxhall Company with that of the Lambeth Company, over an extensive part of London, admitted of the subject being sifted in such a way as to yield the most incontrovertible proof on one side or the other. In the subdistricts enumerated in the above table as being supplied by both Companies, the mixing of the supply is of the most intimate kind. The pipes of each Company go down all the streets, and into nearly all the courts and alleys. A few houses are supplied by one Company and a few by the other, according to the decision of the owner or occupier at that time when the Water Companies were in active competition. In many cases a single house has a supply different from that on either side. Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies. Now it must be evident that, if the diminution of cholera, in the districts partly supplied with improved water, depended on this supply, the houses receiving it would be the houses enjoying the whole benefit of the diminution of the malady, whilst the houses supplied with the [contaminated] water from Battersea Fields would suffer the same mortality as they would if the improved supply did not exist at all. As there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded, it is obvious that no experiment could have been devised which would more thoroughly test the effect of water supply on the progress of cholera than this, which circumstances placed ready made before the observer.

“The experiment, too, was on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into groups without their choice, and in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients; the other group having water quite free from such impurity.

“To turn this grand experiment to account, all that was required was to learn the supply of water to each individual house where a fatal attack of cholera might occur.” [pp. 74–75]

Snow's data are shown in table 2. The denominator data—the number of houses served by each water company—were available from parliamentary records. For the numerator data, however, a house-to-house canvass was needed to determine the source of the water supply at the address of each cholera fatality. (The "bills of mortality," as death certificates were called at the time, showed the address but not the water source for each victim.) The death rate from the Southwark and Vauxhall water is about 9 times the death rate for the Lambeth water. Snow explains that the data could be analyzed as

Table 2. Death rate from cholera, by source of water. Rate per 10,000 houses. London epidemic of 1854. Snow's table IX.

                          No. of houses   Cholera deaths   Rate per 10,000
Southwark & Vauxhall           40,046          1,263             315
Lambeth                        26,107             98              37
Rest of London                256,423          1,422              59


if they had resulted from a randomized controlled experiment: there was no difference between the customers of the two water companies, except for the water. The data analysis is simple—a comparison of rates. It is the design of the study and the size of the effect that compel conviction.
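Snow's comparison of rates is easy to reproduce. A minimal sketch in Python, using the houses-and-deaths figures from Snow's table IX (table 2); the variable names are ours:

```python
# Death rate per 10,000 houses for each water supply, as in Snow's table IX.
# Values: (houses served, cholera deaths).
data = {
    "Southwark & Vauxhall": (40_046, 1_263),
    "Lambeth": (26_107, 98),
    "Rest of London": (256_423, 1_422),
}

rates = {}
for company, (houses, deaths) in data.items():
    rates[company] = deaths / houses * 10_000
    print(f"{company:22s} {rates[company]:6.0f} per 10,000 houses")

# Southwark & Vauxhall vs Lambeth: roughly a 9-fold difference in rates.
ratio = rates["Southwark & Vauxhall"] / rates["Lambeth"]
```

The whole analysis is two divisions and a ratio; the strength of the study lies in the design, not the computation.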

1.4 Yule on the causes of poverty

Legendre (1805) and Gauss (1809) developed regression techniques to fit data on orbits of astronomical objects. The relevant variables were known from Newtonian mechanics, and so were the functional forms of the equations connecting them. Measurement could be done with high precision. Much was known about the nature of the errors in the measurements and equations. Furthermore, there was ample opportunity for comparing predictions to reality. A century later, investigators were using regression on social science data where these conditions did not hold, even to a rough approximation—with consequences that need to be explored (chapters 4–9).

Yule (1899) was studying the causes of poverty. At the time, paupers in England were supported either inside grim Victorian institutions called "poor-houses" or outside, depending on the policy of local authorities. Did policy choices affect the number of paupers? To study this question, Yule proposed a regression equation,

(1) ΔPaup = a + b×ΔOut + c×ΔOld + d×ΔPop + error.

In this equation,

Δ is percentage change over time,
Paup is the percentage of paupers,
Out is the out-relief ratio N/D,
N = number on welfare outside the poor-house,
D = number inside,
Old is the percentage of the population aged over 65,
Pop is the population.

Data are from the English Censuses of 1871, 1881, 1891. There are two Δ's, one for 1871–81 and one for 1881–91. (Error terms will be discussed later.)

Relief policy was determined separately in each "union" (an administrative district comprising several parishes). At the time, there were about 600 unions, and Yule divided them into four kinds: rural, mixed, urban, metropolitan. There are 4×2 = 8 equations, one for each type of union and time period. Yule fitted his equations to the data by least squares. That is, he determined a, b, c, and d by minimizing the sum of squared errors,

 

Σ (ΔPaup − a − b×ΔOut − c×ΔOld − d×ΔPop)².


The sum is taken over all unions of a given type in a given time period, which assumes (in effect) that coefficients are constant for those combinations of geography and time.
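Yule's minimization can be sketched with modern tools. The arrays below are made-up percentage changes for six hypothetical unions, not Yule's data; only the procedure (least squares with an intercept) is the point:

```python
import numpy as np

# Made-up values of the percentage changes (ratio of censuses minus 100).
d_paup = np.array([-73.0, -40.0, -20.0, 10.0, 5.0, -15.0])
d_out = np.array([-95.0, -60.0, -30.0, 20.0, 10.0, -25.0])
d_old = np.array([4.0, 8.0, 2.0, 5.0, 1.0, 3.0])
d_pop = np.array([36.0, 20.0, 10.0, -5.0, 15.0, 8.0])

# Design matrix: a column of 1's for the intercept a, then the explanatory variables.
X = np.column_stack([np.ones_like(d_paup), d_out, d_old, d_pop])

# Least squares chooses a, b, c, d to minimize the sum of squared errors.
coef, *_ = np.linalg.lstsq(X, d_paup, rcond=None)
errors = d_paup - X @ coef
sse = float(errors @ errors)
```

At the minimum, the errors are orthogonal to every column of the design matrix; any other choice of coefficients gives a larger sum of squares. Chapter 4 proves this.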

Table 3. Pauperism, out-relief ratio, proportion of old, population. Ratio of 1881 data to 1871 data, times 100. Metropolitan unions, England. Yule (1899, table XIX).

Paup    Out    Old    Pop


For example, consider the metropolitan unions. Fitting the equation to the data for 1871–81, Yule got

(2) ΔPaup = 13.19 + 0.755 ΔOut − 0.022 ΔOld − 0.322 ΔPop + error.

For 1881–91, his equation was

(3) ΔPaup = 1.36 + 0.324 ΔOut + 1.37 ΔOld − 0.369 ΔPop + error.

The coefficient of ΔOut being relatively large and positive, Yule concludes that out-relief causes poverty.

Let's take a look at some of the details. Table 3 has the ratio of 1881 data to 1871 data for Pauperism, Out-relief ratio, Proportion of Old, and Population. If we subtract 100 from each entry in the table, column 1 gives ΔPaup in the regression equation (2); columns 2, 3, 4 give the other variables. For Kensington (the first union in the table),

ΔOut = 5 − 100 = −95, ΔOld = 104 − 100 = 4, ΔPop = 136 − 100 = 36.

The predicted value for ΔPaup from (2) is therefore

13.19 + 0.755×(−95) − 0.022×4 − 0.322×36 ≈ −70.

The actual value for ΔPaup is −73. So the error is −3. As noted before, the coefficients were chosen by Yule to minimize the sum of squared errors. (In chapter 4, we will see how to do this.)
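The Kensington arithmetic is easy to check in a couple of lines of Python:

```python
# Coefficients from equation (2); Kensington's changes from table 3.
a, b, c, d = 13.19, 0.755, -0.022, -0.322
d_out, d_old, d_pop = -95, 4, 36

predicted = a + b * d_out + c * d_old + d * d_pop  # -70.215, i.e. about -70
actual = -73
error = actual - predicted                          # about -3
```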

Look back at equation (2). The causal interpretation of the coefficient 0.755 is this. Other things being equal, if ΔOut is increased by 1 percentage point—the administrative district supports more people outside the poor-house—then ΔPaup will go up by 0.755 percentage points. This is a quantitative inference. Out-relief causes an increase in pauperism—a qualitative inference. The point of introducing ΔPop and ΔOld into the equation is to control for possible confounders, implementing the idea of "other things being equal." For Yule's argument, it is important that the coefficient of ΔOut be significantly positive. Qualitative inferences are often the important ones; with regression, the two aspects are woven together.

Quetelet (1835) wanted to uncover "social physics"—the laws of human behavior—by using statistical technique. Yule was using regression to infer the social physics of poverty. But this is not so easily done. Confounding is one problem. According to Pigou, a leading welfare economist of Yule's era, districts with more efficient administrations were building poor-houses


and reducing poverty. Efficiency of administration is then a confounder, influencing both the presumed cause and its effect. Economics may be another confounder. Yule occasionally describes the rate of population change as a proxy for economic growth. Generally, however, he pays little attention to economics. The explanation:

"A good deal of time and labour was spent in making trial of this idea, but the results proved unsatisfactory, and finally the measure was abandoned altogether." [p. 253]

The form of Yule's equation is somewhat arbitrary, and the coefficients are not consistent across time and geography: compare equations (2) and (3) to see differences across time. Differences across geography are reported in table C of Yule's paper. The inconsistencies may not be fatal. However, unless the coefficients have some existence of their own—apart from the data—how can they predict the results of interventions that would change the data? The distinction between parameters and estimates is a basic one, and we will return to this issue several times in chapters 4–9.

There are other problems too. At best, Yule has established association. Conditional on the covariates, there is a positive association between ΔPaup and ΔOut. Is this association causal? If so, which way do the causal arrows point? For instance, a parish may choose not to build poor-houses in response to a short-term increase in the number of paupers, in which case pauperism causes out-relief. Likewise, the number of paupers in one area may well be affected by relief policy in neighboring areas. Such issues are not resolved by the data analysis. Instead, answers are assumed a priori. Yule's enterprise is substantially more problematic than Snow on cholera, or the HIP trial, or the epidemiology of smoking.

Yule was aware of the problems. Although he was busily parceling out changes in pauperism—so much is due to changes in the out-relief ratio, so much to changes in other variables, and so much to random effects—there is one deft footnote (number 25) that withdraws all causal claims: "Strictly speaking, for 'due to' read 'associated with.'"

Yule's approach is strikingly modern, except there is no causal diagram with stars to indicate statistical significance. Figure 1 brings him up to date. The arrow from ΔOut to ΔPaup indicates that ΔOut is included in the regression equation explaining ΔPaup. "Statistical significance" is indicated by an asterisk, and three asterisks signal a high degree of significance. The idea is that a statistically significant coefficient differs from zero, so that ΔOut has a causal influence on ΔPaup. By contrast, an insignificant coefficient is considered to be zero: e.g., ΔOld does not have a causal influence on ΔPaup. We return to these issues in chapter 6.


Figure 1. Yule's model. Metropolitan unions, 1871–81.

ΔPaup, given the values of ΔOut, ΔOld, ΔPop. This assumes linearity. If we turn to prediction, there is another assumption: the system will remain stable over time. Prediction is already more complicated than description. On the other hand, if we make a series of predictions and test them against data, it may be possible to show that the system is stable enough for regression to be helpful.

Causal inference is different, because a change in the system is contemplated—an intervention. Descriptive statistics tell you about the data that you happen to have. Causal models claim to tell you what will happen to some of the numbers if you intervene to change other numbers. This is a claim worth examining. Something has to remain constant amidst the changes. What is this, and why is it constant? Chapters 4 and 5 will explain how to fit regression equations like (2) and (3). Chapter 6 discusses some examples from contemporary social science, and examines the constancy-in-the-midst-of-changes assumptions that justify causal inference by statistical models. Response schedules will be used to formalize the constancy assumptions.


4. Was Yule's study a randomized controlled experiment or an observational study?

5. In equation (2), suppose the coefficient of ΔOut had been −0.755. What would Yule have had to conclude? If the coefficient had been +0.005?

Exercises 6–8 prepare for the next chapter. If the material is unfamiliar, you might want to read chapters 16–18 in Freedman-Pisani-Purves (2007), or similar material in another text. Keep in mind that

variance = (standard error)².

6. Suppose X1, X2, ..., Xn are independent random variables, with common expectation μ and variance σ². Let Sn = X1 + X2 + · · · + Xn. Find the expectation and variance of Sn. Repeat for Sn/n.

7. Suppose X1, X2, ..., Xn are independent random variables, with a common distribution: P(Xi = 1) = p and P(Xi = 0) = 1 − p, where 0 < p < 1. Let Sn = X1 + X2 + · · · + Xn. Find the expectation and variance of Sn. Repeat for Sn/n.

8. What is the law of large numbers?

9. Keefe et al (2001) summarize their data as follows:

"Thirty-five patients with rheumatoid arthritis kept a diary for 30 days. The participants reported having spiritual experiences, such as a desire to be in union with God, on a frequent basis. On days that participants rated their ability to control pain using religious coping methods as high, they were much less likely to have joint pain."

Does the study show that religious coping methods are effective at controlling joint pain? If not, how would you explain the data?

10. According to many textbooks, association is not causation. To what extent do you agree? Discuss briefly.
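Exercises 6–8 can be explored by simulation. A sketch for the 0–1 variables of exercise 7, with p = 0.3 (our choice): theory says E(Sn/n) = p and var(Sn/n) = p(1 − p)/n, and the law of large numbers says Sn/n settles down near p as n grows.

```python
import random

random.seed(0)

p, n, reps = 0.3, 1000, 2000

sample_means = []
for _ in range(reps):
    s_n = sum(1 if random.random() < p else 0 for _ in range(n))  # S_n
    sample_means.append(s_n / n)

avg = sum(sample_means) / reps
emp_var = sum((m - avg) ** 2 for m in sample_means) / reps

print(avg)      # close to p = 0.3
print(emp_var)  # close to p*(1 - p)/n = 0.00021
```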

1.5 End notes for chapter 1

Experimental design is a topic in itself. For instance, many experiments block subjects into relatively homogeneous groups. Within each group, some are chosen at random for treatment, and the rest serve as controls. Blinding is another important topic. Of course, experiments can go off the rails. For one example, see EC/IC Bypass Study Group (1985), with commentary by Sundt (1987) and others. The commentary makes the case that management and reporting of this large multi-center surgery trial broke down, with the result that many patients likely to benefit from surgery were operated on outside the trial and excluded from tables in the published report.


Epidemiology is the study of medical statistics. More formally, epidemiology is "the study of the distribution and determinants of health-related states or events in specified populations and the application of this study to control of health problems." See Last (2001, p. 62) and Gordis (2004, p. 3).

Health effects of smoking. See Cornfield et al (1959), International Agency for Research on Cancer (1986). For a brief summary, see Freedman (1999). There have been some experiments on smoking cessation, but these are inconclusive at best. Likewise, animal experiments can be done, but there are difficulties in extrapolating from one species to another. Critical commentary on the smoking hypothesis includes Berkson (1955) and Fisher (1959). The latter makes arguments that are almost perverse. (Nobody's perfect.)

Telephones and breast cancer. The correlation is 0.74 with 165 countries. Breast cancer death rates (age standardized) are from

http://www-dep.iarc.fr/globocan/globocan.html

Population figures, counts of telephone lines (and much else) are available at

http://www.cia.gov/cia/publications/factbook

HIP. The best source is Shapiro et al (1988). The actual randomization mechanism involved list sampling. The differentials in table 1 persist throughout the 18-year followup period, and are more marked if we take cases incident during the first 7 years of followup, rather than 5. Screening ended after 4 or 5 years and it takes a year or two for the effect to be seen, so 7 years is probably the better time period to use.

Intention-to-treat measures the effect of assignment, not the effect of screening. The effect of screening is diluted by crossover—only 2/3 of the women came in for screening. When there is crossover from the treatment arm to the control arm, but not the reverse, it is straightforward to correct for dilution. The effect of screening is to reduce the death rate from breast cancer by a factor of 2. This estimate is confirmed by results from the Two-County study. See Freedman et al (2004) for a review; correcting for dilution is discussed there, on p. 72; also see Freedman (2006b).

Subjects in the treatment group who accepted screening had a much lower death rate from all causes other than breast cancer (table 1). Why? For one thing, the compliers were richer and better educated; mortality rates decline as income and education go up. Furthermore, the compliers probably took better care of themselves in general. See section 2.2 in Freedman-Pisani-Purves (2007); also see Petitti (1994).

Recently, questions about the value of mammography have again been raised, but the evidence from the screening trials is quite solid. For reviews, see Smith (2003) and Freedman et al (2004).


Snow on cholera. At the end of the 19th century, there was a burst of activity in microbiology. In 1878, Pasteur published La théorie des germes et ses applications à la médecine et à la chirurgie. Around that time, Pasteur and Koch isolated the anthrax bacillus and developed techniques for vaccination. The tuberculosis bacillus was next. In 1883, there were cholera epidemics in Egypt and India, and Koch isolated the vibrio (prior work by Filippo Pacini in 1854 had been forgotten).

There was another epidemic in Hamburg in 1892. The city fathers turned to Max von Pettenkofer, a leading figure in the German hygiene movement of the time. He did not believe Snow's theory, holding instead that cholera was caused by poison in the ground. Hamburg was a center of the slaughterhouse industry: von Pettenkofer had the carcasses of dead animals dug up and hauled away, in order to reduce pollution of the ground. The epidemic continued until the city lost faith in von Pettenkofer, and turned in desperation to Koch.

References on the history of cholera include Rosenberg (1962), Howard-Jones (1975), Evans (1987), Winkelstein (1995). Today, the molecular biology of the cholera vibrio is reasonably well understood. There are surveys by Colwell (1996) and Raufman (1998). For a synopsis, see Alberts et al (1994, pp. 484, 738). For valuable detail on Snow's work, see Vinten-Johansen et al (2003). Also see http://www.ph.ucla.edu/epi/snow.html

In the history of epidemiology, there are many examples like Snow's work on cholera. For instance, Semmelweis (1860) discovered the cause of puerperal fever. There is a lovely book by Loudon (2000) that tells the history, although Semmelweis could perhaps have been treated a little more gently. Around 1914, to mention another example, Goldberger showed that pellagra was the result of a diet deficiency. Terris (1964) reprints many of Goldberger's articles; also see Carpenter (1981). The history of beriberi research is definitely worth reading (Carpenter, 2000).

Quetelet. A few sentences will indicate the flavor of his enterprise.

"In giving my work the title of Social Physics, I have had no other aim than to collect, in a uniform order, the phenomena affecting man, nearly as physical science brings together the phenomena appertaining to the material world. . . . in a given state of society, resting under the influence of certain causes, regular effects are produced, which oscillate, as it were, around a fixed mean point, without undergoing any sensible alterations. . . .

"This study . . . has too many attractions—it is connected on too many sides with every branch of science, and all the most interesting questions in philosophy—to be long without zealous observers, who will endeavour to carry it farther and farther, and bring it more and more to the appearance of a science." (Quetelet 1842, pp. vii, 103)


Yule. The "errors" in (1) and (2) play different roles in the theory. In (1), we have random errors, which are unobservable parts of a statistical model. In (2), we have residuals, which can be computed as part of model fitting; (3) is like (2). Details are in chapter 4. For sympathetic accounts of the history, see Stigler (1986) and Desrosières (1993). Meehl (1954) provides some well-known examples of success in prediction by regression. Predictive validity is best demonstrated by making real "ex ante"—before the fact—forecasts in several different contexts: predicting the future is a lot harder than fitting regression equations to the past (Ehrenberg and Bound 1993).

John Stuart Mill. The contrast between experiment and observation goes back to Mill (1843), as does the idea of confounding. (In the seventh edition, see Book III, Chapters VII and X, esp. pp. 423 and 503.)

Experiments vs observational studies. Fruits-and-vegetables epidemiology is a well-known case where experiments contradict observational data. In brief, the observational data say that people who eat a vitamin-rich diet get cancer at lower rates, "so" vitamins prevent cancer. The experiments say that vitamin supplements either don't help or actually increase the risk.

The problem with the observational studies is that people who eat (for example) five servings of fruit and vegetables every day are different from the rest of us in many other ways. It is hard to adjust for all these differences by purely statistical methods (Freedman-Pisani-Purves, 2007, p. 26 and note 23 on p. A6). Research papers include Clarke and Armitage (2002), Virtamo et al (2003), Lawlor et al (2004), Cook et al (2007). Hercberg et al (2004) get a positive effect for men, not women.

Hormone replacement therapy (HRT) is another example (Petitti 1998, 2002). The observational studies say that HRT prevents heart disease in women, after menopause. The experiments show that HRT has no benefit. The women who chose HRT were different from other women, in ways that the observational studies missed. We will discuss HRT again in chapter 7. Ioannidis (2005) shows that by comparison with experiments, across a variety of interventions, observational studies are much less likely to give results which can be replicated. Also see Kunz and Oxman (1998).

Anecdotal evidence—based on individual cases, without a systematic comparison of different groups—is a weak basis for causal inference. If there is no control group in a study, considerable skepticism is justified, especially if the effect is small or hard to measure. When the effect is dramatic, as with penicillin for wound infection, these statistical caveats can be set aside. On penicillin, see Goldsmith (1946), Fleming (1947), Hare (1970), Walsh (2003). Smith and Pell (2004) have a good—and brutally funny—discussion of causal inference when effects are large.


The Regression Line

2.1 Introduction

This chapter is about the regression line. The regression line is important on its own (to statisticians), and it will help us with multiple regression in chapter 4. The first example is a scatter diagram showing the heights of 1078 fathers and their sons (figure 1). Each pair of fathers and sons becomes a dot on the diagram. The height of the father is plotted on the x-axis; the height of his son, on the y-axis. The left hand vertical strip (inside the chimney) shows the families where the father is 64 inches tall to the nearest inch; the right hand vertical strip, families where the father is 72 inches tall. Many other strips could be drawn too. The regression line approximates the average height of the sons, given the heights of their fathers. This line goes through the centers of all the vertical strips. The regression line is flatter than the SD line, which is dashed. "SD" is shorthand for "standard deviation"; definitions come next.

2.2 The regression line

We have n subjects indexed by i = 1, ..., n, and two data variables x and y. A data variable stores a value for each subject in a study. Thus, xi is the value of x for subject i, and yi is the value of y. In figure 1, a "subject" is a family: xi is the height of the father in family i, and yi is the height of


Figure 1. Heights of fathers and sons. Pearson and Lee (1903).

the son. For Yule (section 1.4), a "subject" might be a metropolitan union, with xi = ΔOut for union i, and yi = ΔPaup.

The regression line is computed from five summary statistics: (i) the average of x, (ii) the SD of x, (iii) the average of y, (iv) the SD of y, and (v) the correlation between x and y. The calculations can be organized as follows, with "variance" abbreviated to "var"; the formulas for ȳ and var(y) are similar:

(1) x̄ = (x1 + · · · + xn)/n,  var x = (1/n) Σ (xi − x̄)²,

(2) sx = √(var x),

(3) ȳ and sy are computed the same way from the yi,

(4) r = (1/n) Σ [(xi − x̄)/sx] × [(yi − ȳ)/sy].


We're tacitly assuming sx ≠ 0 and sy ≠ 0. Necessarily, −1 ≤ r ≤ 1: see exercise B16 below. The correlation between x and y is often written as r(x, y). Let sign(r) = +1 when r > 0 and sign(r) = −1 when r < 0. The regression line is flatter than the SD line, by (5) and (6) below.

(5) The regression line of y on x goes through the point of averages (x̄, ȳ). The slope is r·sy/sx. The intercept is ȳ − slope·x̄.

(6) The SD line also goes through the point of averages. The slope is sign(r)·sy/sx. The intercept is ȳ − slope·x̄.

Figure 2. Graph of averages. The dots show the average height of the sons, for each value of father's height. The regression line (solid) follows the dots: it is flatter than the SD line (dashed).


The regression of y on x, also called the regression line for predicting y from x, is a linear approximation to the graph of averages, which shows the average value of y for each x (figure 2).

Correlation is a key concept. Figure 3 shows the correlation coefficient for three scatter diagrams. All the diagrams have the same number of points (n = 50), the same means (x̄ = ȳ = 50), and the same SDs (sx = sy = 15). The shapes are very different. The correlation coefficient r tells you about the shapes. (If the variables aren't paired—two numbers for each subject—you won't be able to compute the correlation coefficient or regression line.)

Figure 3. Three scatter diagrams. The correlation measures the extent to which the scatter diagram is packed in around a line. If the sign is positive, the line slopes up. If the sign is negative, the line slopes down (not shown here).

(Panel titles include CORR 0.50 and CORR 0.90; the axes run from 0 to 100.)

If you use the line y = a + bx to predict y from x, the error or residual for subject i is ei = yi − a − bxi, and the MSE is

MSE = (1/n)(e1² + e2² + · · · + en²).

The RMS error is the square root of the MSE. For the regression line, as will be seen later, the MSE equals (1 − r²) var y. The abbreviations: MSE stands for mean square error; RMS, for root mean square.

A Theorem due to C.-F. Gauss. Among all lines, the regression line has the smallest MSE.

A more general theorem will be proved in chapter 3. If the material in sections 1–2 is unfamiliar, you might want to read chapters 8–12 in Freedman-Pisani-Purves (2007).
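Both the Gauss theorem and the identity MSE = (1 − r²) var y can be checked numerically. A sketch on simulated data (the data-generating numbers are arbitrary; the 1/n conventions of this chapter are used throughout):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(50, 15, size=n)
y = 0.6 * x + rng.normal(20, 10, size=n)

# Regression line from the five summary statistics.
r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()
intercept = y.mean() - slope * x.mean()

def mse(a, b):
    e = y - a - b * x
    return float(np.mean(e ** 2))

reg_mse = mse(intercept, slope)

# Identity: the regression line's MSE is (1 - r^2) var y.
print(np.isclose(reg_mse, (1 - r ** 2) * y.var()))  # True

# Gauss: any other line has a bigger MSE.
print(mse(intercept + 0.5, slope) > reg_mse)        # True
print(mse(intercept, 1.1 * slope) > reg_mse)        # True
```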


2.3 Hooke’s law

A weight is hung on the end of a spring whose length under no load is a. The spring stretches to a new length. According to Hooke's law, the amount of stretch is proportional to the weight. If you hang weight xi on the spring, the length is

(7) Yi = a + b·xi + εi, for i = 1, ..., n.

Equation (7) is a regression model. In this equation, a and b are constants that depend on the spring. The values are unknown, and have to be estimated from data. These are parameters. The εi are independent, identically distributed, mean 0, variance σ². These are random errors, or disturbances. The variance σ² is another parameter. You choose xi, the weight on occasion i. The response Yi is the length of the spring under the load. You do not see a, b, or the εi.

Table 1 shows the results of an experiment on Hooke's law, done in a physics class at U.C. Berkeley. The first column shows the load. The second column shows the measured length. (The "spring" was a long piece of piano wire hung from the ceiling of a big lecture hall.)

Table 1. An experiment on Hooke's law.

Weight (kg)    Length (cm)
     0           439.00
     2           439.12
     4           439.21
     6           439.31
     8           439.40
    10           439.50

We use the method of least squares to estimate the parameters a and b. In other words, we fit the regression line. The slope is

b̂ = Σ (xi − x̄)(Yi − Ȳ) / Σ (xi − x̄)².

The intercept is

â = Ȳ − b̂·x̄.

The fitted model is

(8) Yi = â + b̂·xi + ei,

where the ei are the residuals.


There are two conclusions. (i) Putting a weight on the spring makes it longer. (ii) Each extra kilogram of weight makes the spring about 0.05 centimeters longer. The first is a (pretty obvious) qualitative inference; the second is quantitative. The distinction between qualitative and quantitative inference will come up again in chapter 6.

Exercise set A

1. In the Pearson-Lee data, the average height of the fathers was 67.7 inches; the SD was 2.74 inches. The average height of the sons was 68.7 inches; the SD was 2.81 inches. The correlation was 0.501.

(a) True or false and explain: because the sons average an inch taller than the fathers, if the father is 72 inches tall, it's 50–50 whether the son is taller than 73 inches.

(b) Find the regression line of son's height on father's height, and its RMS error.

2. Can you determine a in equation (7) by measuring the length of the spring with no load? With one measurement? Ten measurements? Explain briefly.

3. Use the data in table 1 to find the MSE and the RMS error for the regression line predicting length from weight. Which statistic gives a better sense of how far the data are from the regression line? Hint: keep track of the units, or plot the data, or both.

4. The correlation coefficient is a good descriptive statistic for one of the three diagrams below. Which one, and why?


Looks the same? Take another look. In the regression model (7), we can't see the parameters a, b or the disturbances εi. In the fitted model (8), the estimates â, b̂ are observable, and so are the residuals ei. With a large sample, â ≈ a and b̂ ≈ b, so ei ≈ εi. But the residuals ei are still not the same thing as the disturbances εi.

The list {1, 2, 3, 4, 5, 6} has mean 3.5 and variance 35/12, by formula (1). So far, we have a tiny data set. Random variables are coming next.

Throw a die n times. (A die has six faces, all equally likely; one face has 1 spot, another face has 2 spots, and so forth, up to 6.) Let Ui be the number of spots on the ith roll, for i = 1, ..., n. The Ui are (better, are modeled as) independent, identically distributed random variables—like choosing numbers at random from the list {1, 2, 3, 4, 5, 6}. Each random variable has mean (expectation, aka expected value) equal to 3.5, and variance equal to 35/12. Here, mean and variance have been applied to a random variable—the number of spots when you throw a die.

The sample mean and the sample variance are

Ū = (U1 + · · · + Un)/n,  var{U1, ..., Un} = (1/n) Σ (Ui − Ū)².

These are estimates of the expectation and variance, respectively, of Ui. When n is large,

(10) Ū ≈ E(Ui) = 3.5,  var{U1, ..., Un} ≈ var(Ui) = 35/12.

That is how the expectation and variance of a random variable are estimated from repeated observations.
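A quick simulation makes (10) concrete. With n = 100,000 simulated rolls (the sample size is our choice), the sample mean and sample variance land close to 3.5 and 35/12, which is about 2.92:

```python
import random

random.seed(42)

n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]    # U_1, ..., U_n

u_bar = sum(rolls) / n                              # sample mean
var_hat = sum((u - u_bar) ** 2 for u in rolls) / n  # sample variance

print(u_bar)    # near E(U_i) = 3.5
print(var_hat)  # near var(U_i) = 35/12
```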
