BEFORE YOU DRAW CONCLUSIONS


Before you draw conclusions, be sure you have accounted for all missing data, interviewed nonresponders, and determined whether the data were missing at random or were specific to one or more subgroups.

During the Second World War, a group was studying planes returning from bombing Germany. They drew a rough diagram showing where the bullet holes were and recommended those areas be reinforced. A statistician, Abraham Wald [1980],¹⁰ pointed out that essential data were missing from the sample they were studying. What about the planes that didn’t return from Germany?

¹⁰ This reference may be hard to obtain. Alternatively, see Mangel and Samaniego [1984].

When we think along these lines, we see that the two areas of the plane that had almost no bullet holes (where the wings and the tail joined the fuselage) are crucial. Bullet holes are likely to occur at random over the entire plane; their absence in those two areas in returning bombers was diagnostic. Do the data missing from your experiments and surveys also have a story to tell?

Induction

“Behold! human beings living in an underground den, which has a mouth open towards the light and reaching all along the den; here they have been from their childhood, and have their legs and necks chained so that they cannot move, and can only see before them, being prevented by the chains from turning round their heads. Above and behind them a fire is blazing at a distance, and between the fire and the prisoners there is a raised way; and you will see, if you look, a low wall built along the way, like the screen which marionette players have in front of them, over which they show the puppets.”

“And they see only their own shadows, or the shadows of one another, which the fire throws on the opposite wall of the cave.”

“To them, I said, the truth would be literally nothing but the shadows of the images.”

The Allegory of the Cave (Plato, The Republic, Book VII).

Never assign probabilities to the true state of nature, but only to the validity of your own predictions.

A p-value does not tell us the probability that a hypothesis is true, nor does a significance level apply to any specific sample; the latter is a characteristic of our testing in the long run. Likewise, if all assumptions are satisfied, a confidence interval will in the long run contain the true value of the parameter a certain percentage of the time. But we cannot say with certainty in any specific case that the parameter does or does not belong to that interval (Neyman, 1961, 1977).
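This long-run reading is easy to demonstrate by simulation. The sketch below is our own illustration; the true mean, standard deviation, sample size, and number of repetitions are arbitrary assumptions. It draws repeated samples from a known normal population and records how often the usual 95% t-interval covers the true mean.

```python
import numpy as np
from scipy import stats

# A minimal sketch of the long-run interpretation of a confidence
# interval: over repeated samples, the usual 95% t-interval for a
# normal mean covers the true value about 95% of the time, yet nothing
# certain can be said about any single interval. All settings here
# (mu_true, sigma, n, trials) are arbitrary choices for illustration.
rng = np.random.default_rng(seed=1)
mu_true, sigma, n, trials = 10.0, 2.0, 25, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(mu_true, sigma, n)
    m, s = sample.mean(), sample.std(ddof=1)
    half_width = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
    covered += m - half_width <= mu_true <= m + half_width

print(f"Coverage over {trials} repetitions: {covered / trials:.3f}")  # about 0.95
```

Each individual interval either contains the true mean or it does not; the 95% attaches to the procedure in the long run, not to any single interval.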

When we determine a p-value, we apply a set of algebraic methods and deductive logic to deduce the correct value. The deductive process is used to determine the appropriate size of resistor to use in an electric circuit, to determine the date of the next eclipse of the moon, and to establish the identity of the criminal (perhaps from the fact that the dog did not bark on the night of the crime). Find the formula, plug in the values, turn the crank, and out pops the result (or it does for Sherlock Holmes,¹¹ at least).

¹¹ See “Silver Blaze” by A. Conan Doyle, Strand Magazine, December 1892.

When we assert that, for a given population, a certain percentage of samples will have a specific composition, this is a deduction also. But when we make an inductive generalization about a population based upon our analysis of a sample, we are on shakier ground. Newton’s Law of gravitation provided an exact fit to observed astronomical data for several centuries; consequently, there was general agreement that Newton’s generalization from observation was an accurate description of the real world.

Later, as improvements in astronomical measuring instruments extended the range of the observable universe, scientists realized that Newton’s Law was only a generalization and not a property of the universe at all.

Einstein’s Theory of Relativity gives a much closer fit to the data, a fit that has not been contradicted by any observations in the century since its formulation. But this still does not mean that relativity provides us with a complete, correct, and comprehensive view of the universe.

In our research efforts, the only statements we can make with God-like certainty are of the form “our conclusions fit the data.” The true nature of the real world is unknowable. We can speculate, but never conclude.

The gap between the sample and the population will always require a leap of faith. We understand only in so far as we are capable of understanding [Lonergan, 1992].

SUMMARY

Know your objectives in testing. Know your data’s origins. Know the assumptions you feel comfortable with. Never assign probabilities to the true state of nature, but only to the validity of your own predictions. Collecting more and better data may be your best alternative.

TO LEARN MORE

For commentary on the use of wrong or inappropriate statistical methods, see Avram et al. [1985], Badrick and Flatman [1999], Berger et al. [2002], Bland and Altman [1995], Cherry [1998], Dar, Serlin, and Omer [1997], Elwood [1998], Felson, Cupples, and Meenan [1984], Fienberg [1990], Gore, Jones, and Rytter [1977], Lieberson [1985], MacArthur and Jackson [1984], McGuigan [1995], McKinney et al. [1989], Miller [1986], Padaki [1989], Welch and Gabbe [1996], Westgard and Hunt [1973], White [1979], and Yoccuz [1991].

Guidelines for reviewers are provided by Altman [1998a], Bacchetti [2002], Finney [1997], Gardner, Machin, and Campbell [1986], George [1985], Goodman, Altman, and George [1998], International Committee of Medical Journal Editors [1997], Light and Pillemer [1984], Mulrow [1987], Murray [1988], Schor and Karten [1966], and Vaisrub [1985].

For additional comments on the effects of the violation of assumptions, see Box and Anderson [1955], Friedman [1937], Gastwirth and Rubin [1971], Glass, Peckham, and Sanders [1972], and Pettitt and Siskind [1981].

For the details of testing for equivalence, see Dixon [1998]. For a review of the appropriate corrections for multiple tests, see Tukey [1991].

For procedures with which to analyze factorial and other multifactor experimental designs, see Chapter 8 of Pesarin [2001].

Most of the problems with parametric tests reported here extend to and are compounded by multivariate analysis. For some solutions, see Chapter 5 of Good [2000] and Chapter 6 of Pesarin [2001].

For a contrary view on the need for adjustments of p-values in multiple comparisons, see Rothman [1990a].

Venn [1888] and Reichenbach [1949] are among those who’ve attempted to construct a mathematical bridge between what we observe and the reality that underlies our observations. To the contrary, extrapolation from the sample to the population is not a matter of applying Holmes-like deductive logic but entails a leap of faith. A careful reading of Locke [1700], Berkeley [1710], Hume [1748], and Lonergan [1992] is an essential prerequisite to the application of statistics.

For more on the contemporary view of induction, see Berger [2002] and Sterne, Smith, and Cox [2001]. The former notes that, “Dramatic illustration of the non-frequentist nature of p-values can be seen from the applet available at http://www.stat.duke.edu/~berger. The applet assumes one faces a series of situations involving normal data with unknown mean θ and known variance, and tests of the form H: θ = 0 versus K: θ ≠ 0. The applet simulates a long series of such tests, and records how often H is true for p-values in given ranges.”
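The flavor of that demonstration can be reproduced in a few lines. The sketch below is our own illustration, not the code behind Berger’s applet; in particular, the 50% prior probability that H is true and the N(0, 1) distribution of θ under the alternative are assumptions made purely for the simulation.

```python
import numpy as np
from scipy.stats import norm

# A minimal sketch in the spirit of Berger's applet: simulate a long
# series of tests of H: theta = 0 versus K: theta != 0 on normal data
# with known variance, then ask how often H was true among the tests
# whose p-value landed just below 0.05. The prior on H and the
# alternative's distribution are illustrative assumptions.
rng = np.random.default_rng(seed=2)
n_tests, n, sigma = 1_000_000, 10, 1.0

h_true = rng.random(n_tests) < 0.5                    # H holds in half the tests
theta = np.where(h_true, 0.0, rng.normal(0.0, 1.0, n_tests))

se = sigma / np.sqrt(n)
xbar = rng.normal(theta, se)                          # observed sample means
p = 2.0 * (1.0 - norm.cdf(np.abs(xbar) / se))         # two-sided p-values

window = (p > 0.04) & (p < 0.05)                      # p-values "near 0.05"
print(f"Among tests with 0.04 < p < 0.05, H was true "
      f"{h_true[window].mean():.0%} of the time")
```

In this configuration, H turns out to be true in roughly a third of the tests whose p-value falls just below 0.05, far more often than a naive reading of “p < 0.05” would suggest.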


Chapter 6

Strengths and Limitations of Some Miscellaneous Statistical Procedures


The greatest error associated with the use of statistical procedures is to assume that a single statistical methodology can suffice for all applications.

From time to time, a new statistical procedure will be introduced or an old one revived along with the assertion that at last the definitive solution has been found. As is so often the case with religions, at first the new methodology is reviled, even persecuted, until it grows in the number of its adherents, at which time it can begin to attack and persecute the adherents of other, more established dogma in its turn.

During the preparation of this text, an editor of a statistics journal rejected an article of one of the authors on the sole grounds that it made use of permutation methods.

“I’m amazed that anybody is still doing permutation tests . . .” wrote the anonymous reviewer, “There is probably nothing wrong technically with the paper, but I personally would reject it on grounds of irrelevance to current best statistical practice.” To which the editor saw fit to add, “The reviewer is interested in estimation of interaction or main effects in the more general semiparametric models currently studied in the literature. It is well known that permutation tests preserve the significance level but that is all they do is answer yes or no.”¹

¹ A double untruth. First, permutation tests also yield interval estimates; see, for example, Garthwaite [1996]. Second, semiparametric methods are not appropriate for use with small-sample experimental designs, the topic of the submission.

But one methodology can never be better than another, nor can estimation replace hypothesis testing or vice versa. Every methodology has a proper domain of application and another set of applications for which it fails. Every methodology has its drawbacks and its advantages, its assumptions, and its sources of error. Let us seek the best from each statistical procedure.
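To make the footnote’s first point concrete: a two-sample permutation test takes only a few lines. The sketch below is our own illustration on invented data, not the analysis from the rejected submission.

```python
import numpy as np

# A minimal sketch of a two-sample permutation test for a shift in
# means, on invented data. Under the null hypothesis that both groups
# are drawn from one distribution, every relabeling of the pooled
# observations is equally likely, so the observed difference in means
# is referred to the differences produced by random relabelings.
rng = np.random.default_rng(seed=3)
treated = np.array([24.1, 23.8, 26.0, 25.2, 24.9])   # hypothetical responses
control = np.array([22.7, 23.1, 22.5, 24.0, 23.3])

observed = treated.mean() - control.mean()
pooled = np.concatenate([treated, control])
n_t = len(treated)

n_perms = 20_000
extreme = 0
for _ in range(n_perms):
    rng.shuffle(pooled)                              # one random relabeling
    extreme += pooled[:n_t].mean() - pooled[n_t:].mean() >= observed

print(f"One-sided permutation p-value: {extreme / n_perms:.4f}")
```

Nothing about the construction confines it to a yes-or-no answer: testing a family of shifted null hypotheses and collecting the shifts that are not rejected yields the interval estimates the footnote mentions (Garthwaite [1996]).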

The balance of this chapter is devoted to exposing the frailties of four of the “new” (and revived) techniques: bootstrap, Bayesian methods, meta-analysis, and permutation tests.
