
Prep: the Probability of Replicating an Effect

Peter R. Killeen

Arizona State University

Killeen@asu.edu

Abstract

Prep gives the probability that an equally powered replication attempt will provide supportive evidence—an effect of the same sign as the original, or, if preferred, the probability of a significant effect in replication. Prep is based on a standard Bayesian construct, the posterior predictive distribution. It may be used in 3 modes: to evaluate evidence, to inform belief, and to guide action. In the first case the simple prep is used; in the second, it is augmented with estimates of realization variance and informed priors; in the third, it is embedded in a decision theory. Prep throws new light on replicability intervals, multiple comparisons, traditional α levels, and longitudinal studies. As the area under the diagnosticity vs. detectability curve, it constitutes a criterion-free measure of test quality.

The issue

The foundation of science is the replication of experimental effects. But most statistical analyses of experiments test, not whether the results are replicable, but whether they are unlikely if there were truly no effect present. This inverse inference creates many problems of interpretation that have become increasingly evident to the field. One of the consequences has been an uneasy relationship between the science of psychology and its practice. The irritant is the scientific inferential method: not the method of John Stuart Mill, Michael Faraday, or Charles Darwin, but that of Ronald Aylmer Fisher, Egon Pearson, and Jerzy Neyman. All were great scientists or statisticians, Fisher both. But they grappled with scientific problems on different scales of time, space, and complexity than do clinical psychologists. In all cases their goal was to epitomize a phenomenon with simple verbal or mathematical descriptions, and then to show that such a description has legs: that it explains data or predicts outcomes in new situations. But increasingly it is being realized that results in biopsychosocial research are often lame: effect sizes can wither to irrelevance with subsequent replications, and credible authorities claim that "most published research findings are false." Highly significant effects can have negligible therapeutic value. Something in our methodology has failed. Must clinicians now turn away from such toxic "evidence-based" research, back to clinical intuition?

The historical context

In the physical and biological sciences precise numerical predictions can sometimes be made: the variable A should take the value a. A may have been a geological age, a vacuum permittivity, or the deviation of a planetary orbit. The more precise the experiment, the more difficult it is for errant theories to pass muster. In the behavioral sciences it is rare to be able to make the prediction A -> a. Our questions are typically not "does my model of the phenomenon predict the observed numerical outcome?", but rather "is my candidate causal factor C really affecting the process?"; "Does childhood trauma increase the risk of adult PTSD?" It then becomes a test of two candidate models: no effect, A + C ≈ A -> a; or some effect, A + C -> a + c, where we cannot specify c beforehand, but rather prefer that it be large, and in a particular direction. Since typically we also cannot specify the baseline or control level a, we test to see whether the difference in the effects is reliably different from zero: testing experimental (A + C) and control (A) groups, and asking whether a + c ≟ a, that is, does the difference in outcome between experimental and control groups equal zero: (a + c) - a = 0? Since the difference will almost always be different from 0 due to random variation, the question evolves to: is it sufficiently larger than 0 so that we can have some confidence that the effect is real—that is, that it will replicate? We want to know whether (a + c) - a is larger than some criterion. How large should that criterion be?

It was for such situations that Fisher formalized prior work into the analysis of variance (ANOVA). ANOVA estimates the background levels of variability—error, or noise—by combining the variance within each of the groups studied, and asking whether the variability between groups (the treatment effect, or signal) sufficiently exceeds that noise. The signal-to-noise ratio is the F statistic. If the errors are normally distributed and the groups independent, with no true effect (that is, all are drawn from the same population, so that A + C = A, and thus c ≈ 0), we can say precisely how often the F ratio will exceed a criterion α (alpha). If our treatment effect exceeds that value, it is believed to be unlikely that the assumption of "no effect" is true.
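To make the signal-to-noise logic concrete, here is a minimal sketch in Python; the group data and the use of scipy are illustrative assumptions, not material from the article:

```python
import numpy as np
from scipy import stats

# Two illustrative groups (hypothetical data, not from the article)
control = np.array([38, 42, 35, 47, 40, 44, 39, 41, 36, 45], dtype=float)
treated = np.array([48, 52, 45, 57, 50, 54, 49, 51, 46, 55], dtype=float)

groups = [control, treated]
grand_mean = np.mean(np.concatenate(groups))
k = len(groups)                       # number of groups
n_total = sum(len(g) for g in groups)

# Between-groups ("signal") and within-groups ("noise") mean squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

F = ms_between / ms_within                        # the signal-to-noise ratio
F_crit = stats.f.ppf(0.95, k - 1, n_total - k)    # criterion corresponding to alpha = .05
print(F, F_crit, F > F_crit)

# Cross-check against scipy's built-in one-way ANOVA
print(stats.f_oneway(control, treated))
```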


Because of its elegance, robustness, and refinement over the decades, ANOVA and its variants are the most popular inferential statistics in psychology. These virtues derive from certain knowledge of the ideal case, the null hypothesis, with deviations from it being precisely characterized by p-values (significance levels).

But there are problems associated with the uncritical use of such null-hypothesis statistical tests (NHST), problems well known to the experts and repeated anew to every generation of students (e.g., Krueger 2001). Among them: one cannot infer from NHST either the truth or falsity of the null hypothesis; nor can one infer the truth or falsity of the alternative. ANOVA gives the probability of the data assuming the null, not the probability of the null given the data (see, e.g., Nickerson 2000). Yet rejection of the null is de facto the purpose to which the results are typically put. Even if the null is (illogically) rejected, significance levels do not give a clear indication of how replicable a result is. It was to provide such a measure that prep, the probability of replication, was introduced (Killeen 2005a).

The logic of prep

Prep is a probability derived from a Bayesian posterior predictive distribution (ppd). Assume you have conducted a pilot experiment on a new treatment for alleviating depression, involving 20 control and 20 experimental participants, and found that the means and standard deviations were 40 (12) and 50 (15). The effect size, d, the difference of means divided by the pooled estimate of standard deviation (13.6), is a respectable 0.74. Your t-test reports p < .05, indicating that this result is unlikely under the null hypothesis. Is the result replicable? The answer depends on what you consider a replication to be, and what you are willing to assume about the context of the experiment. First the general case, and then the particulars.

Most psychologists know that a sampling distribution gives the probability of finding a statistic, such as the effect size d, given the "true" value of the population parameter, δ (delta). Under the null, δ = 0: the sampling distribution, typically a normal or t-distribution, is centered on 0. If the experimental and control groups are the same size and sum to n, then the variance of the distribution is approximately 4/(n - 4). This is shown in Figure 1. The area to the right of the initial result, d1 = 0.74, is less than α = .05, so the result qualifies as significant. To generate a predictive distribution, move the sampling distribution from 0 to its most likely place. Given knowledge of only your data, that is the obtained effect size, d1 = 0.74. If this were the true effect size δ, then that shifted sampling distribution would also give the probability of a replication: the probability that it would be significant is the area under this distribution that lies to the right of the α cut-off, 0.675, approximately 58%. The probability of a replication returning in the wrong direction is the area under this curve to the left of 0—which equals the 1-tailed p-value of the initial study.

Figure 1. The curve centered on 0 is a sampling distribution for effect size, d, under the null hypothesis. Shifted to the right, it gives the predicted distribution of effect sizes in replications, in case the true effect size, δ, equals the recorded effect size d1. Since we do not know that δ precisely equals d1, because both the initial and replicate will incur sampling error, the variance of the distribution is increased—doubled in the case of an equal-powered replication—to create the posterior predictive distribution (ppd), the intermediate distribution on the right. In the case of a conceptual rather than strict replication, additional realization variance is added, resulting in the lowest ppd. In all cases, the area under the curves to the right of the origin gives the probability of supportive evidence in replication.

If we knew that the true effect size was exactly δ = 0.74, no further experiments would be necessary. But we do not know what δ is; we can only estimate it from the original results. There are thus at least two sources of error: the sampling error in the original, and that in the ensuing replication. This leads to a doubling of the variance in prediction for a systematic replication attempt of the same power. The ppd is located at the obtained estimate of d, d1, and has twice the variance of the sampling distribution—8/(n - 4). The resulting probability of achieving a significant effect in replication, the area to the right of 0.675, shrinks to 55%.
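A minimal numerical sketch of this example (Python; the normal approximation and the variable names are illustrative assumptions):

```python
from scipy.stats import norm

n = 40          # total participants (20 control + 20 experimental)
d1 = 0.74       # obtained effect size
crit = 0.675    # effect size needed for significance at the traditional alpha

var_d = 4 / (n - 4)          # sampling variance of d (the text's approximation)
sd_d = var_d ** 0.5

# Shifted sampling distribution: treat d1 as if it were the true effect size
p_sig_naive = 1 - norm.cdf(crit, loc=d1, scale=sd_d)        # ~0.58

# Posterior predictive distribution: variance doubled, 8/(n - 4)
sd_ppd = (2 * var_d) ** 0.5
p_sig_ppd = 1 - norm.cdf(crit, loc=d1, scale=sd_ppd)        # ~0.55

print(round(p_sig_naive, 2), round(p_sig_ppd, 2))
```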

What constitutes evidence of replication?

What if an ensuing replication found an effect size of 0.5? That is below your estimate of 0.74, and falls short of significance. Is this evidence for or against the original claim? It would probably be reported as a "failure to replicate." But that is misleading: if those data had been part of your original study, the increase of n would have more than compensated for the decrease in d, substantially improving the significance level of the results. The claim was for a causal factor, and the replication attempt, though not significant, returned evidence that (weakly) supports that claim. It is straightforward to compute the probability of finding supporting evidence of any strength in replication. The probability of a positive effect in replication is the area under the ppd to the right of 0. In this case that area is .94, suggesting a very good probability that in replication the result will not go the wrong way and contradict your original results. This is the basic version of prep.
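Continuing the numerical sketch, the basic prep is the area of the ppd above zero (same illustrative assumptions as above):

```python
from scipy.stats import norm

n, d1 = 40, 0.74
var_d = 4 / (n - 4)                   # sampling variance of d
sd_ppd = (2 * var_d) ** 0.5           # ppd standard deviation (variance doubled)

# Basic prep: probability that a same-powered replication comes back positive
prep = 1 - norm.cdf(0, loc=d1, scale=sd_ppd)   # equivalently norm.cdf(d1 / sd_ppd)
print(round(prep, 2))                 # ~0.94
```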

What constitutes a replication?

The above assumed that the only source of error was sampling variability. But there are other sources as well, especially in the most useful case of replication, a conceptual replication involving a different population of participants and different analytic techniques. Call this "random effects" variability realization variance, here σ²R. In social science research it is approximately σ²R = 0.08 across various research contexts. This noise reduces replicability, especially for studies with small effect sizes, by further increasing the spread of the ppd. The median value of σ²R = 0.08 limits all effect sizes less than 0.5 to prep < .90, no matter how many data they are based on. In the case of the above example, it reduces prep from .94 to .88, so that there is 1 chance in 8 that a conceptual replication will come back in the wrong direction.
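A sketch of how realization variance enters the calculation. To match the .88 and the prep < .90 bound quoted above, I assume here that σ²R is counted for both the original study and the replicate; treat that bookkeeping as an assumption rather than the author's definitive formula:

```python
from scipy.stats import norm

def prep(d1, n, var_r=0.0):
    """Probability of a same-sign effect in an equally powered replication."""
    var_d = 4 / (n - 4)                   # sampling variance of d
    var_ppd = 2 * var_d + 2 * var_r       # sampling error twice, realization variance twice (assumed)
    return norm.cdf(d1 / var_ppd ** 0.5)

print(round(prep(0.74, 40), 2))               # ~0.94, strict replication
print(round(prep(0.74, 40, var_r=0.08), 2))   # ~0.88, conceptual replication

# With sigma^2_R = 0.08, even unlimited data keep prep below .90 when d < 0.5:
print(round(prep(0.5, 10**6, var_r=0.08), 2))   # ~0.89
```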

What is the best predictor of replicability?

In the above example all that was known were the results of the experiment: we assumed "flat priors," a priori ignorance of the probable effect size. In fact, however, more than that is typically known, or suspected; the experiment comes from a research tradition in which similar kinds of effects have been studied. If the experiment had concerned the effect of the activation of a randomly chosen gene, or of a randomly chosen brain region, on a particular behavior, the prior distribution would be tightly centered close to 0, and the ppd would move down toward 0. If, however, the experiment is studying a large effect that had been reported by 3 other laboratories, the priors would be centered near their average effect size, and the ppd moved up toward them. The distance moved depends on the relative weight of evidence in the priors and in the current data. Exactly how much weight should be given to each is a matter of art and argument. The answer depends largely on which of the following three questions is on the table:

How should I evaluate this evidence? To avoid capricious and ever-differing evaluations of replicability of results due to diverse subjective judgments of the weight of more or less relevant priors, prep was presented for the case of flat, ignorance priors. This downplays precision of prediction in the service of stability and generality of evaluation; it decouples the evaluation of new data from the sins, and virtues, of their heritage. It uses only the information in the data at hand, or that augmented with a standardized estimate of realization variance.

What should I believe? Here priors matter: limiting judgment to only the data in hand is shortsighted. If a novel experiment provides evidence for extra-sensory pre-cognition, what you should believe should be based on the corpus of similar research, updated by the new data. In this case, it is likely that your priors will dominate what you believe.
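One simple way to formalize "updating by the new data" is precision weighting of a normal prior against the observed effect. This is a generic conjugate-normal sketch, not necessarily the exact weighting the author has in mind, and all numbers are illustrative:

```python
from scipy.stats import norm

def prep_with_prior(d1, n, prior_mean, prior_var, var_r=0.0):
    """prep when a normal prior on delta is combined with the observed effect."""
    var_d = 4 / (n - 4) + var_r                 # variance attached to the observed d1
    w = (1 / prior_var) / (1 / prior_var + 1 / var_d)
    post_mean = w * prior_mean + (1 - w) * d1   # precision-weighted estimate of delta
    post_var = 1 / (1 / prior_var + 1 / var_d)
    var_ppd = post_var + var_d                  # add the replicate's sampling (and realization) variance
    return norm.cdf(post_mean / var_ppd ** 0.5)

# A skeptical prior centered at 0 pulls prep down; an optimistic prior from
# three earlier labs reporting d near 0.8 pulls it up.
print(round(prep_with_prior(0.74, 40, prior_mean=0.0, prior_var=0.05), 2))
print(round(prep_with_prior(0.74, 40, prior_mean=0.8, prior_var=0.05), 2))
```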

What should I do? NHST is of absolutely no value in guiding action, as it gives neither the probability of the null nor of the alternative, nor can it give the probability of replication, which is central to planning. Prep is designed to predict replicability, and has been developed into a decision theory for action (Killeen 2006). Figure 2 displays a ppd and superimposed utility functions that describe the value, or utility, of various effect sizes. To compute expected value, integrate the product of the utility function with the probability of each outcome, as given by the ppd. The utility shown as dashed lines is 0 until effect size exceeds 0, then immediately steps to 1. Its expected value is prep, the area under the curve to the right of 0. Prep has a 1-to-1 relationship with the p-value. Thus, NHST (and prep when σ²R is 0) is intrinsically indifferent to the size of an effect, giving equal weighting to all positive effect sizes, and none to negative ones.


Figure 2. A candidate utility function is drawn as a power function of effect size (ogive). The value of an effect increases less than proportionately with its size. The expected utility of a future course of action is the probability of each particular outcome (the ppd) multiplied by the utility function—the integral of the product of the two functions. Because traditional significance tests give no weight to effect size, their implicit utility function is flat (dashed). If drawn as a line at -7 up to the origin, and then at 1 to the right, it sets a threshold for positive utility in replication at the traditional level α = 0.05. Other exponents for the utility function return other criteria, such as the Akaike criterion and the Bayesian information criterion. The ogive gives approximately equal weight to effect size and to replicability.

If the weight on negative effect sizes were -7, then the expected utility of an effect in replication would be negative for every ppd whose area to the left of the origin exceeded one seventh of the area to the right (that is, exceeded 1/8 of the total). This sets a criterion for positive action that is identical to the α = .05 criterion. Conversely, the traditional criterion α = .05 de facto sets the disutility of a false alarm at seven times the utility of a hit; α = .01 corresponds to a 19/1 valuation of false positives to true positives. This exposition thus rationalizes the α levels traditional in NHST.
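The arithmetic behind that rationalization can be checked directly, using the standard flat-prior relation between prep and the one-tailed p value, prep = Φ(z(1-p)/√2); the relation and the normal approximation are assumptions of this sketch:

```python
from scipy.stats import norm

def prep_from_p(p_one_tailed):
    """Flat-prior prep for an equal-powered replication, from the one-tailed p value."""
    return norm.cdf(norm.ppf(1 - p_one_tailed) / 2 ** 0.5)

for alpha in (0.05, 0.01):
    prep = prep_from_p(alpha)
    # A step utility of -w for negative effects and +1 for positive effects has
    # expected value zero exactly when w = prep / (1 - prep).
    implied_loss_ratio = prep / (1 - prep)
    print(alpha, round(prep, 3), round(implied_loss_ratio, 1))
# alpha = .05 gives prep near 7/8 and a loss ratio near 7;
# alpha = .01 gives prep near .95 and a loss ratio near 19.
```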

Economic valuations are never discontinuous like these step functions; rather they look more like the ogive shown in Figure 2, which is a power function of effect size. To raise expected utility above a threshold for action, such ogives require more accuracy—typically larger n—when effect sizes are small than does NHST; conversely, large effect sizes pass criteria with smaller values of n—and lower replicability. Depending on the exponent of the utility function, it will emulate traditional decision rules based on AIC, BIC, and the adjusted coefficient of determination. In the limits, as the exponent approaches 0 it returns the traditional step function of NHST, indifferent to effect size; as it approaches 1, only effect size, not replicability, matters. A power of 1/3 weights them approximately equally. Thus prediction, built upon the ppd, and modulated by the importance of potential effects, can guide behavior; NHST cannot.
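A numerical sketch of the expected-utility computation, integrating a power-function utility against the ppd. The particular utility form (positive effects valued as d raised to a power, negative effects valued at zero) is my illustrative assumption:

```python
import numpy as np
from scipy.stats import norm

def expected_utility(d1, n, exponent=1/3, var_r=0.0):
    """Numerically integrate utility(d) * ppd(d) over replication effect sizes."""
    var_ppd = 2 * (4 / (n - 4)) + 2 * var_r          # ppd variance (same assumed convention as above)
    d = np.linspace(-4.0, 4.0, 20001)                # grid of possible replication effect sizes
    dx = d[1] - d[0]
    ppd = norm.pdf(d, loc=d1, scale=var_ppd ** 0.5)  # posterior predictive density
    utility = np.zeros_like(d)                       # assumed power-function utility, zero below the origin
    pos = d > 0
    utility[pos] = d[pos] ** exponent
    return float(np.sum(utility * ppd) * dx)

# With exponent = 0 this utility is the 0/1 step whose expectation is prep;
# larger exponents weight the size of the effect more heavily.
for d1 in (0.2, 0.5, 0.74):
    print(d1, round(expected_utility(d1, n=40), 3))
```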

How reliable are predictions of replicability?

Does positive psychology enhance well-being, or ameliorate depressive symptoms? A recent meta-analysis of positive psychology interventions found a mean effect size of 0.3 for both dependent variables over 74 interventions (Sin and Lyubomirsky 2009). With an average of 58 individuals per condition, and setting σ²R = 0.08, prep is .88. Of the 74 studies, 65 should therefore have found a positive effect; 66 found a positive effect. Evaluation of other meta-analyses shows similarly high levels of accuracy for prep's predictions.

We may also predict that 1 of the studies in this ensemble should have gone the wrong way strongly (its prep > 0.85 for a negative effect). What if yours had been one of the 8 studies that showed no or negative effects? The most extreme negative effect had a prep of a severely misleading .88 (for negative replicates)! Prep gives an expected, average estimate of replicability (Cumming 2005); but it, like a p-value, typically has a high associated variance (Killeen 2007). It is because we cannot say beforehand whether you will be one of the unlucky few that some experts (e.g., Miller 2009) have disavowed the possibility of predicting replicability in general, and of individual research results in particular. Those with a more Bayesian perspective are willing to bet that your results will not be the most woeful of the 74, but rather closer to the typical. It is your money, to bet or hold; but as a practitioner, you must eventually recommend a course of action. Whereas reserving judgment is a traditional retreat of the academic, it can be an unethical one for the practitioner. Prep, used cautiously, provides a guide to action.

What else can be done with the ppd?

Replicability intervals. While more informative than p-values, confidence intervals are underused and generally poorly understood. Replicability intervals delimit the values within which a replication is likely to fall. 50% replicability intervals are approximately equal to the standard error of the statistic. These traditional measures of stability of estimation may be centered on the statistic, and de facto constitute the values within which replications will fall half the time.
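A small sketch of why the 50% replicability interval is roughly ± one standard error, using the normal ppd with doubled variance as above:

```python
from scipy.stats import norm

se = 1.0                          # standard error of the statistic (any units)
sd_ppd = (2 ** 0.5) * se          # ppd standard deviation for an equal-powered replication

half_width_50 = norm.ppf(0.75) * sd_ppd   # half-width of the central 50% of the ppd
print(round(half_width_50 / se, 3))       # ~0.95, i.e. roughly one standard error
```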

Multiple comparisons. If a number of comparisons have been performed, how do we decide whether the ensemble of results is replicable? We are appropriately warned against alpha inflation in such circumstances, and similar considerations affect prep. But some inferences are straightforward. If the tests are independent (as assumed, for example, in ANOVA), then the probability of a replication showing all effects to be in the same direction (or significant, etc.) is simply the product of the replicabilities of all the individual tests. The probability that at least one will again achieve your definition of replication is the complement of the product of the complements of each of the preps. Is there a simple way to recalibrate the replicability of one of k tests, post hoc? If all the tests asked exactly the same question—that is, constituted within-study replications—the probability that all would replicate is the focal prep raised to the kth power. This conservative adjustment is similar in spirit to the Šidák correction, and suitably reins in predictions of replicability for a post-hoc test.
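A minimal sketch of these combination rules (the example prep values are hypothetical):

```python
import math

preps = [0.94, 0.88, 0.80]   # hypothetical replicabilities of three independent tests

p_all_replicate = math.prod(preps)                       # all effects again in the same direction
p_at_least_one = 1 - math.prod(1 - p for p in preps)     # complement of the product of complements

k = len(preps)
p_post_hoc = 0.94 ** k       # conservative, Sidak-like recalibration of one focal prep

print(round(p_all_replicate, 3), round(p_at_least_one, 3), round(p_post_hoc, 3))
```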

Model comparison and longitudinal studies. Ashby and O'Brien (2008) have generalized the use of prep to the situation of multiple trials with a small number of participants, showing how to evaluate alternative models against different criteria (e.g., AIC, BIC). Their analysis is of special interest both to psychophysicists and to clinicians conducting longitudinal studies.

Diagnosticity vs. detectability. Tests can succeed in two ways: they can affirm when the state of the world is positive (a hit), and they can deny when it is negative (a correct rejection). Likewise they can fail in two ways: affirm when the state of the world is negative (a false alarm, a Type I error), and deny when it is positive (a miss, a Type II error). The detectability of a test is its hit rate; its diagnosticity is its correct rejection rate. Neither alone is an adequate measure of the quality of a test: detectability can be made perfect if we always affirm, driving diagnosticity to 0—we can detect 100% of children with ADHD if the test is "Do they move?" A Relative Operating Characteristic, or ROC, gives the hit rate as a function of the false alarm rate. The location on the curve gives the performance for a particular criterion. If the criterion for false alarms is set at α = .05, then the ordinate gives the power of the test. But that criterion is arbitrary. What is needed to evaluate a test is the information it conveys independently of the particular criterion chosen. The area under the ROC curve does just that: it measures the quality of the test independently of the criterion for action. Irwin (2009) has shown that this area is precisely the probability computed by prep: prep thus constitutes a criterion-free measure of the quality of a diagnostic test.
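As an illustration of a criterion-free area measure, here is a sketch that traces the ROC for an equal-variance Gaussian detection model and integrates it numerically; the Gaussian model and the d' value are illustrative assumptions, not data from the article:

```python
import numpy as np
from scipy.stats import norm

d_prime = 1.0                        # separation between "positive" and "negative" populations
criteria = np.linspace(-6, 6, 2001)  # sweep of decision criteria

hits = 1 - norm.cdf(criteria, loc=d_prime)   # hit rate at each criterion
fas = 1 - norm.cdf(criteria, loc=0.0)        # false alarm rate at each criterion

# Area under the ROC (hit rate as a function of false alarm rate), by trapezoids
order = np.argsort(fas)
auc = float(np.sum(np.diff(fas[order]) * (hits[order][1:] + hits[order][:-1]) / 2))

# For this model the area equals Phi(d'/sqrt(2)); the two numbers should agree (~0.76)
print(round(auc, 3), round(norm.cdf(d_prime / 2 ** 0.5), 3))
```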

Efficacy vs. effectiveness. In the laboratory an intervention may show significant effects of good size—its efficacy—but in the field its impact—its effectiveness—will vary, and will often disappoint. There are many possible reasons for this difference, such as differences in the skills of administering clinicians, the need to accommodate individuals with comorbidities, and so on. These variables increase realization variance, and thus decrease replicability. Finding that effectiveness is generally less than efficacy is but another manifestation of realization variance.

How can prep improve the advice given to patients?

What is the probability that a depressed patient will benefit from a positive psychology intervention? A representative early study found that group therapy was associated with a significant decrease in Beck Depression Inventory scores for a group of mildly to moderately depressed young adults. The effects were enduring, with an effect size of 0.6 at the 1-year posttest. Assuming a standard realization variance of 0.08, prep is .81. But that is for an equal-powered replication. What is the probability that your patient could benefit from this treatment? Here we are replicating with an n of 1, not 33. Instead of doubling the variance of the original study, we must add to it the variance of the sampling distribution for n = 1; that is, the standard deviation of effect size, 1. This returns a prep of 0.71. Thus, there is about a 70% chance that positive psychotherapy will help your patient for the ensuing year—which, while not great, may be better than the alternatives. Even when the posterior is based on all of the data in the meta-analysis, n > 4000, it does not change the odds for your patient, as that is here limited by the effect sizes for these interventions, and your case of 1. You nonetheless have an estimate to offer her, insofar as it may be in her interest.
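A sketch of the single-patient calculation. I assume the original study had 33 participants in total and that realization variance enters as before; these assumptions, and the exact variance bookkeeping, are mine and only approximately reproduce the .81 and .71 quoted above:

```python
from scipy.stats import norm

d1 = 0.6            # effect size at the 1-year posttest
n_original = 33     # assumed total n of the original study
var_r = 0.08        # standard realization variance

var_d = 4 / (n_original - 4)          # sampling variance of the original d

# Equal-powered replication: double the sampling variance, count realization variance twice
prep_study = norm.cdf(d1 / (2 * var_d + 2 * var_r) ** 0.5)

# Single patient: keep the original study's variance and add ~1.0 for the n = 1 "replication"
prep_patient = norm.cdf(d1 / (var_d + 1.0) ** 0.5)

# ~0.82 and ~0.71 under these assumptions, close to the quoted .81 and .71
print(round(prep_study, 2), round(prep_patient, 2))
```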

Why is prep controversial?

The original exposition contained errors (Doros and Geier 2005), later corrected (Killeen 2005b, 2007). Analyses show that prep is biased when used to predict the coincidence in the sign of the effects of two future experiments; strongly biased when the null is true (Miller and Schwarz

2011) or when the true effect size is stipulated. But prep is not designed to predict the coincidence of two future outcomes; it is designed to predict whether a replication will support an effect that has already been observed.
