Prep: the Probability of Replicating an Effect
Peter R. Killeen
Arizona State University
Killeen@asu.edu
Abstract
Prep gives the probability that an equally powered replication attempt will provide supportive evidence—an effect of the same sign as the original, or, if preferred, the probability of a significant effect in replication. Prep is based on a standard Bayesian construct, the posterior predictive distribution. It may be used in three modes: to evaluate evidence, to inform belief, and to guide action. In the first case the simple prep is used; in the second, it is augmented with estimates of realization variance and informed priors; in the third, it is embedded in a decision theory. Prep throws new light on replicability intervals, multiple comparisons, traditional α levels, and longitudinal studies. As the area under the diagnosticity vs. detectability curve, it constitutes a criterion-free measure of test quality.
The issue
The foundation of science is the replication of experimental effects. But most statistical analyses of experiments test, not whether the results are replicable, but whether they are unlikely if there were truly no effect present. This inverse inference creates many problems of interpretation that have become increasingly evident to the field. One of the consequences has been an uneasy relationship between the science of psychology and its practice. The irritant is the scientific inferential method: not the method of John Stuart Mill, Michael Faraday, or Charles Darwin, but that of Ronald Aylmer Fisher, Egon Pearson, and Jerzy Neyman. All were great scientists or statisticians, Fisher both. But they grappled with scientific problems on different scales of time, space, and complexity than do clinical psychologists. In all cases their goal was to epitomize a phenomenon with simple verbal or mathematical descriptions, and then to show that such a description has legs: that it explains data or predicts outcomes in new situations. But increasingly it is being realized that results in biopsychosocial research are often lame: effect sizes can wither to irrelevance with subsequent replications, and credible authorities claim that “most published research findings are false.” Highly significant effects can have negligible therapeutic value. Something in our methodology has failed. Must clinicians now turn away from such toxic “evidence-based” research, back to clinical intuition?
The historical context
In the physical and biological sciences precise numerical predictions can sometimes be made: the variable A should take the value a. A may have been a geological age, a vacuum permittivity, or the deviation of a planetary orbit. The more precise the experiment, the more difficult it is for errant theories to pass muster. In the behavioral sciences it is rare to be able to make the prediction A -> a. Our questions are typically not “does my model of the phenomenon predict the observed numerical outcome?”, but rather “is my candidate causal factor C really affecting the process?”; “Does childhood trauma increase the risk of adult PTSD?” It then becomes a test of two candidate models: no effect, A + C ≈ A -> a; or some effect, A + C -> a + c, where we cannot specify c beforehand, but rather prefer that it be large, and in a particular direction. Since typically we also cannot specify the baseline or control level a, we test to see whether the difference in the effects is reliably different from zero: testing experimental (A + C) and control (A) groups, and asking whether a + c = a; that is, does the difference in outcome between experimental and control groups equal zero: (a + c) - a = 0? Since the difference will almost always be different from 0 due to random variation, the question evolves to: is it sufficiently larger than 0 that we can have some confidence that the effect is real—that is, that it will replicate? We want to know whether (a + c) - a is larger than some criterion. How large should that criterion be?
It was for such situations that Fisher formalized and extended prior work into the analysis of variance (ANOVA). ANOVA estimates the background level of variability—error, or noise—by combining the variance within each of the groups studied, and asks whether the variability between groups (the treatment effect, or signal) sufficiently exceeds that noise. The signal-to-noise ratio is the F statistic. If the errors are normally distributed and the groups independent, then with no true effect (that is, all are drawn from the same population, so that A + C = A, and thus c ≈ 0) we can say precisely how often the F ratio will exceed a criterion α (alpha). If our treatment effect exceeds that value, it is believed to be unlikely that the assumption of “no effect” is true.
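For concreteness, here is a small Python sketch of that null-hypothesis logic (not from the original chapter; the group sizes and population parameters are arbitrary): both groups are drawn from the same population, so any F ratio that exceeds the α = .05 criterion is a false alarm, which should happen about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two groups drawn from the SAME population: the null hypothesis is true by construction.
control = rng.normal(loc=50, scale=10, size=20)
treated = rng.normal(loc=50, scale=10, size=20)

# F is the ratio of between-group (signal) to within-group (noise) variability.
F, p = stats.f_oneway(control, treated)
print(f"F = {F:.2f}, p = {p:.3f}")   # under the null, p < .05 on only about 5% of such runs
```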
Because of its elegance, robustness, and refinement over the decades, ANOVA and its variants are the most popular inferential statistics in psychology. These virtues derive from certain knowledge of the ideal case, the null hypothesis, with deviations being precisely characterized by p-values (significance levels).
But there are problems associated with the uncritical use of such null-hypothesis statistical tests (NHST), ones well known to the experts, and repeated anew to every generation of students (e.g., Krueger 2001). Among them: one cannot infer from NHST either the truth or falsity of the null hypothesis, nor can one infer the truth or falsity of the alternative. ANOVA gives the probability of the data assuming the null, not the probability of the null given the data (see, e.g., Nickerson 2000). Yet rejection of the null is de facto the purpose to which the results are typically put. Even if the null is (illogically) rejected, significance levels do not give a clear indication of how replicable a result is. It was to provide such a measure that prep, the probability of replication, was introduced (Killeen 2005a).
The logic of prep
Prep is a probability derived from a Bayesian posterior predictive distribution (ppd). Assume you have conducted a pilot experiment on a new treatment for alleviating depression, involving 20 control and 20 experimental participants, and found that the means and standard deviations were 40 (12) and 50 (15). The effect size, d, the difference of means divided by the pooled estimate of standard deviation (13.6), is a respectable 0.74. Your t-test reports p < .05, indicating that this result is unlikely under the null hypothesis. Is the result replicable? The answer depends on what you consider a replication to be, and on what you are willing to assume about the context of the experiment. First the general case, and then the particulars.
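Those summary statistics can be checked directly; a minimal sketch in Python (the variable names are mine):

```python
import numpy as np
from scipy import stats

n1 = n2 = 20
m1, s1 = 40.0, 12.0   # control mean and SD
m2, s2 = 50.0, 15.0   # experimental mean and SD

# Pooled standard deviation and Cohen's d
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d1 = (m2 - m1) / sp
print(f"pooled SD = {sp:.1f}, d = {d1:.2f}")        # 13.6, 0.74

# Two-sample t-test computed from the summary statistics
t = (m2 - m1) / (sp * np.sqrt(1 / n1 + 1 / n2))
p = 2 * stats.t.sf(t, df=n1 + n2 - 2)
print(f"t({n1 + n2 - 2}) = {t:.2f}, p = {p:.3f}")   # p < .05
```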
Most psychologists know that a sampling distribution gives the probability of finding a statistic, such as the effect size d, given the “true” value of the population parameter, δ (delta). Under the null, δ = 0: the sampling distribution, typically a normal or t-distribution, is centered on 0. If the experimental and control groups are the same size and sum to n, then the variance of the distribution is approximately 4/(n - 4). This is shown in Figure 1. The area to the right of the initial result, d1 = 0.74, is less than α = .05, so the result qualifies as significant. To generate a predictive distribution, move the sampling distribution from 0 to its most likely place. Given knowledge of only your data, that is the obtained effect size, d1 = 0.74. If this were the true effect size δ, then that shifted sampling distribution would also give the probability of a replication: the probability that it would be significant is the area under this distribution that lies to the right of the α cut-off, 0.675, approximately 58%. The probability of a replication returning in the wrong direction is the area under this curve to the left of 0—which equals the 1-tailed p-value of the initial study.
Figure 1. The curve centered on 0 is a sampling distribution for effect size, d, under the null hypothesis. Shifted to the right, it gives the predicted distribution of effect sizes in replications in the case that the true effect size, δ, equals the recorded effect size d1. Since we do not know that δ precisely equals d1, and since both the initial study and the replicate incur sampling error, the variance of the distribution is increased (doubled in the case of an equal-powered replication) to create the posterior predictive distribution (ppd), the intermediate distribution on the right. In the case of a conceptual rather than strict replication, additional realization variance is added, resulting in the lowest ppd. In all cases, the area under the curves to the right of the origin gives the probability of supportive evidence in replication.
If we knew that the true effect size was exactly δ = 0.74, no further experiments would be necessary. But we do not know what δ is; we can only estimate it from the original results. There are thus at least two sources of error: the sampling error in the original, and that in the ensuing replication. This leads to a doubling of the variance in prediction for a systematic replication attempt of the same power. The ppd is located at the obtained estimate of δ, namely d1, and has twice the variance of the sampling distribution—8/(n - 4). The resulting probability of achieving a significant effect in replication, the area to the right of 0.675, shrinks to 55%.
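A sketch of those two calculations, using the approximations quoted above (total n = 40, α cut-off of 0.675 in d units, sampling variance 4/(n - 4)):

```python
import numpy as np
from scipy.stats import norm

n, d1, d_crit = 40, 0.74, 0.675        # total N, observed effect size, alpha cut-off in d units
var_d = 4 / (n - 4)                    # approximate sampling variance of d

# Treating d1 as if it were the true delta: the shifted sampling distribution
p_sig_naive = norm.sf(d_crit, loc=d1, scale=np.sqrt(var_d))
print(f"P(significant replication | delta = d1): {p_sig_naive:.2f}")   # ~0.58

# Posterior predictive distribution: sampling error in both studies doubles the variance
p_sig_ppd = norm.sf(d_crit, loc=d1, scale=np.sqrt(2 * var_d))
print(f"P(significant replication | ppd):        {p_sig_ppd:.2f}")     # ~0.55
```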
What constitutes evidence of replication?
What if an ensuing replication found an effect size of 0.5? That is below your estimate of 0.74, and falls short of significance. Is this evidence for or against the original claim? It would probably be reported as a “failure to replicate.” But that is misleading: if those data had been part of your original study, the increase in n would have more than compensated for the decrease in d, substantially improving the significance level of the results. The claim was for a causal factor, and the replication attempt, though not significant, returned evidence that (weakly) supports that claim. It is straightforward to compute the probability of finding supporting evidence of any strength in replication. The probability of a positive effect in replication is the area under the ppd to the right of 0. In this case that area is .94, suggesting a very good probability that in replication the result will not go the wrong way and contradict your original results. This is the basic version of prep.
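In the same notation, the basic prep is simply the ppd area above zero; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

n, d1 = 40, 0.74
var_d = 4 / (n - 4)

# Basic prep: probability that an equal-powered replication yields an effect of the same sign
prep = norm.sf(0, loc=d1, scale=np.sqrt(2 * var_d))
# equivalently: norm.cdf(d1 / np.sqrt(2 * var_d))
print(f"prep = {prep:.2f}")   # ~0.94
```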
What constitutes a replication?
The above assumed that the only source of error was sampling variability. But there are other sources as well, especially in the most useful case of replication, a conceptual replication involving a different population of participants and different analytic techniques. Call this “random effects” variability realization variance, here σ²R. In social science research it is approximately σ²R = 0.08 across various research contexts. This noise reduces replicability, especially for studies with small effect sizes, by further increasing the spread of the ppd. The median value of σ²R = 0.08 limits all effect sizes less than 0.5 to prep < .90, no matter how many data they are based on. In the case of the above example, it reduces prep from .94 to .88, so that there is 1 chance in 8 that a conceptual replication will come back in the wrong direction.
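A sketch of the adjustment, assuming (as the numbers above imply) that realization variance is added to each study’s sampling variance before the doubling:

```python
import numpy as np
from scipy.stats import norm

def prep(d1, n, sigma2_R=0.0):
    """prep with optional realization variance added to each study's sampling variance."""
    var_d = 4 / (n - 4)
    return norm.cdf(d1 / np.sqrt(2 * (var_d + sigma2_R)))

print(f"{prep(0.74, 40):.2f}")           # ~0.94, strict (same-lab) replication
print(f"{prep(0.74, 40, 0.08):.2f}")     # ~0.88, conceptual replication

# With sigma2_R = 0.08, even unlimited data cannot push small effects much higher:
print(f"{prep(0.50, 10**6, 0.08):.2f}")  # ~0.89 < 0.90 for delta = 0.5
```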
What is the best predictor of replicability?
In the above example all that was known were the results of the experiment: we assumed “flat priors,” a priori ignorance of the probable effect size. In fact, however, more than that is typically known, or suspected; the experiment comes from a research tradition in which similar kinds of effects have been studied. If the experiment had concerned the effect of the activation of a randomly chosen gene, or of a randomly chosen brain region, on a particular behavior, the prior distribution would be tightly centered close to 0, and the ppd would move down toward 0. If, however, the experiment is studying a large effect that had been reported by 3 other laboratories, the priors would be centered near their average effect size, and the ppd moved up toward them. The distance moved depends on the relative weight of evidence in the priors and in the current data. Exactly how much weight should be given to each is a matter of art and argument. The answer depends largely on which of the following three questions is on the table:
How should I evaluate this evidence? To avoid capricious and ever-differing evaluations of replicability of results due to diverse subjective judgments of the weight of more or less relevant priors, prep was presented for the case of flat, ignorance priors. This downplays precision of prediction in the service of stability and generality of evaluation; it decouples the evaluation of new data from the sins, and virtues, of their heritage. It uses only the information in the data at hand, or that augmented with a standardized estimate of realization variance.
What should I believe? Here priors matter: limiting judgment to only the data in hand is shortsighted. If a novel experiment provides evidence for extra-sensory pre-cognition, what you should believe should be based on the corpus of similar research, updated by the new data. In this case, it is likely that your priors will dominate what you believe.
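As a rough illustration of how an informed prior shifts the prediction (the conjugate-normal form and the prior variance of 0.05 are my assumptions, not formulas from the text), the posterior mean is a precision-weighted average of the prior mean and the observed effect, and the ppd inherits both posterior uncertainty and new sampling error:

```python
import numpy as np
from scipy.stats import norm

d1, n = 0.74, 40
var_d = 4 / (n - 4)                      # sampling variance of the new datum

# A skeptical prior on delta, centered at 0 (e.g., for an ESP-like claim); variance is illustrative
prior_mean, prior_var = 0.0, 0.05

# Conjugate normal updating: precision-weighted average of prior and data
post_var = 1 / (1 / prior_var + 1 / var_d)
post_mean = post_var * (prior_mean / prior_var + d1 / var_d)

# ppd for an equal-powered replication: posterior uncertainty plus new sampling error
prep_informed = norm.cdf(post_mean / np.sqrt(post_var + var_d))
print(f"posterior mean = {post_mean:.2f}, prep = {prep_informed:.2f}")   # pulled down toward 0.5
# A flat prior (prior_var -> infinity) recovers the basic prep of ~0.94.
```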
What should I do? NHST is of absolutely no value in guiding action, as it gives neither the probability of the null nor of the alternative, nor can it give the probability of replication, which is central to planning. Prep is designed to predict replicability, and has been developed into a decision theory for action (Killeen 2006). Figure 2 displays a ppd and superimposed utility functions that describe the value, or utility, of various effect sizes. To compute expected value, integrate the product of the utility function with the probability of each outcome, as given by the ppd. The utility shown as dashed lines is 0 until effect size exceeds 0, then immediately steps to 1. Its expected value is prep, the area under the curve to the right of 0. Prep has a 1-to-1 relationship with the p-value. Thus NHST (and prep when σ²R is 0) is intrinsically indifferent to size of effect, giving equal weighting to all positive effect sizes, and none to negative ones.
Figure 2. A candidate utility function is drawn as a power function of effect size (ogive). The value of an effect increases less than proportionately with its size. The expected utility of a future course of action is the probability of each particular outcome (the ppd) multiplied by the utility function—the integral of the product of the two functions. Because traditional significance tests give no weight to effect size, their implicit utility function is flat (dashed). If drawn as a line at -7 up to the origin, and then at 1 to the right, it sets a threshold for positive utility in replication at the traditional level α = 0.05. Other exponents for the utility function return other criteria, such as the Akaike criterion and the Bayesian information criterion. The ogive gives approximately equal weight to effect size and to replicability.
If the weight on negative effect sizes were -7, then the expected utility of an effect in replication would be negative for all ppd whose area to the left of the origin was greater than 1/8. This sets a criterion for positive action that is identical to the α = .05 criterion. Conversely, the traditional criterion α = .05 de facto sets the disutility of a false alarm at seven times the utility of a hit, and α = .01 corresponds to a 19/1 valuation of false positives to true positives. This exposition thus rationalizes the α levels traditional in NHST.
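That correspondence can be checked numerically; a sketch assuming a step utility of +1 for a positive replicate and -loss for a negative one, applied to the ppd of the running example:

```python
import numpy as np
from scipy.stats import norm

def expected_step_utility(d1, n, loss=7.0):
    """Expected utility when a positive replicate is worth +1 and a negative one costs -loss."""
    q = norm.cdf(0, loc=d1, scale=np.sqrt(2 * 4 / (n - 4)))   # ppd area to the left of 0
    return (1 - q) - loss * q                                 # negative once q exceeds 1/(1 + loss)

print(f"running example: E[U] = {expected_step_utility(0.74, 40):.2f}")   # positive, so act

# The break-even point q = 1/(1 + loss) corresponds to a one-tailed p-value of the original study:
for loss in (7, 19):
    z_ppd = norm.ppf(1 - 1 / (1 + loss))         # d1 / sqrt(2 * var_d) at break-even
    p_one_tailed = norm.sf(z_ppd * np.sqrt(2))   # the same d1 scaled by sqrt(var_d) instead
    print(f"{loss}:1 loss ratio -> break-even at p ~ {p_one_tailed:.3f}")  # ~.05 and ~.01
```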
Economic valuations are never discontinuous like these step functions; rather they look more like the ogive shown in Figure 2, which is a power function of effect size. To raise expected utility above a threshold for action, such ogives require more accuracy—typically larger n—when effect sizes are small than does NHST; conversely, large effect sizes pass criteria with smaller values of n—and replicability. Depending on the exponent of the utility function, it will emulate traditional decision rules based on the AIC, the BIC, and the adjusted coefficient of determination. In the limits, as the exponent approaches 0 it returns the traditional step function of NHST, indifferent to effect size; as it approaches 1, only effect size, not replicability, matters. A power of 1/3 weights them approximately equally. Thus prediction, built upon the ppd and modulated by the importance of potential effects, can guide behavior; NHST cannot.
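A sketch of the continuous case, assuming the ogival utility is the signed power function u(d) = sign(d)·|d|^γ with γ = 1/3 (the exponent and integration grid are illustrative choices, not taken from the text):

```python
import numpy as np
from scipy.stats import norm

d1, n, gamma = 0.74, 40, 1 / 3
sd_ppd = np.sqrt(2 * 4 / (n - 4))

# Grid of possible replicate effect sizes; integrate utility x ppd numerically
d = np.linspace(-4, 4, 20001)
ppd = norm.pdf(d, loc=d1, scale=sd_ppd)
utility = np.sign(d) * np.abs(d) ** gamma      # value grows less than proportionately with size
expected_utility = np.sum(utility * ppd) * (d[1] - d[0])
print(f"expected utility = {expected_utility:.2f}")

# As gamma -> 0 this recovers the step function (and prep); as gamma -> 1 only effect size matters.
```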
How reliable are predictions of replicability?
Does positive psychology enhance well-being, or ameliorate depressive symptoms? A recent meta-analysis of positive psychology interventions found a mean effect size of 0.3 for both dependent variables over 74 interventions (Sin and Lyubomirsky 2009). With an average of 58 individuals per condition, and setting σ²R = 0.08, prep is .88. Of the 74 studies, 65 should therefore have found a positive effect; 66 found a positive effect. Evaluation of other meta-analyses shows similar high levels of accuracy for prep’s predictions.
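Treating the 74 studies as independent draws with prep ≈ .88, the predicted count of positive-direction results follows from the binomial; a quick check:

```python
from scipy.stats import binom

k, p_rep = 74, 0.88
print(f"expected positive results: {k * p_rep:.0f} of {k}")   # ~65; 66 were observed
lo, hi = binom.ppf([0.025, 0.975], k, p_rep)
print(f"95% prediction interval: {lo:.0f} to {hi:.0f} studies")
```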
We may also predict that 1 of the studies in this ensemble should have gone the wrong way strongly (its prep > 0.85 for a negative effect). What if yours had been one of the 8 studies that showed no or negative effects? The most extreme negative effect had a prep of a severely misleading .88 (for negative replicates)! Prep gives an expected, average estimate of replicability (Cumming 2005); but it, like a p-value, typically has a high associated variance (Killeen 2007). It is because we cannot say beforehand whether you will be one of the unlucky few that some experts (e.g., Miller 2009) have disavowed the possibility of predicting replicability in general, and of individual research results in particular. Those with a more Bayesian perspective are willing to bet that your results will not be the most woeful of the 74, but rather closer to the typical. It is your money, to bet or hold; but as a practitioner, you must eventually recommend a course of action. Whereas reserving judgment is a traditional retreat of the academic, it can be an unethical one for the practitioner. Prep, used cautiously, provides a guide to action.
What else can be done with the ppd?
Replicability intervals. While more informative than p-values, confidence intervals are underused and generally poorly understood. Replicability intervals delimit the values within which a replication is likely to fall. The half-width of a 50% replicability interval is approximately equal to the standard error of the statistic. These traditional measures of stability of estimation may be centered on the statistic, and de facto constitute the values within which replications will fall half the time.
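A sketch of that approximation for the running example; the half-width of the central 50% of the ppd works out to just under one standard error of d:

```python
import numpy as np
from scipy.stats import norm

d1, n = 0.74, 40
se_d = np.sqrt(4 / (n - 4))             # standard error of the effect size
sd_ppd = np.sqrt(2) * se_d              # ppd spread for an equal-powered replication

# Central 50% of the ppd: the interval within which half of replications should fall
half_width = norm.ppf(0.75) * sd_ppd    # = 0.954 * se_d, i.e. roughly one standard error
print(f"50% replicability interval: {d1 - half_width:.2f} to {d1 + half_width:.2f}")
print(f"half-width = {half_width:.2f}, SE = {se_d:.2f}")
```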
Multiple comparisons. If a number of comparisons have been performed, how do we decide whether the ensemble of results is replicable? We are appropriately warned against alpha inflation in such circumstances, and similar considerations affect prep. But some inferences are straightforward. If the tests are independent (as assumed, for example, in ANOVA), then the probability of a replication showing all effects to be in the same direction (or significant, etc.) is simply the product of the replicabilities of all the individual tests. The probability that none will again achieve your definition of replication is the product of the complements of each of the preps. Is there a simple way to recalibrate the replicability of one of k tests, post hoc? If all the tests asked exactly the same question (that is, constituted within-study replications), the probability that all would replicate is the focal prep raised to the kth power. This conservative adjustment is similar in spirit to the Šidák correction, and suitably reins in predictions of replicability for a post-hoc test.
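A sketch of these ensemble calculations (the example preps and k are arbitrary):

```python
import numpy as np

preps = np.array([0.94, 0.90, 0.85])      # replicabilities of k independent tests

p_all = np.prod(preps)                    # all effects again in the same direction
p_none = np.prod(1 - preps)               # none again in the same direction
p_at_least_one = 1 - p_none
print(f"all replicate: {p_all:.2f}; at least one replicates: {p_at_least_one:.3f}")

# Post-hoc recalibration of one focal test among k that asked the same question
k, prep_focal = 3, 0.94
print(f"adjusted prep: {prep_focal ** k:.2f}")   # 0.94**3 ~ 0.83
```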
Model comparison and longitudinal studies. Ashby and O’Brien (2008) have generalized the use of prep for the situation of multiple trials with a small number of participants, showing how to evaluate alternate models against different criteria (e.g., AIC, BIC). Their analysis is of special interest both to psychophysicists and to clinicians conducting longitudinal studies.
Diagnosticity vs detectability. Tests can succeed in two ways: they can affirm when the state of the world is positive (a hit), and they can deny when it is negative (a correct rejection). Likewise they can fail in two ways: affirm when the state of the world is negative (a false alarm, a Type I error), and deny when it is positive (a miss, a Type II error). The detectability of a test is its hit rate; its diagnosticity is its correct rejection rate. Neither alone is an adequate measure of the quality of a test: detectability can be perfect if we always affirm, driving the diagnosticity to 0—we can detect 100% of children with ADHD if the test is “Do they move?” A Relative Operating Characteristic, or ROC, gives the hit rate as a function of the false alarm rate. The location on the curve gives the performance for a particular criterion. If the criterion for false alarms is set at α = .05, then the ordinate gives the power of the test. But that criterion is arbitrary. What is needed to evaluate a test is the information it conveys independently of the particular criterion chosen. The area under the ROC curve does just that: it measures the quality of the test independently of the criterion for action. Irwin (2009) has shown that this area is precisely the probability computed by prep: prep thus constitutes a criterion-free measure of the quality of a diagnostic test.
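A numerical check of that equivalence under the equal-variance normal model of the running example (the construction below is my own illustration of Irwin’s result, not his derivation):

```python
import numpy as np
from scipy.stats import norm

d1, n = 0.74, 40
sd = np.sqrt(4 / (n - 4))              # sampling SD of d

# ROC for deciding "effect present" from a replicate's d: noise ~ N(0, sd), signal ~ N(d1, sd)
crit = np.linspace(-5, 5, 20001)       # sweep the decision criterion
false_alarms = norm.sf(crit, loc=0, scale=sd)
hits = norm.sf(crit, loc=d1, scale=sd)

# Area under the ROC by trapezoidal integration over the false-alarm axis
order = np.argsort(false_alarms)
auc = np.sum(np.diff(false_alarms[order]) * (hits[order][1:] + hits[order][:-1]) / 2)

prep = norm.cdf(d1 / (sd * np.sqrt(2)))
print(f"AUC = {auc:.3f}, prep = {prep:.3f}")    # both ~0.94
```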
Efficacy vs effectiveness. In the laboratory an intervention may show significant effects of good size—its efficacy—but in the field its impact—its effectiveness—will vary, and will often disappoint. There are many possible reasons for this difference, such as differences in the skills of administering clinicians, the need to accommodate individuals with comorbidities, and so on. These variables increase realization variance, and thus decrease replicability. Finding that effectiveness is generally less than efficacy is but another manifestation of realization variance.
How can prep improve the advice given to patients?
What is the probability that a depressed patient will benefit from a positive psychology intervention? A representative early study found that group therapy was associated with a significant decrease in Beck Depression Inventory scores for a group of mildly to moderately depressed young adults. The effects were enduring, with an effect size of 0.6 at 1-year posttest. Assuming a standard realization variance of 0.08, prep is .81. But that is for an equal-powered replication. What is the probability that your patient could benefit from this treatment? Here we are replicating with an n of 1, not 33. Instead of doubling the variance of the original study, we must add to it the variance of the sampling distribution for n = 1; that is, the standard deviation of effect size, 1. This returns a prep of 0.71. Thus, there is about a 70% chance that positive psychotherapy will help your patient for the ensuing year, which, while not great, may be better than the alternatives. Even when the posterior is based on all of the data in the meta-analysis, n > 4000, it does not change the odds for your patient, as those are here limited by the effect sizes for these interventions and by your case of 1. You nonetheless have an estimate to offer her, insofar as it may be in her interest.
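A sketch of those two calculations, assuming the original study’s total n was 33 and that the realization variance of 0.08 enters on the original study’s side (these placements are my reading of the passage, not stated formulas):

```python
import numpy as np
from scipy.stats import norm

d1, n, sigma2_R = 0.6, 33, 0.08
var_d = 4 / (n - 4)                        # sampling variance of d in the original study

# Equal-powered conceptual replication: double (sampling + realization) variance
prep_study = norm.cdf(d1 / np.sqrt(2 * (var_d + sigma2_R)))
print(f"prep, equal-powered replication: {prep_study:.2f}")     # compare with the .81 quoted above

# "Replication" in a single patient: add a variance of 1 for the n-of-1 case instead of doubling
prep_patient = norm.cdf(d1 / np.sqrt(var_d + sigma2_R + 1.0))
print(f"chance the patient benefits:     {prep_patient:.2f}")   # ~0.71
```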
Why is prep controversial?
The original exposition contained errors (Doros and Geier 2005), later corrected (Killeen 2005b, 2007). Analyses show that prep is biased when used to predict the coincidence in the sign of the effects of two future experiments; strongly biased when the null is true (Miller and Schwarz 2011) or when the true effect size is stipulated. But prep is not designed to predict the coincidence