Suppose now that instead of wanting to test a sample mean against a population mean, you would like to compare two sample means, each arising from independent groups, to see if they reasonably could have been drawn from the same population. For this, a two-sample t-test will be useful. We again borrow hypothetical data from Denis (2016), this time on grade (pass vs. fail) and minutes studied (studytime) for a seminar course, where "0" represents a failure in the course and "1" represents a pass. The null hypothesis we wish to evaluate is that the population means are equal, against a statistical alternative that they are unequal:
$$H_0\colon \mu_1 = \mu_2 \qquad \text{vs.} \qquad H_1\colon \mu_1 \neq \mu_2$$
The t‐test we wish to perform is the following:
$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
evaluated on $(n_1 - 1) + (n_2 - 1)$ degrees of freedom. Had our sample sizes been unequal, we would have pooled the variances, and hence our t-test would have been
$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where $s_p^2$ is equal to

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

Notice that under the situation of equal sample size per group, that is, $n_1 = n_2$, the ordinary two-sample t-test and the pooled version yield the same outcome (see the check below). If, however, sample sizes are unequal, then the pooled version should be used. Independent-samples t-tests assume that the populations in each group are normally distributed, that observations are independent, and that variances are homogeneous; the last of these can be assessed, as we will see, through Levene's test.
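To see why the two versions agree when $n_1 = n_2 = n$, note that the pooled variance reduces to the simple average of the two sample variances, so the denominators of the two t statistics coincide:

$$s_p^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2n - 2} = \frac{s_1^2 + s_2^2}{2}, \qquad s_p^2\left(\frac{1}{n} + \frac{1}{n}\right) = \frac{s_1^2}{n} + \frac{s_2^2}{n}.$$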
To perform the two‐sample t‐test in SPSS:
ANALYZE→ COMPARE MEANS→ INDEPENDENT‐SAMPLES T‐TEST
We move studytime over to the Test Variable(s) box and grade to the Grouping Variable box. Two "??" appear next to grade because SPSS requires us to specify the numbers that represent the group memberships being compared on the grouping variable. We click on Define Groups:
Make sure Use specified values is selected;
under Group 1, input a 0 (since 0 corresponds to those failing the course), and under Group 2, a 1 (since 1 corresponds to those passing the course).
Under Options, we again make sure a 95% confidence interval is selected, as well as excluding cases analysis by analysis.
T-TEST GROUPS=grade(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=studytime
  /CRITERIA=CI(.95).
Group Statistics

| grade | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| .00 | 5 | 37.4000 | 13.57571 | 6.07124 |
| 1.00 | 5 | 123.0000 | 33.09078 | 14.79865 |

SPSS provides us with some descriptive statistics above, including the sample size, mean, standard deviation, and standard error of the mean for each sample. We can see that the sample mean minutes studied of those who passed the course (123.0) is much higher than the sample mean minutes of those who did not pass (37.4).
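As a quick check on the last column, each standard error of the mean is simply the sample standard deviation divided by $\sqrt{n}$:

$$SE_0 = \frac{13.57571}{\sqrt{5}} \approx 6.07124, \qquad SE_1 = \frac{33.09078}{\sqrt{5}} \approx 14.79865.$$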
The actual output of the independent‐samples t‐test follows:
Independent Samples Test

| studytime | Levene's F | Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|---|---|
| Equal variances assumed | 3.541 | .097 | −5.351 | 8 | .001 | −85.60000 | 15.99562 | −122.48598 | −48.71402 |
| Equal variances not assumed | | | −5.351 | 5.309 | .003 | −85.60000 | 15.99562 | −126.00773 | −45.19227 |

(The F and Sig. columns report Levene's Test for Equality of Variances; the remaining columns report the t-test for Equality of Means, including the 95% Confidence Interval of the Difference.)
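As a cross-check outside SPSS (our own addition, not part of the chapter's workflow), both rows of this table can be reproduced from the group statistics alone. A minimal sketch using scipy; the df of 5.309 in the second row comes from the Welch–Satterthwaite approximation, computed explicitly below:

from scipy import stats

m0, s0, n0 = 37.4, 13.57571, 5    # grade = 0 (failed)
m1, s1, n1 = 123.0, 33.09078, 5   # grade = 1 (passed)

# Equal variances assumed (pooled): t = -5.351, p = .001 on df = 8
t_pooled, p_pooled = stats.ttest_ind_from_stats(m0, s0, n0, m1, s1, n1,
                                                equal_var=True)

# Equal variances not assumed (Welch): t = -5.351, p = .003
t_welch, p_welch = stats.ttest_ind_from_stats(m0, s0, n0, m1, s1, n1,
                                              equal_var=False)

# Welch-Satterthwaite degrees of freedom, which SPSS reports as 5.309
v0, v1 = s0**2 / n0, s1**2 / n1
df_welch = (v0 + v1)**2 / (v0**2 / (n0 - 1) + v1**2 / (n1 - 1))

print(t_pooled, p_pooled)   # approx. -5.351, 0.001
print(t_welch, p_welch)     # approx. -5.351, 0.003
print(df_welch)             # approx. 5.309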
We interpret the above output:
● The Levene's test for equality of variances is a test of the null hypothesis that the variances in each population (from which the samples were drawn) are equal. If the p-value is small (e.g. < 0.05), then we reject this null hypothesis and infer the statistical alternative that the variances are unequal. Since the p-value is equal to 0.097, we have insufficient evidence to reject the null hypothesis; hence, we can move along with interpreting the resulting t-test in the row equal variances assumed. (Note, however, that the variance in grade = 1 is quite a bit larger than the variance in grade = 0, almost six times as large, which under most circumstances would lead us to interpret the equal variances not assumed line. For our very small sample, however, Levene's test is likely underpowered to reject the null, so for consistency of our example, we interpret equal variances assumed.)
● Our obtained t is equal to −5.351, on 8 degrees of freedom (computed as 10 − 2), with an associated p-value of 0.001. That is, the probability of obtaining a mean difference such as the one we observed (−85.60) if the null hypothesis were true is approximately 0.001 (about 1 in 1000). Since such
a difference is so unlikely under the null hypothesis of no mean difference, we reject the null hypothesis and infer the statistical alternative hypothesis that there is a mean difference in the population or, equivalently, that the two sample means were drawn from different populations.
● SPSS then gives us the mean difference of −85.60, with a standard error of the difference of 15.996.
● The 95% Confidence Interval of the Difference is interpreted to mean that if we drew repeated samples from this population and constructed an interval in this way each time, 95% of such intervals would capture the true mean difference; here the interval runs from −122.49 to −48.71. We can see that the value of 0 is not included in the interval, which means we can reject the null hypothesis that the mean difference is equal to 0 (i.e. 0 lies outside the interval, which means it is not a plausible value of the population mean difference).
● Cohen's d, a measure of effect size, is computed as the difference in means divided by the pooled standard deviation, which yields 3.38, usually considered a very large effect (it corresponds to a correlation of approximately r = 0.86; see the worked computation below). Cohen (1988) suggested conventions of 0.2 as small, 0.5 as medium, and 0.8 as large, though how "big" an effect size is depends on the research area (see Denis (2016) for a discussion).
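As a check on the reported interval and effect size, both can be reconstructed from the group statistics (the critical value $t_{.975,\,8} = 2.306$ is from standard t tables):

$$-85.60 \pm t_{.975,\,8}(15.99562) = -85.60 \pm (2.306)(15.99562) \approx (-122.49,\ -48.71)$$

$$s_p = \sqrt{\frac{4(13.57571)^2 + 4(33.09078)^2}{8}} \approx 25.29, \qquad d = \frac{123.0 - 37.4}{25.29} \approx 3.38, \qquad r = \frac{d}{\sqrt{d^2 + 4}} \approx 0.86.$$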
An independent-samples t-test was conducted comparing the mean study time of those having passed (1) vs. failed (0) the course. The sample mean of those having passed was equal to 123.0, while the sample mean of those failing the course was 37.4. The difference was found to be statistically significant (p = 0.001, equal variances assumed). A 95% confidence interval was also computed, revealing that we could be 95% confident that the true mean difference lies between −122.49 and −48.71. An effect size measure was also computed. Cohen's d, computed as the difference in means divided by the pooled standard deviation, was equal to 3.38, which in most research settings is considered a very large effect. Cohen (1988) suggested conventions of 0.2 as small, 0.5 as medium, and 0.8 as large, though how "big" an effect size is depends on the research area (see Denis (2016) for a discussion).

There are also nonparametric alternatives to t-tests when assumptions are either not met, unknown, or questionable, especially if sample size is small. We discuss these tests in Chapter 14.
6 Power Analysis and Estimating Sample Size

When we speak of the power of a statistical test, informally, we mean its ability to detect an effect if there is in actuality an effect present in the population. An analogy will help. Suppose as a microbiologist, you place some tissue under a microscope with the hope of detecting a virus strain that is present in the tissue. Will you detect it? Only if your microscope is powerful enough to see it. Otherwise, even though the strain may be there, you will not see it. In brief then, you are going to need a sufficiently powerful tool (statistical test) in order to detect something that exists (e.g. virus strain), assuming it truly does exist.
The above analogy applies to basic research as well, in which we want to estimate a parameter in the population. If you wish to detect a population mean difference between males and females on the dependent variable of height, for instance, you need a sufficiently powerful test in order to do so. If your test lacks power, it will not be able to detect the mean difference even if there is in actuality a mean difference in the population. What this translates into statistically is that you will not be able to detect a false null hypothesis so long as you lack sufficient power to do so. Formally, we may define power as follows:
Statistical power is the probability of rejecting a null hypothesis given that it is false.
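In symbols, with $\beta$ denoting the probability of a Type II error (failing to reject a false null hypothesis):

$$\text{Power} = P(\text{reject } H_0 \mid H_0 \text{ false}) = 1 - \beta.$$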
How do we make sure our statistical tests are powerful? There are a few things that contribute to the power of a statistical test:
1) Size of effect – all else equal, if the size of effect is large, you will detect it more easily than if it is small. Hence, your statistical test will be more powerful if the size of effect is presumed to be large. In a two-sample t-test situation, as we have seen, the size of effect can be conceptualized as the distance between means (divided by a pooled standard deviation). All else equal, the greater the distance between means, the more powerful the test is to detect
such a difference. Effect sizes differ depending on the type of test we are conducting. As another example, when computing a correlation and testing it for statistical significance, the effect size in question is the size of the anticipated coefficient in the population. All else equal, power is greater for detecting larger correlations than smaller ones. If the correlation in the population is equal to 0.003, for instance, power to detect it will be difficult to come by, just as a very tiny strain under the microscope requires a very sensitive microscope to detect it.
2) Population variability – the less the variability (or "noise") in a population, the easier it will be to detect the effect, just as the splash a rock makes when hitting the water is easier to spot in calm waters than in turbulent ones. Population variability is usually estimated by variability in the sample.
3) Sample size – all else equal, the greater the sample size, the greater the statistical power.
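These determinants can be seen numerically. A small sketch, assuming the statsmodels library and purely illustrative inputs (none of these values come from the chapter), shows power rising with both effect size and sample size for a two-sided, two-sample t-test at α = 0.05:

from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d in (0.2, 0.5, 0.8):      # small, medium, large effects (Cohen, 1988)
    for n in (20, 50, 100):    # per-group sample sizes
        pw = solver.power(effect_size=d, nobs1=n, alpha=0.05, ratio=1.0)
        print(f"d = {d}, n per group = {n}: power = {pw:.2f}")
# Power rises along both dimensions: larger effects and larger samples
# are easier to "see," much like the microscope analogy.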
When it comes to power, then, since researchers have no true control over the size of effect they will find, and often cannot reduce population variability, increasing sample size is usually the preferred method for boosting power. Hence, discussions of adequate statistical power usually come down to estimating the requisite sample size to detect a given effect. For that reason, our survey of statistical power will center on estimating required sample size.
We move directly to demonstrating how statistical power can be estimated using G*Power, a popular software package specially designed for this purpose. In this chapter, we only survey power for such things as correlations and t-tests. In ensuing chapters, we at times include power estimation in our general discussion of the statistical technique. As we'll see, the principles are the same, even if the design is a bit different and more complex. Keep in mind that estimating power is typically only useful if you can compute it before you engage in the given study, so as to assure yourself that you have an adequate chance of rejecting the null hypothesis if indeed it turns out to be false.
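Although the chapter works in G*Power's point-and-click interface, the same calculation can be sketched in code. A minimal example, again assuming statsmodels and illustrative values for effect size, alpha, and desired power:

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group n needed to detect an assumed effect;
# the inputs below are illustrative assumptions, not chapter data.
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,        # assumed Cohen's d (a "medium" effect)
    alpha=0.05,             # two-tailed significance level
    power=0.80,             # desired probability of rejecting a false null
    ratio=1.0,              # equal group sizes
    alternative="two-sided",
)
print(round(n_per_group))   # about 64 per group

For an assumed effect of d = 0.5 at α = 0.05 and desired power of 0.80, this returns roughly 64 participants per group, which matches what G*Power reports for the same inputs.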