Power and effect size
In this section we return to two basic concepts which bear on interpreting ANOVA results: power and effect size. Power is the ability of a test to detect an effect if there is one. Expressed as a quantity, power ranges from 0 to 1, where .95 would mean a 5% chance of failing to detect an effect that is there. Although you might expect that in principle we’d be as demanding of power as we are of alpha, and thus routinely specify a power of .95, in fact common practice (adopted from Cohen’s influential 1988 book, Statistical Power Analysis for the Behavioral Sciences) is to work harder against Type I error than Type II, such that a power of .80 is generally considered acceptable. What that means is that if your test has a power of .80 or greater and you do not obtain a significant difference, then it’s reasonable to conclude that there is none.
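To make the idea concrete, here is a toy simulation in Python (the effect size, standard deviation, and sample size are all made up for illustration): power is just the proportion of repeated experiments, run with a real effect present, in which the test comes out significant at alpha = .05.

    # A toy simulation (made-up numbers) of what "power" means: the proportion
    # of repeated experiments, with a real effect present, in which the test
    # comes out significant at alpha = .05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, true_diff, sd, n_sims = 20, 5.0, 8.0, 10_000

    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0, sd, n)          # control group
        b = rng.normal(true_diff, sd, n)  # group with a genuine 5-point effect
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1

    print(f"Estimated power: {hits / n_sims:.2f}")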
Denenberg, a power guy, suggests this wording for using a nonsignificant (NS) test result to justify accepting the null hypothesis:

"We have designed an experiment with .8 probability of finding a significant difference, if such exists in the population. Because we failed to find a significant effect, we think it quite unlikely that one exists. Even if it does exist, its contribution would appear to be minimal."
How, in practice, do we control the power of a test? It’s obvious how we control alpha: we just say what level we require, and then we compare the result of the test with our a priori specification. StatView and SPSS are set up to allow you to do the same with respect to power: they give as output a supposed computed power of an analysis, which you can check against your criterion. For example, SPSS has it as an option (after DEFINE): Options – Display – Observed power.

However, this post hoc or “retrospective” consideration of power (“Was my experiment in fact powerful enough to have detected an effect?”) is apparently a bad idea. Retrospective power, calculated for an analysis using the effect size etc. of that analysis, basically gives the same information as the p-value. We aren’t supposed to use power retrospectively, to see how good our experiment was; instead we are supposed to use it prospectively, to figure out how to do the experiment.
What that means is figuring out how many subjects you need to run to guarantee a certain level of power. To do that, you have to have some estimate of the expected effect size. How? From previous, similar studies, or from your own pilot work. Given such an estimate, you can calculate, or look up, the minimum number of subjects needed. (There are programs to do this, even applets on the web.) This is something you then report in a proposal for an experiment: that your proposed number of subjects is sufficient given a power analysis. Afterwards you can make the sort of statement exemplified above, that your experiment was designed to have a given power.
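For example, for a one-way between-subjects design this calculation takes only a few lines in Python with statsmodels; the sketch below assumes a purely hypothetical pilot estimate of Cohen's f = .40 and solves for the total sample size needed to reach power = .80 (a repeated-measures design would need a different calculation).

    # A minimal sketch of a prospective power analysis, assuming a one-way
    # between-subjects design with 3 groups and a hypothetical pilot estimate
    # of Cohen's f = .40; solves for the total N needed to reach power = .80.
    from statsmodels.stats.power import FTestAnovaPower

    n_total = FTestAnovaPower().solve_power(effect_size=0.40,  # Cohen's f (assumed)
                                            alpha=0.05,
                                            power=0.80,
                                            k_groups=3)
    print(f"Total N needed (round up): {n_total:.1f}")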
Power depends in part on effect size, because the smaller an effect, the more observations you need to establish its existence. There are measures of effect size (Cohen’s d and f) used for calculating power and minimum sample size. These measures are NOT the same as the p-value (a smaller p-value does NOT mean a bigger effect, because p depends on the sample size as well as the effect size), nor the same as the raw difference between the sample means (because that doesn’t take into account the underlying population variability).
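As a toy illustration (the numbers are made up), here is Cohen's d computed as the mean difference scaled by the pooled standard deviation: the same 5-point difference in means gives a large or a small d depending on the underlying variability, and the sample size does not enter into the measure at all.

    # Cohen's d = (mean difference) / (pooled SD); illustrative numbers only.
    import numpy as np

    def cohens_d(m1, m2, sd1, sd2, n1, n2):
        pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
        return (m1 - m2) / pooled_sd

    print(cohens_d(55, 50, 5, 5, 20, 20))    # SD = 5  -> d = 1.0 (large effect)
    print(cohens_d(55, 50, 20, 20, 20, 20))  # SD = 20 -> d = 0.25 (small effect)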
However, different measures of effect size are generally used for retrospective consideration of the importance of an effect. For describing how meaningful your result is in terms of how large your effect is, the most popular measures are based on the degree of association between an effect and a dependent variable; when squared, these give the proportion of variance accounted for (they are generalizations of r and r², and of R and R², in correlation/regression analysis). Eta squared, the “correlation ratio”, is one such measure; for small effects it is about equal to Cohen’s effect size measure f². However, it estimates the proportion for the sample and therefore has a positive bias; omega squared is more complex but estimates the proportion for the population and is unbiased, and it seems to be the preferred measure.
However, SPSS instead gives the “partial Eta squared” measure, which is calculated a little differently from Eta squared so that it does not depend on how many factors there are: it gives the contribution of each factor or interaction taken as if it were the only variable, so that it is not masked by any more powerful variable. As a result, it comes out the highest of these three measures, and the values for the various factors can sum to more than 100%. See the tutorial by Young (1993) in JSHR 36: 644-56, “Supplementing Tests of Statistical Significance: Variation Accounted For”.
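To make the formulas concrete, here is a sketch in Python of how the three measures can be computed from an ANOVA table's sums of squares. The numbers are hypothetical, and the omega-squared formula shown is the one for a one-way between-subjects design (the repeated-measures version is more involved); the point is simply that partial Eta squared ignores the other factors' sums of squares and so comes out largest.

    # Eta squared, partial Eta squared, and omega squared from ANOVA sums of
    # squares (hypothetical values; omega squared uses the one-way
    # between-subjects formula).
    def effect_size_measures(ss_effect, ss_error, ss_total, df_effect, df_error):
        ms_error = ss_error / df_error
        eta_sq = ss_effect / ss_total                         # uses total SS of the whole analysis
        partial_eta_sq = ss_effect / (ss_effect + ss_error)   # ignores other factors' SS
        omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
        return eta_sq, partial_eta_sq, omega_sq

    # ss_total includes a second factor's SS here, so eta_sq < partial_eta_sq.
    print(effect_size_measures(ss_effect=120.0, ss_error=480.0, ss_total=900.0,
                               df_effect=2, df_error=27))
    # -> approximately (0.133, 0.200, 0.092)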
This measure (partial Eta squared, “the proportion of total variability attributable to a factor”) is given in the same output table as the power. Here is the output from the RM analysis of our file 3-level compact.xls:
Tests of Within-Subjects Effects
Measure: MEASURE_1

Source           Correction           Type III Sum of Squares   df      Mean Square   F       Sig.    Partial Eta Squared   Noncent. Parameter   Observed Power(a)
FACTOR1          Sphericity Assumed   928.533                   2       464.267       4.725   .044    .542                  9.449                .614
                 Greenhouse-Geisser   928.533                   1.269   731.910       4.725   .077    .542                  5.994                .454
                 Huynh-Feldt          928.533                   1.590   583.935       4.725   .060    .542                  7.513                .529
                 Lower-bound          928.533                   1.000   928.533       4.725   .095    .542                  4.725                .384
Error(FACTOR1)   Sphericity Assumed   786.133                   8       98.267
                 Greenhouse-Geisser   786.133                   5.075   154.916
                 Huynh-Feldt          786.133                   6.361   123.596
                 Lower-bound          786.133                   4.000   196.533

a  Computed using alpha = .05
Notice that the observed power, like p, depends on the corrected df; partial Eta squared does not, since it is computed from the sums of squares alone.
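As a check against the table above: partial Eta squared = SS(FACTOR1) / [SS(FACTOR1) + SS(Error(FACTOR1))] = 928.533 / (928.533 + 786.133) = .542, which is why it comes out the same in every row regardless of the df correction.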
The point of effect size is that sometimes a factor can have an effect which is statistically significant but small, so small that you have to wonder about its overall importance in determining behavior. This is a theoretical issue, however. A small effect may have important consequences if it distinguishes two models, while a large effect may not matter if it was already expected under any theory.
A suggested way of reporting effect size information when a variable has a significant effect but the effect is small:

“Although the ANOVA showed that the means were significantly different (F...= ...), the effect size was small to modest. The partial Eta squared was just .02, which means that the factor X by itself accounted for only 2% of the overall (effect+error) variance.”
A tutorial for speech scientists on effect size: T. Meline and J. F. Schmitt (1997), Case Studies for Evaluating Statistical Significance in Group Designs, American Journal of Speech-Language Pathology 6: 33-41 (not available through UCLA, but ask Pat for a copy).