Power and effect size


In this section we return to two basic concepts that bear on interpreting ANOVA results: power and effect size.  Power is the probability of detecting an effect if one is really there.  Expressed as a quantity, power ranges from 0 to 1; a power of .95 means a 5% chance of failing to detect an effect that exists.  Although you might expect that in principle we’d be as demanding of power as we are of alpha, and thus routinely specify a power of .95, in fact common practice (adopted from Cohen’s influential 1988 book, Statistical Power Analysis for the Behavioral Sciences) is to work harder against Type I error than Type II, so that a power of .80 is generally considered acceptable.  What that means is that if your test has a power of .80 or greater and you do not obtain a significant difference, then it is reasonable to conclude that there is none.
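
To make the definition concrete, here is a minimal simulation sketch of what a power of about .80 means: the long-run proportion of experiments that detect a real population difference at alpha = .05.  The group means, SD, and sample size are invented illustration values (they happen to give a two-group t test a power near .80).

```python
# Sketch: power estimated by simulation, as the proportion of experiments
# that detect a real difference at alpha = .05.  All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, alpha = 5000, 25, .05
hits = 0
for _ in range(n_sims):
    control = rng.normal(50, 10, n_per_group)   # true population mean 50, SD 10
    treated = rng.normal(58, 10, n_per_group)   # true mean 58: a real effect exists
    if stats.ttest_ind(control, treated).pvalue < alpha:
        hits += 1                               # the effect was detected this time
print(f"Estimated power: {hits / n_sims:.2f}")  # should land near .80
```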

 

Denenberg, a power advocate, suggests this wording for using a nonsignificant (NS) test to justify accepting the null hypothesis:

 

"We have designed an experiment with .8 probability of finding a significant difference, if such exists in the population.  Because we failed to find a significant effect, we think it quite unlikely that one exists.  Even if it does exist, its contribution would appear to be minimal."

 

How, in practice, do we control the power of a test?  It’s obvious how we control alpha: we simply state what level we require and then compare the result of the test with our a priori specification.  StatView and SPSS are set up to let you do the same with respect to power: they can report a computed power for an analysis, which you can check against your criterion.  For example, SPSS offers it as an option (after DEFINE): Options – Display – Observed power.

 

However, this post hoc or “retrospective” consideration of power (“Was my experiment in fact powerful enough to have detected an effect?”) is apparently a bad idea.  Retrospective power, calculated from the effect size and sample size of that very analysis, basically gives the same information as the p-value (see the sketch below).  We aren’t supposed to use power retrospectively, to see how good our experiment was; instead we are supposed to use it prospectively, to figure out how to do the experiment.
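
A small sketch of why this is so: for fixed degrees of freedom, each p-value maps onto exactly one “observed power” value, so reporting observed power adds nothing beyond p.  The df values below are arbitrary, and lambda = F × df(effect) mirrors the noncentrality convention that SPSS’s table (further down) appears to use.

```python
# Sketch: for fixed dfs, observed ("retrospective") power is just a
# transformation of the p-value.  The dfs are arbitrary illustration values;
# the noncentrality lambda = F * df_effect follows the SPSS-style convention.
from scipy import stats

df_effect, df_error, alpha = 2, 20, .05
f_crit = stats.f.ppf(1 - alpha, df_effect, df_error)

for p in (.20, .10, .05, .01):
    F = stats.f.ppf(1 - p, df_effect, df_error)    # the observed F that yields this p
    lam = F * df_effect                            # noncentrality estimated from the data
    power = 1 - stats.ncf.cdf(f_crit, df_effect, df_error, lam)
    print(f"p = {p:.2f}  ->  observed power = {power:.2f}")
```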

 

What that means is figuring out how many subjects you need to run to achieve a certain level of power.  To do that, you have to have some estimate of the expected effect size.  How?  From previous, similar studies, or from your own pilot work.  Given such an estimate, you can calculate, or look up, the minimum number of subjects needed; there are programs to do this, even applets on the web (a sketch is given below).  This is something you then report in a proposal for an experiment: that your proposed number of subjects is sufficient given a power analysis.  Afterwards you can make the sort of statement exemplified above, that your experiment was designed to have a given power.
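
For example, here is a minimal sketch of such a calculation, assuming the Python statsmodels package is available (G*Power or a web applet would do the same job).  The design and the planning effect size (Cohen’s f = .25, a “medium” effect, for a three-group between-subjects ANOVA) are invented for illustration, not taken from any study discussed here.

```python
# Sketch of a prospective power analysis: how many subjects are needed
# for power = .80 in a one-way, three-group between-subjects design?
# The planning effect size (Cohen's f = .25) is a made-up value; in
# practice it would come from pilot work or earlier studies.
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
n_total = analysis.solve_power(effect_size=0.25, alpha=.05, power=.80, k_groups=3)
print(f"Total N needed: {n_total:.0f}  (about {n_total / 3:.0f} per group)")
```

The resulting total N is what you would cite in the proposal as sufficient for power = .80 at alpha = .05.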

 

Power depends in part on effect size, because the smaller an effect is, the more observations you need to establish its existence.  There are measures of effect size (Cohen’s d and f) used for calculating power and minimum sample size.  These measures are NOT the same as the p-value (a smaller p-value does NOT mean a bigger effect, because p depends on the sample size as well as the effect size), nor the same as the raw difference between the sample means (because that doesn’t take the underlying population variability into account).  The sketch below illustrates the first point.
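
A minimal simulation sketch of that distinction: the (made-up) populations below always differ by half a standard deviation (Cohen’s d = 0.5), yet the p-value keeps shrinking as the sample size grows.

```python
# Sketch: the sample Cohen's d fluctuates around the true value (0.5),
# while p shrinks steadily as n grows.  All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_a, mu_b, sd = 100.0, 105.0, 10.0      # true standardized difference d = 0.5

for n in (10, 100, 1000):
    a = rng.normal(mu_a, sd, n)
    b = rng.normal(mu_b, sd, n)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled_sd            # Cohen's d
    p = stats.ttest_ind(a, b).pvalue
    print(f"n = {n:4d}   d = {d:5.2f}   p = {p:.4f}")
```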

 

However, different measures of effect size are generally used for retrospective assessment of the importance of an effect.  For describing how meaningful your result is in terms of how large your effect is, the most popular measures are based on the degree of association between an effect and the dependent variable; when squared, these give the proportion of variance accounted for (they are generalizations of r and r2, and of R and R2, in correlation/regression analysis).  Eta squared, the “correlation ratio”, is one such measure; for small effects it is about equal to Cohen’s effect size measure f2.  However, it is an estimate for the sample and therefore has a positive bias.  Omega squared is more complex, but it estimates the proportion for the population and is essentially unbiased; it seems to be the preferred measure.  SPSS, however, gives instead the “partial Eta squared” measure, which is calculated a little differently from Eta squared so that it does not depend on how many factors there are: it gives the contribution of each factor or interaction as if it were the only variable, so that it is not masked by any more powerful variable.  As a result it comes out the highest of these three measures, and the values for the various factors can sum to more than 100%.  The formulas for all three are sketched below.  See the tutorial by Young (1993) in JSHR 36: 644-56, “Supplementing Tests of Statistical Significance: Variation Accounted For”.
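
Here is a minimal sketch of how the three measures are computed, using made-up sums of squares from a hypothetical one-way between-subjects ANOVA.  In a one-way design Eta squared and partial Eta squared coincide, because the total SS contains nothing but the effect and the error; they diverge (and partial Eta squared comes out larger) as soon as other factors contribute to the total SS.

```python
# Sketch: the three variance-accounted-for measures, computed from the sums
# of squares of a one-way between-subjects ANOVA.  The SS and df values are
# made-up illustration numbers, not from any data set discussed here.

ss_effect, ss_error = 200.0, 800.0
df_effect, df_error = 2, 27
ss_total = ss_effect + ss_error        # one-way design: only these two sources
ms_error = ss_error / df_error

eta_sq = ss_effect / ss_total                        # sample estimate; positively biased
partial_eta_sq = ss_effect / (ss_effect + ss_error)  # what SPSS reports; other factors'
                                                     # SS are excluded from the denominator
omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)  # population estimate

print(f"eta^2 = {eta_sq:.3f}   partial eta^2 = {partial_eta_sq:.3f}   omega^2 = {omega_sq:.3f}")
```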

 

This measure (partial Eta squared, “the proportion of total variability attributable to a factor”) is given in the same output table as the power.  Here is the output from the RM analysis of our file 3-level compact.xls:

 

Tests of Within-Subjects Effects
Measure: MEASURE_1

Source                     Type III Sum    df      Mean       F      Sig.   Partial Eta   Noncent.    Observed
                           of Squares              Square                   Squared       Parameter   Power(a)
FACTOR1
  Sphericity Assumed          928.533      2       464.267   4.725   .044      .542         9.449       .614
  Greenhouse-Geisser          928.533      1.269   731.910   4.725   .077      .542         5.994       .454
  Huynh-Feldt                 928.533      1.590   583.935   4.725   .060      .542         7.513       .529
  Lower-bound                 928.533      1.000   928.533   4.725   .095      .542         4.725       .384
Error(FACTOR1)
  Sphericity Assumed          786.133      8        98.267
  Greenhouse-Geisser          786.133      5.075   154.916
  Huynh-Feldt                 786.133      6.361   123.596
  Lower-bound                 786.133      4.000   196.533

a  Computed using alpha = .05

 

Notice that the observed power, like p, depends on the corrected df; partial Eta squared does not, since the sums of squares are unchanged by the correction.
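
To make the table’s last three columns concrete, here is a minimal sketch that recomputes partial Eta squared and observed power from the SS, df, and F values above, assuming (as the Noncent. Parameter column indicates) that the noncentrality is F × df for the effect.  Partial Eta squared should come out at .542 in every row, while the power values should track .614, .454, .529, and .384, up to rounding.

```python
# Sketch: recomputing partial Eta squared and observed power from the
# values printed in the SPSS table above.
from scipy import stats

ss_effect, ss_error, F, alpha = 928.533, 786.133, 4.725, .05
print(f"partial eta^2 = {ss_effect / (ss_effect + ss_error):.3f}")   # same in every row

# observed power under each df correction (noncentrality lambda = F * df_effect)
for label, df1, df2 in [("Sphericity Assumed", 2.000, 8.000),
                        ("Greenhouse-Geisser", 1.269, 5.075),
                        ("Huynh-Feldt",        1.590, 6.361),
                        ("Lower-bound",        1.000, 4.000)]:
    lam = F * df1
    f_crit = stats.f.ppf(1 - alpha, df1, df2)
    power = 1 - stats.ncf.cdf(f_crit, df1, df2, lam)
    print(f"{label:18s}  noncent = {lam:6.3f}   observed power = {power:.3f}")
```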

 

 

The point of effect size is that sometimes a factor can have an effect that is statistically significant but small, so small that you have to wonder about its overall importance in determining behavior.  Whether such an effect matters is a theoretical issue, however: a small effect may have important consequences if it distinguishes two models, while a large effect may not matter much if it was already expected under every theory.

 

A suggested way of reporting effect size information when a variable has a significant effect but the effect is small:

 

“Although the ANOVA showed that the means were significantly different (F...= ...), the effect size was small to modest.  The partial Eta squared was just .02, which means that the factor X by itself accounted for only 2% of the overall (effect+error) variance.”

 


 

A tutorial for speech scientists on effect size: T. Meline and J. F. Schmitt (1997), Case Studies for Evaluating Statistical Significance in Group Designs, American Journal of Speech-Language Pathology 6: 33-41 (not available through UCLA, but ask Pat for a copy).