On Power and Science

Power analysis helps you to design a perfect study to answer the question "is my model better than random, p<0.05?" But is that all we wanted to know in science?

Russ Lenth’s power and sample-size page, via Andrew Gelman.  

 

Java applets for power and sample size

Advice

Here are two very wrong things that people try to do with my software:
  • Retrospective power (a.k.a. observed power, post hoc power).  You’ve got the data, did the analysis, and did not achieve "significance."  So you compute power retrospectively to see if the test was powerful enough or not.  This is an empty question.  Of course it wasn’t powerful enough — that’s why the result isn’t significant.  Power calculations are useful for design, not analysis.
    (Note: These comments refer to power computed based on the observed effect size and sample size.  Considering a different sample size is obviously prospective in nature.  Considering a different effect size might make sense, but probably what you really need to do instead is an equivalence test; see Hoenig and Heisey, 2001.)

  • Specify T-shirt effect sizes ("small", "medium", and "large").  This is an elaborate way to arrive at the same sample size that has been used in past social science studies of large, medium, and small size (respectively).  The method uses a standardized effect size as the goal.  Think about it: for a "medium" effect size, you’ll choose the same n regardless of the accuracy or reliability of your instrument, or the narrowness or diversity of your subjects.  Clearly, important considerations are being ignored here.  "Medium" is definitely not the message!
Here are three very right things you can do:
  • Use power prospectively for planning future studies.  Software such as is provided on this website is useful for determining an appropriate sample size, or for evaluating a planned study to see if it is likely to yield useful information.

  • Put science before statistics.  It is easy to get caught up in statistical significance and such; but studies should be designed to meet scientific goals, and you need to keep those in sight at all times (in planning and analysis).  The appropriate inputs to power/sample-size calculations are effect sizes that are deemed clinically important, based on careful considerations of the underlying scientific (not statistical) goals of the study.  Statistical considerations are used to identify a plan that is effective in meeting scientific goals — not the other way around.
  • Do pilot studies.  Investigators tend to try to answer all the world’s questions with one study.  However, you usually cannot do a definitive study in one step.  It is far better to work incrementally.  A pilot study helps you establish procedures, understand and protect against things that can go wrong, and obtain variance estimates needed in determining sample size.  A pilot study with 20-30 degrees of freedom for error is generally quite adequate for obtaining reasonably reliable sample-size estimates. 

References:

Lenth, R. V. (2001), "Some Practical Guidelines for Effective Sample Size Determination," The American Statistician, 55, 187-193.

Hoenig, John M. and Heisey, Dennis M. (2001), "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis," The American Statistician, 55, 19-24.

Russ Lenth left a comment (the first comment) on Gelman’s post. He complained that "people in the pharmaceutical industry and in the social sciences drive me batty, but for opposite reasons." The former is not a concern here. But social scientists, he says, insist on (a) doing everything on the standardized scale and (b) calculating power on Cohen’s T-shirt scale. He asked:

I am not deeply involved in social-science applications. But what has recently occurred to me is that perhaps the reason my anti-Cohen opinions are such an uphill battle is that maybe social scientists believe that standardization is the true path toward such ideals as "objectivity" and "validity" — that somehow, making a judgment about the actual size of an effect, on the actual scale of measurement, is somehow wrong. Is this true? I’d be interested in hearing from people who do serious social-science work.

I don’t know the answer… but the question seems to presuppose that all we do in the social sciences is significance tests. Lenth strongly advocates "putting science before statistics." But what if the science demands no significance testing at all?

In a comment on Kevin’s comment on my previous post, I said that "power analysis is probably the worst remedy for the problem." The problem I referred to is the confusion of statistical significance with practical significance. If I understand correctly, Lenth would advise the following:

  1. Determine what counts as a practically (clinically, in his words) significant difference PRIOR to conducting the study.
  2. Carry out the power analysis: determine the sample size N that gives you an 80% chance (or whatever other criterion you choose) of detecting that difference with a statistically significant result.
  3. Collect the data.
  4. Do your normal significance test. If it’s significant, conclude that the data support this practically/clinically significant difference; otherwise, say no.
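Step 2 can be sketched in a few lines. This is only an illustration under stated assumptions, not Lenth’s software: it assumes a two-sided comparison of two group means, uses the normal approximation (a t-based calculation gives a slightly larger n), and the function name `n_per_group` and the numbers fed to it are made up for the example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation).

    delta: smallest raw difference deemed practically important (assumed input)
    sigma: within-group standard deviation, e.g. from a pilot study (assumed input)
    """
    z = NormalDist()                      # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value of the two-sided test
    z_power = z.inv_cdf(power)            # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Same practically important difference (5 points), two hypothetical instruments.
print(n_per_group(delta=5, sigma=10))   # more reliable instrument
print(n_per_group(delta=5, sigma=20))   # noisier instrument
```

With the difference fixed on the raw scale, the noisier instrument needs about four times the sample (63 versus 252 per group here). Fixing a standardized "medium" effect instead would return the same n in both cases, which is exactly Lenth’s complaint about T-shirt sizes.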

Gelman acknowledged that this is not the typical way power analysis is done in the social sciences. Often you start with a fixed sample size (a constraint that is usually hard to overcome) and then find out how small a difference the design has the power to pick up. In other words, there is no agreed-upon standard of what is a practically/clinically important effect size.
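The fixed-N version of the calculation simply solves the same relationship for the difference instead of for N. A rough sketch, again assuming a two-sided, two-sample comparison of means under the normal approximation; `min_detectable_diff` is a hypothetical helper name and the sample sizes are arbitrary:

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_diff(n_per_group, sigma, alpha=0.05, power=0.80):
    """Smallest raw difference a two-group design can detect with the
    given power (normal approximation, equal group sizes assumed)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum * sigma * sqrt(2 / n_per_group)


# With sigma = 1 the result is expressed in standard-deviation units.
for n in (30, 100, 1000):
    print(n, round(min_detectable_diff(n, sigma=1.0), 3))
```

Nothing in this calculation says whether the detectable difference matters scientifically. It only reports what the design can see, and a large enough N can see almost anything.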

In all fairness to us social scientists, there cannot be a single authority that dictates what’s "significant" and what’s not in science. (Is NIH close? Is peer review, as self-policing?) Still, this lack of a standard of practical importance has two consequences:

  • One is the religious belief that statistical significance is the absolute authority. If you can make even the smallest difference statistically significant, you are automatically a winner.
  • It also reinforces the implicit assumption that statistical significance is what science is all about. Only ask questions that can be understood by SPSS (no, SAS is for pharmaceuticals). If you can’t play 20 questions with nature and win, play again.
Of the two, I worry more about the latter. The former generates a lot of junk studies, but the latter leaves really interesting scientific questions unasked, questions other than "is my model better than random, p<0.05?"
