A thin red line: the distance between statistical significance and insignificance

The difference between "statistically significant" and "not statistically significant" is not in itself necessarily statistically significant, says Andrew Gelman. It’s a message we ought to drill in, hammer down, and drive home in every intro stats course. But unfortunately psychology believes in the magic number .05 more than anything (or anybody) else.

It is common in applied research–in the last couple of weeks, I have seen this mistake made in a talk by a leading political scientist and a paper by a psychologist–to compare two effects, from two different analyses, one of which is statistically significant and one which is not, and then to try to interpret/explain the difference. Without any recognition that the difference itself was not statistically significant.

This is a surprisingly common mistake. The two effects seem sooooo different, that it is hard for people to even think that their difference might be explained purely by chance.
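
To make the point concrete, here is a minimal sketch in Python. The numbers are invented for illustration (they are not from either of the studies mentioned above), and the sketch assumes the two estimates are independent and approximately normal, so the standard error of their difference is the square root of the sum of the squared standard errors.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_test(estimate, se):
    """Return (z, p) for testing whether an estimate differs from zero."""
    z = estimate / se
    return z, two_sided_p(z)

def difference_test(est_a, se_a, est_b, se_b):
    """Return (z, p) for the difference of two independent estimates."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # assumes independence
    return z_test(diff, se_diff)

# Hypothetical effects: study A finds 25 +/- 10, study B finds 10 +/- 10.
z_a, p_a = z_test(25.0, 10.0)                        # z = 2.5,  p ~ 0.012
z_b, p_b = z_test(10.0, 10.0)                        # z = 1.0,  p ~ 0.317
z_d, p_d = difference_test(25.0, 10.0, 10.0, 10.0)   # z ~ 1.06, p ~ 0.29

print(f"A alone:    z = {z_a:.2f}, p = {p_a:.3f}")
print(f"B alone:    z = {z_b:.2f}, p = {p_b:.3f}")
print(f"A minus B:  z = {z_d:.2f}, p = {p_d:.3f}")
```

Study A clears the .05 bar and study B does not, yet the difference between them is nowhere near significant.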

Don’t have time today to write more on this, but I think it would be cool to do a different kind of meta-analysis to see how many debates in psycholinguistics relied on "significance" that was actually not significant.

I know why nobody else did this — it’s called academic suicide. 

2 Responses to “A thin red line: the distance between statistical significance and insignificance”

  1. Kevin Miller Says:

    In the Cohen & Cohen Regression book they have a nice test for the difference between two (dependent or independent) correlations that I’ve used (i.e., does A predict B better than C does). So it’s an easy problem to solve if you want to accept it as stated, and I’m not sure I do.
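
For the independent case, one standard way to compare two correlations is the Fisher r-to-z test. The sketch below is that textbook test, not necessarily the exact procedure in Cohen & Cohen, and it does not cover dependent correlations, which need a different formula. The correlations and sample sizes are made up for illustration.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from independent samples.

    Returns (z, p) with a two-sided p-value from the normal approximation.
    """
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical samples: r = .40 (n = 50) versus r = .20 (n = 50).
z, p = compare_independent_correlations(0.40, 50, 0.20, 50)
print(f"z = {z:.2f}, p = {p:.3f}")  # z ~ 1.07, p ~ 0.28
```

With n = 50, r = .40 is significant on its own and r = .20 is not, but the test on their difference comes out well above .05.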

    An alpha level of .05 is pretty conservative, by design. Is that the criterion we want to use to decide whether two experimental results differ from each other? It may be, but I worry that cascading of very conservative decision-making criteria will mean that nothing is ever different from anything else.

  2. gary Says:

    Good point on the C&C. Nonetheless, what worries me the most is when the comparison is between studies, one significant and one not. You can’t do stats unless you have access to the original data from both studies. There might be some clever things people could do, sort of like meta-analysis. But the point is, nobody does it. Theoretical debate is a matter of significance vs nonsignificance. End of story.

    Plus, whatever significance test you do (such as the C&C), you still cannot get around the potential problem Gelman raised.

    Someone commented on Gelman’s original page, suggesting the solution is power analysis. It might be from the stats point of view: increase N until things become significant. But in my mind it’s a Sapir-Whorf problem, and the word “significant” is to blame. Not only do we take “statistical significance” as categorical (the essence of Gelman’s argument), but we also take it as “practical significance.” Somebody famous must have made similar comments; I cannot remember whom to attribute this to.

    This relates to Kevin’s comment on whether 0.05 is too conservative. I can’t justify fully but my gut feeling is, no. And if anything, it’s too liberal. Let’s see if I can explain.

    I think we psychologists are just too good at cutting to the chase and going straight to testing hypotheses, without “understanding” a problem. In my own research, word frequency is almost always a significant predictor in any comparison of fixation duration (or reaction time) in reading. Big deal. Many prominent models are built around it. But how much variance does it explain? Less than 1/100. It may go down to 1/1000 and you still get a significant finding. What does it tell us about fixation duration? Practically nothing. If you are to bet on how long the next fixation will be, you do just as well not knowing the frequency of the next word.
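
A rough sketch of that arithmetic, assuming a single predictor, a hypothetical sample of 10,000 fixations, and a normal approximation to the t distribution (reasonable at this sample size). The proportions of variance are the illustrative 1/100 and 1/1000 from the paragraph above, not measured values.

```python
import math

def significance_of_r_squared(r_squared, n):
    """Return (t, p) for a single predictor explaining r_squared of the variance.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and a two-sided normal
    approximation for p, which is adequate when n is large.
    """
    r = math.sqrt(r_squared)
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r_squared)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

def rmse_reduction(r_squared):
    """Fractional drop in prediction error (RMSE) from using the predictor."""
    return 1.0 - math.sqrt(1.0 - r_squared)

n = 10_000  # hypothetical number of fixations
for r2 in (0.01, 0.001):
    t, p = significance_of_r_squared(r2, n)
    gain = rmse_reduction(r2)
    print(f"R^2 = {r2}: t = {t:.2f}, p = {p:.2g}, RMSE shrinks by {gain:.2%}")
```

At R² = 0.001 the predictor is still significant (t is about 3.16, p about 0.0016), while the typical prediction error shrinks by about 0.05%.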

    Contrast psychology’s approach with an (imagined) engineering one, where your goal is to predict the next fixation. Here you need to set priorities: look for the factors that explain the most variance (or whatever else reduces uncertainty), not just the ones you happen to lay your eyes on. Word frequency would probably be item number 20 down the list.

    What does this have to do with significance tests? Not much. The goal wouldn’t be to find whether two things are different. Rather, it is to identify things that are relevant to the issue, in order of importance.

    This also has nothing to do with power analysis: with large enough N, anything can be significant. In fact, power analysis is probably the worst remedy for the problem Gelman raised.

    I guess it comes down to a gut feeling of discomfort when I sweep a great portion of the variance under the table labelled “MSE” and pretend that what’s left in the anova or regression or whatever model contains the essence of the phenomenon. How do I convince myself that it is indeed the essence, everything that needs to be known?

    “It’s better than random, p < .05?”
