The editors of the official journal of the American Statistical Association have driven a stake through the heart of a monster of their own creation. The monster is the test of statistical significance, too often treated as a measure of the quality of a study's evidence and as a magical line that accepts results when p < .05 and rejects the evidence when p > .05. Let us all toast its demise. You can read their editorial here. To emphasize the importance of their rejection of this damaging standard, they published 43 supporting articles and made the entire issue of their journal freely available to the public.
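If you want to see the problem in miniature, here is a small Python sketch using made-up summary statistics and SciPy's two-sample t-test (the numbers are invented for illustration, not drawn from any real study). A trivial gain measured on thousands of students slips under the magical line, while a far larger gain measured on a handful of students does not:

```python
# A minimal sketch, using made-up summary statistics, of the bright-line problem:
# a trivial 0.05-standard-deviation gain "passes" with 10,000 students per group,
# while a ten-times-larger 0.5-standard-deviation gain "fails" with 20 per group.
from scipy.stats import ttest_ind_from_stats

studies = {
    "A (tiny effect, huge sample)":  dict(mean1=0.05, std1=1.0, nobs1=10_000,
                                          mean2=0.00, std2=1.0, nobs2=10_000),
    "B (large effect, small sample)": dict(mean1=0.50, std1=1.0, nobs1=20,
                                           mean2=0.00, std2=1.0, nobs2=20),
}

for name, summary in studies.items():
    t_stat, p_value = ttest_ind_from_stats(**summary)
    verdict = "accepted (p < .05)" if p_value < 0.05 else "dismissed (p > .05)"
    print(f"Study {name}: p = {p_value:.4f} -> {verdict}")

# Study A: p is roughly 0.0004 -> accepted; Study B: p is roughly 0.12 -> dismissed,
# even though Study B's reported effect is ten times larger.
```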

This bad practice is a monster that has haunted statistical interpretation for 100 years. It has terrified scholars, who would submit their papers for publication only with the monster at their side. And it has led to errors in judgment about the effectiveness of drugs (e.g., Vioxx), the wisdom of unemployment policies, and the effect of smaller class sizes on elementary school students.

Why does this matter to those of you leading schools and districts? It means that the last 100 years of education research has been filled with studies that have trumpeted big successes but whose evidence may neither affirm nor negate the treatment under study. Because the size of an effect and its practical importance have so often been pushed aside in favor of formal statistical significance, diligent practitioners need to reexamine any research they are counting on.

I checked one journal, the American Educational Research Journal, published by the prestigious AERA, to see how many articles its editors have published since 2010 that relied upon p-values in any way. The answer: 402 of the 452 articles published.

Here’s what will change in the statistical profession. Research published in the ASA’s journal, The American Statistician, will no longer get attention solely because of the formal statistical soundness of its evidence. Published research will have to meet more practical tests: how relevant the study is, how generalizable its findings are, and how large an effect the treatment has.
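That last test, the size of the effect, is something a careful reader can ask for directly. As a rough sketch (the scores below are invented, not taken from any real study), here is how an effect size such as Cohen’s d can be reported alongside the p-value:

```python
# A rough sketch with invented scores: report the size of the effect (Cohen's d)
# alongside the p-value, rather than letting p < .05 stand in for "it works."
import numpy as np
from scipy import stats

treated = np.array([78, 82, 75, 90, 85, 88, 80, 84])  # hypothetical post-test scores
control = np.array([74, 79, 73, 81, 77, 80, 76, 78])

diff = treated.mean() - control.mean()
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd  # standardized effect size: difference in pooled-SD units

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"mean difference = {diff:.1f} points, Cohen's d = {cohens_d:.2f}, p = {p_value:.3f}")
# A reader can now judge whether a roughly 5.5-point gain (d of about 1.3) matters
# in practice, not merely whether p crossed a threshold.
```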

“In particular, the use and misuse of P values is, arguably, the most widely perpetrated misdeed of statistical inference across all of science.” – John P. A. Ioannidis, Depts. of Medicine, Health Research and Policy, Biomedical Data Science, and Statistics, Stanford University, “What Have We (Not) Learned from Millions of Scientific Papers with P Values?”

This “aha!” moment has been a long time coming. Back in 1996, two critics of the field’s prevailing wisdom, Stephen Ziliak (then a grad student) and Deirdre McCloskey (his dissertation advisor), co-authored a paper, “The Standard Error of Regressions,” published in the Journal of Economic Literature. The paper is one of the most frequently cited in their field. They followed it up in 2008 with a book, The Cult of Statistical Significance. The book garnered a great deal of attention and stimulated much debate, and those who favored Ziliak and McCloskey’s broad critique far outnumbered those who defended the old order.

So now, 23 years after Ziliak and McCloskey’s article, professional statisticians have guidance from their association that the old rules will no longer be used to judge the articles they submit for publication, nor to determine the meaning of their research findings. In fact, the editors have told their members to stop using the term “statistical significance” altogether.

Let’s welcome this advance, even if it has been too long in coming. Consider it a “buyer beware” alert to all of us who depend upon education research. The next time you sit down to consider improvements and someone brings research to the table to justify a recommendation, put on your skeptic’s hat and start asking questions.