I recently attended a conference where one of the plenary sessions included a discussion of the p-value controversy, and provided suggestions on what we should be teaching and using instead.
When I first wrote about the p-value controversy on this blog (Rao, 2019), I suggested that an understanding of the history of hypothesis testing may help clear up students’ confusion. To Fisher, p-values were a measure of the likelihood that a current hypothesis or theory could explain an observed phenomenon. Only when a hypothesis was so unlikely to be a sufficient explanation of a phenomenon could it be ruled out or rejected. It was through this lens that Fisher described p-values less than 0.05 as significant, choosing 0.05 as a convenient threshold (Fisher, 1925).
Wasserstein, Schirm, and Lazar (2019) promote many alternative statistics, one of which is the effect size. They advise that careful consideration be given to what a meaningful effect size would be for each individual study.
However, I am concerned that we are setting ourselves up for an effect-size controversy in the future, similar to the p-value controversy. Cohen’s d (Cohen, 1988) is one of the most famous measures of effect size, and comes with convenient thresholds for small, medium, and large effects. Sullivan & Feinn (2012) summarize many different measures of effect size, and for all, include thresholds for determining the size of the effect.
If 0.2 is a small effect, and 0.5 is a medium effect, what is 0.35? Is 0.1 still a small effect, or is it no effect? What about 0.05?
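To make these questions concrete, here is a minimal Python sketch. It assumes the common pooled-standard-deviation form of Cohen's d for two independent groups, and it treats the conventional cut-offs (0.2, 0.5, and Cohen's 0.8 for a large effect) as hard bin edges. The function names `cohens_d` and `conventional_label`, and the "below small" label, are illustrative choices rather than anything taken from the cited sources.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def conventional_label(d):
    """Map |d| onto the conventional cut-offs, treated here as hard bin edges (an assumption)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"  # hypothetical label; the conventions do not name this region


# The values asked about above, plus the two named thresholds
for d in (0.05, 0.1, 0.2, 0.35, 0.5):
    print(d, conventional_label(d))
```

Read this way, 0.35 is binned with 0.2 even though it sits midway between "small" and "medium", and 0.1 and 0.05 fall into a region the conventions leave unnamed. The labels come entirely from where the cut-offs are drawn, not from anything about the study itself.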
In my mind, these alternative procedures do not solve the problem at the root of the p-value controversy; they just redirect it. I still don’t know of any perfect solution, but I believe Wasserstein, Schirm, and Lazar’s (2019) recommendation to be thoughtful is what we must seek to impart to our students.
My favourite summary of recommendations thus far is one by Dr Andrew Zieffler (citation listed below). I plan on requiring my students to review these slides after reading the Biau, Jolles, & Porcher (2010) article, in an attempt to help set the stage for thoughtful interaction with statistical tools.
References and further reading:
Biau, D. J., Jolles, B. M., & Porcher, R. (2010). P value and the theory of hypothesis testing: an explanation for new researchers. Clinical Orthopaedics and Related Research, 468(3), 885-892.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Fisher, R.A. (1925). Statistical methods for research workers. London: Oliver and Boyd.
Rao, V.N.V. (2019, March 23). The p-value controversy [Blog post]. Retrieved from https://statisticaljourneys.home.blog/2019/03/23/the-p-value-controversy/
Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the P value is not enough. Journal of Graduate Medical Education, 4(3), 279-282.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”. The American Statistician, 73(S1), 1-19.
Zieffler, A. (2019). Deprecating statistical significance: Toward better science [Lecture slides]. Retrieved from: http://www.datadreaming.org/post/2019-04-26-slhs-prosem/.