I recently attended a conference where one of the plenary sessions included a discussion of the p-value controversy, and provided suggestions on what we should be teaching and using instead.

When I first spoke about the p-value controversy on this blog (Rao, 2019), I suggested that an understanding of the history of hypothesis testing may help clear up students’ confusion. To Fisher, p-values were a measure of the likelihood that a current hypothesis or theory could explain an observed phenomenon. Only when a hypothesis was so unlikely to be a sufficient explanation of a phenomenon could it be ruled out or rejected. It was through this lens that Fisher described p-values less than 0.05 as *significant*, choosing 0.05 as a *convenient* threshold (Fisher, 1925).

Wasserstein, Schirm, and Lazar (2019) promote many alternative statistics, one of which is the *effect size*. They advise that careful consideration be given to what a meaningful effect size would be for each individual study.

However, I am concerned that we are setting ourselves up for an effect-size controversy in the future, similar to the p-value controversy. Cohen’s d (Cohen, 1988) is one of the most famous measures of effect size, and comes with convenient thresholds for small, medium, and large effects. Sullivan & Feinn (2012) summarize many different measures of effect size and, for each one, include thresholds for determining the size of the effect.

If 0.2 is a small effect, and 0.5 is a medium effect, what is 0.35? Is 0.1 still a small effect, or is it no effect? What about 0.05?
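
To make these questions concrete, here is a minimal sketch in Python. It assumes the pooled-standard-deviation form of Cohen’s d for two independent groups, and it uses the conventional cut-offs of 0.2, 0.5, and 0.8 for small, medium, and large effects. The function names `cohens_d` and `label_effect` are my own illustrative choices, not part of any library, and the label for values under 0.2 is an assumption, since the conventions do not say what to call them.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def label_effect(d):
    """Hypothetical helper: bin |d| with the conventional 0.2 / 0.5 / 0.8 cut-offs."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    # Assumption: the conventions give no name for values under 0.2.
    return "below 'small'"


# The values asked about above, pushed through the fixed bins.
for d in (0.05, 0.1, 0.2, 0.35, 0.5):
    print(d, label_effect(d))
```

Under these bins, 0.35 is labelled exactly like 0.2, while 0.1 and 0.05 fall outside every named category. The labels come from where the cut-offs sit, not from any judgement about what is meaningful in a particular study.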

In my mind, these alternative procedures do not solve the problem at the root of the p-value controversy; they just redirect it. I still don’t know of any perfect solution, but I believe Wasserstein, Schirm, and Lazar’s (2019) recommendation to be *thoughtful* is what we must seek to impart to our students.

My favourite summary of recommendations thus far is one by Dr Andrew Zieffler (citation listed below). I plan on requiring my students to review these slides after reading the Biau, Jolles, & Porcher (2010) article, in an attempt to help set the stage for thoughtful interaction with statistical tools.

References and further reading:

Biau, D. J., Jolles, B. M., & Porcher, R. (2010). P value and the theory of hypothesis testing: An explanation for new researchers. *Clinical Orthopaedics and Related Research®*, *468*(3), 885-892.

Cohen, J. (1988). *Statistical power analysis for the behavioral sciences*. Hillsdale, NJ: Lawrence Erlbaum Associates.

Fisher, R.A. (1925). *Statistical methods for research workers*. London: Oliver and Boyd.

Rao, V.N.V. (2019, March 23). *The p-value controversy* [Blog post]. Retrieved from https://statisticaljourneys.home.blog/2019/03/23/the-p-value-controversy/

Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the P value is not enough. *Journal of Graduate Medical Education*, *4*(3), 279-282.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”. *The American Statistician*, *73*(S1), 1-19.

Zieffler, A. (2019). *Deprecating statistical significance: Toward better science* [Lecture slides]. Retrieved from http://www.datadreaming.org/post/2019-04-26-slhs-prosem/