Probability versus Likelihood

I sort of understand that you can’t state a probability for whether or not the population mean is in the interval because it either is or it isn’t (or at least I read that, but I don’t think I really understand it). If there is a 90% chance that the interval contains the population mean, I’m not really clear as to why this is not also the probability because doesn’t that mean there is a 10% chance that it does not contain the population mean, giving you a 0.9 probability that it does contain it? 

One of my students

The answer, somewhat unfortunately (and unintuitively, and confusingly), lies in semantics. The population mean is some fixed value – we just don’t know what it is. Therefore, there’s no random chance about it. It straight up is equal to some value. And here’s where more confusion comes into play.

Probability and likelihood are two different concepts. They’re basically opposite directions on the same two-way street. When we know population parameters and we’re examining outcomes from the population, that is probability. When we know a sample and we’re examining population parameters, that is likelihood. So, you can say something like ‘given that the mean of a distribution is 15, the probability that a random sample of size 12 has a sample mean greater than 18 is 0.342 (made-up number)’. And, you can say something like ‘given that we observed a sample mean of 42 from 20 observations, the likelihood that the true population mean is greater than 40 is 0.712 (made-up number)’. The difference is super subtle, and I’ll admit it probably doesn’t really matter at the end of the day. The authors of the textbook pull a tricky one by using the word ‘sure’ and contrasting it with probability, without making the difference explicit (which it absolutely intuitively is not). But this is why, if we’re going by the book, we can’t say something like ‘there’s a 90% chance the true mean is in our interval’ – it’s not even in the realm of probability to discuss the behaviour of population parameters given observed samples. We have to say things like ‘it’s 90% likely that the true mean is in our interval’.
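The ‘probability direction’ of the street can be made concrete with a small simulation. This is only a sketch of the made-up example above: the post never specifies a population spread or shape, so the normal distribution and the standard deviation of 5 below are my own assumptions for illustration.

```python
import random
import statistics

# Sketch of the made-up example above. The post gives no population spread,
# so the normal shape and sigma = 5 here are assumptions for illustration.
random.seed(1)

def prob_sample_mean_exceeds(mu, sigma, n, threshold, reps=20_000):
    """Estimate P(sample mean > threshold | population mean = mu)."""
    count = 0
    for _ in range(reps):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        if statistics.mean(sample) > threshold:
            count += 1
    return count / reps

# 'Given that the mean is 15, what is the probability that a random
# sample of size 12 has a sample mean greater than 18?'
estimate = prob_sample_mean_exceeds(mu=15, sigma=5, n=12, threshold=18)
```

Note the direction: we fixed the population parameter and asked about sample outcomes. That is the probability side of the street.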

Again, I admit, does this distinction really make a difference, a practically significant difference? No, probably not (no pun intended). But if you understand that using population parameters to describe the behaviour of samples and using observed samples to talk about population parameters are two different things, kind of like opposite directions on the same two-way street, then you’re in good shape. It’s just a matter of knowing which side of the street you’re driving on, and what the specific verbiage/jargon is to use on that side of the street.

The p-value controversy – Part 2

I recently attended a conference where one of the plenary sessions included a discussion of the p-value controversy, and provided suggestions on what we should be teaching and using instead.

When I first spoke about the p-value controversy on this blog (Rao, 2019), I suggested that an understanding of the history of hypothesis testing may help clear students’ confusion. To Fisher, p-values were a measure of the likelihood that a current hypothesis or theory could explain an observed phenomenon. Only when a hypothesis was highly unlikely to be a sufficient explanation of a phenomenon could it be ruled out or rejected. It was with this lens that Fisher described p-values less than 0.05 as significant, choosing 0.05 as a convenient threshold (Fisher, 1925).

Wasserstein, Schirm, and Lazar (2019) promote many alternate statistics, one of which is the effect size. They advise that careful consideration be made to determine what a meaningful effect size would be for each individual study.

However, I am concerned that we are setting ourselves up for an effect-size controversy in the future, similar to the p-value controversy. Cohen’s d (Cohen, 1988) is one of the most famous measures of effect size, and comes with convenient thresholds for small, medium, and large effects. Sullivan & Feinn (2012) summarize many different measures of effect size, and for all, include thresholds for determining the size of the effect.

If 0.2 is a small effect, and 0.5 is a medium effect, what is 0.35? Is 0.1 still a small effect, or is it no effect? What about 0.05?
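For concreteness, here is a minimal sketch of how Cohen’s d is typically computed (the pooled-standard-deviation version), on entirely made-up data. The point is that the number it produces still has to be interpreted against arbitrary thresholds.

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1) +
                  (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return mean_diff / pooled_var ** 0.5

# Made-up scores for two hypothetical groups
treatment = [23, 25, 28, 24, 26, 27, 25, 24]
control = [22, 24, 26, 23, 25, 24, 23, 22]
d = cohens_d(treatment, control)
# here d comes out a little above 1, past Cohen's 'large' threshold of 0.8 --
# but that label is a convention, not a property of the data
```

The formula is mechanical; deciding whether the resulting d is meaningful for a particular study is exactly the thoughtful step that cannot be automated away.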

In my mind, these alternate procedures do not solve the problem at the root of the p-value controversy; they just redirect it. I still don’t know of any perfect solution, but I believe Wasserstein, Schirm, and Lazar’s (2019) recommendation to be thoughtful is what we must seek to impart to our students.

My favourite summary of recommendations thus far is one by Dr Andrew Zieffler (citation listed below). I plan on requiring my students to review these slides after reading the Biau, Jolles, & Porcher (2010) article, in an attempt to help set the stage for thoughtful interaction with statistical tools.

References and further reading:

Biau, D. J., Jolles, B. M., & Porcher, R. (2010). P value and the theory of hypothesis testing: an explanation for new researchers. Clinical Orthopaedics and Related Research®, 468(3), 885-892.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Fisher, R.A. (1925). Statistical methods for research workers. London: Oliver and Boyd.

Rao, V.N.V. (2019, March 23). The p-value controversy [Blog post]. Retrieved from

Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the P value is not enough. Journal of Graduate Medical Education, 4(3), 279-282.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”. The American Statistician, 73(S1), 1-19.

Zieffler, A. (2019). Deprecating statistical significance: Toward better science [Lecture slides]. Retrieved from:

Correlation may be Causation

A common adage recited by many statistics instructors is that correlation is not causation. I am concerned that many students hear the adage, misinterpret it, and leave thinking that correlation, and by extension all of statistics, is useless in the real world.

A more precise wording is that correlation does not imply causation, and it is important to state it this way because causation does imply correlation. My guess is that we don’t say this in classrooms because we believe students don’t yet understand what ‘imply’ means.

I believe a better way to translate this would be to say ‘correlation may be causation, but it also may not be’. So, what can be done to ensure correlation is a sign of causation?

One answer lies in introducing students to causal inference (Neyman, 1923; Rubin, 1974). Causal inference fundamentally differs from traditional statistical inference. While statistical inference is limited to inferring relationships that exist under the specific conditions through which data were collected, causal inference explicitly attempts to infer relationships amidst changing conditions (Pearl, 2009).

Causal inference is still gaining early traction in the statistics education community. Oftentimes, teachers are unfamiliar with it, and thus hesitate to teach it. I believe that counterfactual reasoning through causal inference is an important part of teaching statistical literacy (Gal, 2002).

One key visual tool in causal inference is the exploration of relationships between variables using a directed acyclic graph (DAG).

For example, we may think that a person’s weight affects their blood cholesterol level.

But perhaps we realize that diet may also affect both characteristics.

Here, diet is explicitly acknowledged as a confounding variable. It must be considered in any research design attempting to speak to the relationship between weight and cholesterol level. Furthermore, we have a clear visual to help students understand why weight and cholesterol may be correlated, but may not have a causal relationship.
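A quick simulation can make the confounding story concrete. The structural equations below are entirely made up for illustration: diet drives both weight and cholesterol, and weight has no direct effect on cholesterol, yet the two end up strongly correlated.

```python
import random

# Toy simulation of the DAG above, with made-up structural equations:
# diet affects both weight and cholesterol; weight has NO direct
# effect on cholesterol.
random.seed(7)

n = 5000
diet = [random.gauss(0, 1) for _ in range(n)]                # confounder
weight = [2.0 * d + random.gauss(0, 1) for d in diet]        # caused by diet
cholesterol = [3.0 * d + random.gauss(0, 1) for d in diet]   # caused by diet only

def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

r = correlation(weight, cholesterol)
# r is large even though neither variable causes the other --
# the correlation is induced entirely by the confounder, diet
```

Running variations of this (e.g. removing diet’s effect on one variable) is a simple way to let students see a correlation appear and disappear as the causal structure changes.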

Unfortunately, I have not yet found a simple explanation I feel is appropriate for introductory students. This may be a future project of mine. In the meantime, I personally enjoyed reading about the history of causal inference in a paper by Freedman (1999), the first few sections of which may be accessible and informative.

References and further reading:

Freedman, D. (1999). From association to causation: some remarks on the history of statistics. Journal de la société française de statistique, 140(3), 5-32.

Neyman, J. 1923 [1990]. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical Science, 5 (4), 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed.

Pearl, J. (2009). Causal inference in statistics: An overview. Statistics surveys, 3, 96-146.

Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66, 688-701 .

Anyone can do statistics

My favourite Pixar movie is Ratatouille. Chef Gusteau’s motto is “tout le monde peut cuisiner”, or, “anyone can cook”. Later in the movie, Anton Ego, voiced by Peter O’Toole, writes “Not everyone can become a great artist, but a great artist can come from anywhere.”

Similarly, I believe that anyone can do statistics, or, that not everyone can become a statistician, but a statistician can come from anywhere.

I recently attended a talk where a prominent statistician and data analyst told the story of how she came into the field. It was entirely due to her instructors explicitly encouraging her and telling her that she had potential in the field.

While not all of our students will become statisticians, or major in statistics, or even master the material we attempt to teach them, we must encourage them, keep an open mind, and instill in them a positive identity as statisticians or data scientists/analysts (Cobb & Hodge, 2002). Our students will never believe in themselves if we do not believe in them first.

I came home from the talk and sent an email to my students from last semester to encourage them to consider furthering their skills, to explicitly state that I believe they have potential as data analysts, and to attempt to instill in them an identity as statisticians.

My message to them: Not only can you do statistics, you have done statistics. You are a statistician.

References and further reading:

Cobb, P., & Hodge, L. (2002). Learning, identity, and statistical data analysis. In Sixth International Conference on Teaching Statistics (ICOTS6), Cape Town, South Africa.

Lewis, B. (Producer), & Bird, B. (Director). (2007). Ratatouille [Motion Picture]. United States: Pixar.

My favourite number

On this blog’s homepage I state “I’ve been in love with numbers for as long as I can remember.” Even when I was a toddler I never wanted to practice reciting the alphabet – I preferred reciting numbers.

Yet, one number has always stood above the rest. It is my favourite number – 7.


I think I decided that 7 ought to be my favourite number when I was relatively young. I was 7 years old when my sister was born (technically, 6yrs 10mos 23days 17hrs and 39mins – and yes, I did that calculation the very day she was born). 7 was the first jersey number I had for soccer. 7 is the sum of the digits of my birth date.

However, I remember recognizing quite early on, no later than the age of 8, that the reciprocal of 7 was the most interesting reciprocal of all the numbers up to 12.

  • 1/2 = 0.5
  • 1/3 = 0.333333…
  • 1/4 = 0.25
  • 1/5 = 0.2
  • 1/6 = 0.166666…
  • 1/7 = 0.142857142857….
  • 1/8 = 0.125
  • 1/9 = 0.1111111…
  • 1/10 = 0.1
  • 1/11 = 0.090909…
  • 1/12 = 0.083333…

The reciprocals of 2, 4, 5, 8, and 10 all have finite decimal expansions. The reciprocals of 3, 6, 9, and 12 all end with a single digit repeating ad infinitum while the reciprocal of 11 ends with a repeating two-digit sequence. Yet 1/7 was in a class of its own. I didn’t know why 7 should have such a unique decimal expansion at the time, but I was captivated by it.
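For readers who want to explore further than an 8-to-10-digit calculator allows, here is a small sketch that extracts the repeating block of 1/n by simulating long division, tracking remainders until one repeats (the function name is my own, for illustration).

```python
def expansion_of_reciprocal(n):
    """Return (non-repeating digits, repeating block) of 1/n by long division."""
    digits, seen, remainder = [], {}, 1
    while remainder and remainder not in seen:
        seen[remainder] = len(digits)   # where this remainder first occurred
        remainder *= 10
        digits.append(str(remainder // n))
        remainder %= n
    if remainder == 0:                  # expansion terminates
        return "".join(digits), ""
    start = seen[remainder]             # the cycle begins at this digit
    return "".join(digits[:start]), "".join(digits[start:])
```

For example, `expansion_of_reciprocal(7)` returns an empty prefix with the repeating block `142857`, while `expansion_of_reciprocal(6)` returns the prefix `1` with the repeating block `6` (i.e. 0.1666…).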

A growing fancy

A few years later I realized that the pattern went deeper. Compare 1/3 (0.333…) to 2/3 (0.666…): the decimal expansions have the same form – a single repeating digit – but the digit that repeats is different. The same holds for the fractions of 6, 9, 11, and 12. However, the fractions of 7 do something entirely different:

  • 1/7 = 0.142857142857…
  • 2/7 = 0.285714285714…
  • 3/7 = 0.428571428571…
  • 4/7 = 0.571428571428…
  • 5/7 = 0.714285714285…
  • 6/7 = 0.857142857142…

Each of the fractions has the same six digits repeating in the same cyclic order – 1, 4, 2, 8, 5, 7 – just with a different starting point! 7 was simply outlapping the other numbers in terms of mystique. Why were the same digits repeating for each fraction, and why were they in the same order?

I began searching for other numbers with this pattern, but armed with only pen and paper, and calculators that displayed just 8 to 10 digits, my search was limited.

Getting serious with number theory

I held on to this intrigue with the number 7 into college, and approached the professor of my number theory class with the question ‘why does 1/7 have such unique patterns?’. He then showed me a whole new dimension to the number 7. It was as if I was peering through Lewis Carroll’s looking glass into a hitherto unknown world of exotic beauty.

He explained that the pattern occurs because we operate in base 10, and 10 is a primitive root modulo 7. That means that a string of six 9’s in a row is divisible by 7 (important since 6 = 7 − 1), i.e. 999999/7 is an integer, and no shorter string of 9’s divided by 7 is (9/7 is not an integer, 99/7 is not an integer, etc.). Any prime with this property is called a full repetend prime, and all full repetend primes exhibit the same properties I described above for the number 7. The first five full repetend primes are 7, 17 (meaning that a string of sixteen 9’s is the shortest string of 9’s divisible by 17), 19, 23, and 29.
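This characterization is easy to check computationally: a prime p not dividing 10 is a full repetend prime exactly when the multiplicative order of 10 mod p is p − 1. A small sketch:

```python
def is_full_repetend_prime(p):
    """True when p is prime, coprime to 10, and the multiplicative order of
    10 mod p is p - 1 (equivalently, the shortest run of 9's divisible by p
    has length p - 1)."""
    if p < 7 or any(p % d == 0 for d in range(2, int(p ** 0.5) + 1)):
        return False  # composite, or one of 2, 3, 5 (none qualify)
    k, power = 1, 10 % p
    while power != 1:           # find the multiplicative order of 10 mod p
        power = power * 10 % p
        k += 1
    return k == p - 1

full_repetend = [p for p in range(2, 100) if is_full_repetend_prime(p)]
```

The first four entries past 29 are 47, 59, 61, and 97 – each of their reciprocals cycles through all p − 1 digits of its repetend, just like 1/7 does.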

He then showed me the property of 9’s, also called Midy’s Theorem. If we recall the repetend of 1/7, i.e. the part that repeats, or 142857:

  • 1+4+2+8+5+7 will be divisible by 9 (it equals 3*9)
  • 14+28+57 will be divisible by 99 (it equals 99)
  • 142+857 will equal 999

Similarly, for the repetend of 1/17, 0588235294117647:

  • 0+5+8+8+2+3+5+2+9+4+1+1+7+6+4+7 will be divisible by 9 (it equals 8*9)
  • 05+88+23+52+94+11+76+47 will be divisible by 99 (it equals 4*99)
  • 0588+2352+9411+7647 will be divisible by 9999 (it equals 2*9999)
  • 05882352+94117647 will equal 99999999

All full repetend primes have this property. We went on to discuss many other things, including discrete logarithms and other properties of cyclic numbers and prime reciprocals, and I went on to discover and play with subclasses of the full repetend primes, but I never lost any love or interest for my favourite number, 7.
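These block sums are easy to check by machine for any full repetend prime. A small sketch, using the two repetends quoted above (the helper name is mine, for illustration):

```python
def block_sum(repetend, block_len):
    """Split a repetend into equal blocks of block_len digits and sum them."""
    return sum(int(repetend[i:i + block_len])
               for i in range(0, len(repetend), block_len))

rep7 = "142857"             # repetend of 1/7
rep17 = "0588235294117647"  # repetend of 1/17
```

For instance, `block_sum(rep7, 3)` adds 142 + 857 and gives 999, and `block_sum(rep17, 8)` adds 05882352 + 94117647 and gives 99999999, exactly as listed above.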

Teaching sampling variability

To my mind, the most important foundational concept in statistical inference is an appreciation of sampling variability. Chance, del Mas, & Garfield (2004) lay out a vision of what students need to understand and what they should be able to do with that understanding. However, I wouldn’t reach as far as they do. I believe that the core understanding rests on only a small subset of their list.

The main understanding students should have, in my opinion, is that given a population parameter, some values of a sample statistic are more likely than others to be the result of a sample from that population. This should manifest in a student’s ability to make statements about how far a sample statistic is likely to vary from a population parameter, and vice versa.

Developing such an understanding in students is no trivial matter. There seems to be consensus in the statistics education research community that the use of simulations can help develop students’ understanding of sampling variability (Garfield et al., 2008).

I particularly like an activity designed by Scheaffer et al. (1996) called What is a confidence interval anyway?. The instructor resources present a scatterplot relating population proportions to their likely sample proportions (Figure 8, page 274).

Printed below is an adaptation of this scatterplot demonstrating how a student might use it to determine that the likely values of a population proportion are between approximately 65% and 75% after determining that their sample proportion is 0.70 from a sample of size 100.

I particularly like this tool as I believe it helps to frame the idea of inference quite nicely. We never know what the true population parameters are. However, the theory of sampling distributions tells us something about how sample statistics behave in relation to those parameters.

Each of the vertical bars represents the likely sample proportions we might get when we sample from a population with the given population proportion. When we take only one sample, we can never know for sure the exact value of the population parameter, but certain options begin to look increasingly unlikely. Use of this scatterplot may guide students toward a more multiplicative conception of a sample (Saldanha & Thompson, 2002).
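The idea behind the scatterplot can be sketched in a short simulation. To be clear about my assumptions: the middle-95% bands and the grid of candidate proportions below are my own choices, and the activity’s Figure 8 may use a different criterion, so the interval this produces is only illustrative.

```python
import random

# For each candidate population proportion p, simulate many samples of
# size n and keep p if the observed sample proportion falls within the
# middle 95% of the simulated sample proportions (the "vertical bar").
random.seed(42)

def plausible_proportions(observed, n=100, reps=1000):
    plausible = []
    grid = [round(0.40 + 0.01 * i, 2) for i in range(56)]  # p = 0.40 .. 0.95
    for p in grid:
        phats = sorted(sum(random.random() < p for _ in range(n)) / n
                       for _ in range(reps))
        lo, hi = phats[int(0.025 * reps)], phats[int(0.975 * reps) - 1]
        if lo <= observed <= hi:
            plausible.append(p)
    return plausible

likely = plausible_proportions(0.70)
# `likely` is a band of population proportions around the observed 0.70
```

Students can read the result the same way they read the scatterplot: the population proportions kept by the simulation are exactly those whose vertical bars cover the observed sample proportion.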

I believe such an activity can help improve students’ ability to make statements about how far sample statistics are likely to vary from a population parameter, and vice versa. However, by focusing on only this one learning objective, as opposed to the full list of recommendations by Chance et al. (2004), would I be doing a disservice to our students in their future work and studies in statistics, or would this provide a sufficient foundation for them to become statistically literate?

References and further reading:

Chance, B., del Mas, R., & Garfield, J. (2004). Reasoning about sampling distributions. In The challenge of developing statistical literacy, reasoning and thinking (pp. 295-323). Springer, Dordrecht.

Garfield, J. B., Ben-Zvi, D., Chance, B., Medina, E., Roseth, C., & Zieffler, A. (2008). Learning to Reason About Statistical Inference. In Developing Students’ Statistical Reasoning (pp. 261-288). Springer, Dordrecht.

Saldanha, L., & Thompson, P. (2002). Conceptions of sample and their relationship to statistical inference. Educational studies in mathematics, 51(3), 257-270.

Scheaffer, R. L., Watkins, A., Gnanadesikan, M., & Witmer, J. A. (1996). What Is a Confidence Interval Anyway? In Activity-Based Statistics: Instructor Resources (pp. 274-278). Springer, New York, NY.

Visualizing Data

I’m often asked by colleagues and students, “what makes a good data visualization?”. I believe data visualizations (and any data analysis) are story telling tools. As such, I have two criteria when I create or evaluate them:

  • (1) The visualization should convey only one main pattern. There is a tendency to create very complex and multi-faceted visualizations. However, stories are told one plot point at a time. Similarly, visualizations should each focus on one pattern at a time within a larger narrative.
  • (2) Ignoring all labels and text, the pattern should be immediately noticeable. The value of the visualization is in serving as an aid that highlights an otherwise obscure pattern. Thus, a good aid should make the pattern strikingly obvious and universally evident.

This may seem limiting, but I believe that even complex patterns can be elicited from good data visualizations within a few seconds. TED talks by Hans Rosling and David McCandless excellently demonstrate this power.

One of my favourite examples, and one of the most famous early examples of data visualization, is by Florence Nightingale. Her visualization, printed below, helped showcase the need for hospital sanitation.

Let’s examine Florence Nightingale’s graph under my two criteria:

Florence Nightingale’s Data Visualization

Ignoring all text, I immediately see a lot of blue. That blue area represents the number of deaths from preventable or mitigable diseases. This indeed is the pattern Florence Nightingale wished to highlight – a lot of soldiers were dying unnecessarily. Thus, this is an effective visualization of the data and helps tell a story.

Florence Nightingale’s Data Table

Although we could have also examined a data table to arrive at the same conclusion, it is much less dramatic and perhaps a harder pattern to see. This is the power of a good data visualization – it presents patterns in numbers with colours and shapes, broadening prospective audiences’ ability to consume the information at hand.

We owe a great debt to innovators like Florence Nightingale and can strive to exemplify her creativity when we present the stories contained in our data sets.

I like to imagine the conversation Florence Nightingale had after she created this visualization went something like this:

  • General: I don’t see why we need to waste time and money cleaning the hospital
  • Florence: Do you see all that blue?
  • General: Yeah…
  • Florence: That’s how many soldiers we could have saved if we had clean hospitals
  • General: We’ve got to do something about that blue! How much time and money do you need to clean the hospital?

The p-value controversy

I was first introduced to the p-value controversy by an epidemiologist in 2014. The controversy is about the use and practice of null hypothesis significance testing (Trafimow, 2014; Wasserstein, 2016; Wellek, 2017). It is essentially due to the prevalence of mindless hypothesis testing procedures, also known as the null ritual (Gigerenzer, 2004).

Why does the controversy exist? Ultimately, it’s because statistical reasoning and inference, as well as the hypothesis testing procedure, are difficult to understand (del Mas, 2004). So widespread is the confusion that the null ritual has even been labelled as tyrannical (England, 1991; Stang, Poole, & Kuss, 2010).

There’s so much confusion that even the controversy itself can be misunderstood as an indictment of p-values and the hypothesis testing procedure, whereas it is simply a recommendation that one should be thoughtful and not use statistical tools blindly (Wasserstein, Schirm, & Lazar, 2019).

Surely, one part of the problem, and one part of any solution, is the statistics classroom. If people fail to understand statistics, is it not the responsibility of statisticians, as stewards of the field, to help remedy the situation?

I think understanding the evolution and origins of the p-value in hypothesis testing can go a long way in helping. Most people would find it surprising to hear that the three people credited with its development would likely balk at its current practice (Gigerenzer, 2004).

There are many articles that discuss the origins of the procedure, but one of my favourites is by Biau, Jolles, & Porcher (citation listed below). I strongly recommend that everyone read that paper (and require it of my students) – it is short and very accessible – and at the very least, it should prove illuminating.

References and Further Reading:

Biau, D. J., Jolles, B. M., & Porcher, R. (2010). P value and the theory of hypothesis testing: an explanation for new researchers. Clinical Orthopaedics and Related Research®, 468(3), 885-892.

del Mas, R. C. (2004). A comparison of mathematical and statistical reasoning. In The challenge of developing statistical literacy, reasoning and thinking (pp. 79-95). Springer, Dordrecht.

England, C. (1991). On the tyranny of hypothesis testing in the social sciences. Contemporary psychology, 36(2), 102-105.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36(1), 1-2.

Stang, A., Poole, C., & Kuss, O. (2010). The ongoing tyranny of statistical significance testing in biomedical research. European journal of epidemiology, 25(4), 225-230.

Wasserstein, R. L. (2016). ASA statement on statistical significance and P-values.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”. The American Statistician, 73(S1), 1-19.

Wellek, S. (2017). Author response to the contributors to the discussion on “A critical evaluation of the current ‘p‐value controversy’”. Biometrical Journal, 59(5), 897-900.