In this post I’d like to describe an issue that is almost never addressed in statistics courses, but should be, because it causes a lot of mistaken inferences. It is so pervasive that I routinely see papers published in refereed journals that make this mistake. So if you can’t be bothered to read through the rest of this post, there are three things you must take away.
- A non-significant result does not allow you to “accept” the null hypothesis.
- A high statistical power does not allow you to “accept” the null hypothesis.
- If you find yourself wanting to “prove” the null hypothesis when you are testing whether one variable affects another in a meaningful way, the proper way to do it is through equivalence testing.
Important thing #1. A non-significant result does not allow you to accept the null hypothesis.
This is a ridiculously common mistake. Suppose you’re comparing the heights of a group of men against the heights of a group of women using a t-test. The t-test spits out a p-value of 0.3, which is higher than your chosen significance level 0.05. Surely this means that the null hypothesis, which is that the group means are equal, is true, right?
Wrong!
Okay, what if the t-test spits out a p-value of 0.99? Surely this means that there is a 99% chance that the group means are equal, right?
Wrong!
If your p-value is greater than your significance level, you cannot conclude that the group means are equal. You can only conclude that your data do not refute the hypothesis that your group means are equal. The p-value is the probability of the t-test statistic being at least as extreme as the one you observed, assuming the group means are equal. It is not the probability that the group means are equal.
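To make this concrete, here is a minimal simulation sketch. It assumes a hypothetical true height difference of 2 cm, a known standard deviation of 7 cm, and 10 people per group (all numbers are made up for illustration), and it swaps in a two-sample z-test with known standard deviation for the t-test so that it runs on the Python standard library alone. The point is only that a real difference can routinely produce non-significant p-values.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(a, b, sd):
    """Two-sided p-value for a difference in means, assuming a known common sd.

    This is a z-test stand-in for the t-test discussed in the text.
    """
    se = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def fraction_non_significant(true_diff, sd, n, alpha=0.05, trials=2000, seed=0):
    """Share of simulated experiments with p > alpha although the null is false."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Hypothetical populations: the baseline mean of 170 cm is an assumption.
        men = [rng.gauss(170 + true_diff, sd) for _ in range(n)]
        women = [rng.gauss(170, sd) for _ in range(n)]
        if z_test_p_value(men, women, sd) > alpha:
            misses += 1
    return misses / trials

frac = fraction_non_significant(true_diff=2.0, sd=7.0, n=10)
print(f"non-significant in {frac:.0%} of experiments despite a real 2 cm difference")
```

In every one of these simulated experiments the null hypothesis is false by construction, so treating a non-significant result as proof of the null would be wrong each time it occurs.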
The logic we engage in with null hypothesis testing is this:
- If the null were true, we would not observe this data. (N → ¬D)
- We have observed this data. (D)
- Therefore the null is not true. (∴¬N)
This logic is sound because if N were true, D would be false. D is true, therefore N must not be true.
The faulty logic we engage in when we try “accepting” the null is this:
- If the null were true, we would not observe this data. (N → ¬D)
- We have not observed this data. (¬D)
- Therefore the null is true. (∴N)
This logic is unsound because D being false does not allow us to draw any inferences about N. It may be true or false. If D is true, we know that N must not be true, but there may be many reasons for D being false other than N. This fallacy has a name: affirming the consequent. Consider the following analogy, which has exactly the same form as our faulty logic:
- If the petrol tank is empty, the car will not move. (N → ¬D)
- The car will not move. (¬D)
- Therefore the petrol tank is empty. (∴N)
Do you see why this is wrong? There might be many other reasons that the car will not move: for example, the ignition may be broken, the wheels may be missing, or the car may have hit a wall. However, it is perfectly sound to say that if the car moves (D), then we reject the null hypothesis that the petrol tank is empty (∴¬N).
Bottom line: a non-significant p-value is not evidence of the null.
Important thing #2. A high statistical power does not allow you to “accept” the null hypothesis.
“Okay fine, I get that you can’t use a non-significant p-value to support the null hypothesis,” I hear you say. “But I recall from my stats course that the power of a statistical test is the probability of correctly rejecting the null hypothesis. Surely this means that if my statistical power is pretty high, say 0.95, and my t-test fails to reject the null hypothesis, then there is a 95% chance that there is really no difference between the groups?”
Wrong!
Let us look a little more carefully at the actual definition of the power of a statistical test, and what it is useful for. The power of a statistical test is defined as the probability of rejecting the null hypothesis, given that the null hypothesis is indeed false. Power nearly always depends on (a) the level of statistical significance α at which you wish to reject the null hypothesis, (b) the magnitude M of the effect size of interest, and (c) the size S of the sample. So the power of a t-test is the probability that you observe a difference large enough in your sample S to be significant at the level α, given that there exists a true difference M. This is useful if you want to calculate how large your sample should be to be reasonably confident of detecting a true difference of a certain magnitude (or alternatively, what magnitude of difference you will be reasonably confident of detecting for a certain sample size).
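As a rough sketch of how α, M, and S feed into power, here is a calculation for a two-sided, two-sample z-test. The normal approximation and the known standard deviation `sd` are my assumptions to keep the example standard-library only; a proper t-test power calculation uses the noncentral t distribution and gives slightly different numbers. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sample(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect      -- true difference between group means (M in the text)
    sd          -- common standard deviation, assumed known
    n_per_group -- sample size in each group (S in the text)
    alpha       -- significance level
    """
    se = sd * sqrt(2 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability that the test statistic lands in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def n_for_power(effect, sd, target_power=0.95, alpha=0.05):
    """Smallest per-group sample size whose approximate power reaches the target."""
    n = 2
    while power_two_sample(effect, sd, n, alpha) < target_power:
        n += 1
    return n

# Hypothetical numbers: a 5 cm true difference with a 7 cm standard deviation.
for n in (10, 30, 100):
    print(n, round(power_two_sample(effect=5.0, sd=7.0, n_per_group=n), 3))
print("n per group for 95% power:", n_for_power(effect=5.0, sd=7.0))
```

Notice that the true difference `effect` is an input here. Power is a statement about what happens if a difference of that size exists; it says nothing about whether it does.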
“But that’s what I said!” I hear you say. “So if you’re 95% confident of detecting a true difference, and you fail to detect a difference, this must mean that there is a high probability that there is no difference, right?”
Wrong!
A high statistical power is just as useless in conclusively telling you anything about whether the null hypothesis is true. Concretely, in the case of a t-test, a high statistical power combined with a non-significant p-value does not allow you to claim that the null hypothesis is true.
To see why this is the case, consider the reasoning that we think we follow when we do this:
- If the null hypothesis is false, then we will reject it. (¬N → R)
- We have not rejected it. (¬R)
- Therefore the null hypothesis is true. (∴N)
This is actually logically valid, and has exactly the same form as the logic we engage in when we do standard null hypothesis significance testing. The problem is that it is an incorrect translation of the problem into logic. Power is a conditional probability, and it is not okay to go from the probability ℙ(reject null | null is false) to ℙ(null is false → reject null). The latter is the same as ℙ(¬(null is false) OR reject null), i.e. ℙ(null is true OR reject null). The former is the proportion of tests of true effects that detect them, whereas the latter is the proportion of tests that detect effects plus the proportion of tests where there genuinely was no effect.
Consider this analogy, courtesy of David Poole and Alan Mackworth:
“Suppose you have a domain where birds are relatively rare, and non-flying birds are a small proportion of the birds. Here P(¬flies | bird) would be the proportion of birds that do not fly, which would be low. P(bird →¬flies) is the same as P(¬bird ∨ ¬flies), which would be dominated by non-birds and so would be high. Similarly, P(bird →flies) would also be high, the probability also being dominated by the non-birds. It is difficult to imagine a situation where the probability of an implication is the kind of knowledge that is appropriate or useful.”
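Putting numbers on the analogy makes the gap obvious. The counts below are invented purely for illustration (100 birds among 10,000 animals, 5 of them flightless), and the function name is hypothetical.

```python
def conditional_vs_implication(flying_birds, flightless_birds, non_birds):
    """Compare P(not flies | bird) with the probability of the implications.

    An implication "bird -> X" is true for every non-bird, whether or not
    that non-bird can fly, so all non-birds count towards it.
    """
    birds = flying_birds + flightless_birds
    total = birds + non_birds
    p_noflies_given_bird = flightless_birds / birds
    # bird -> not flies  ==  (not bird) OR (not flies)
    p_bird_implies_noflies = (non_birds + flightless_birds) / total
    # bird -> flies  ==  (not bird) OR flies
    p_bird_implies_flies = (non_birds + flying_birds) / total
    return p_noflies_given_bird, p_bird_implies_noflies, p_bird_implies_flies

cond, impl_noflies, impl_flies = conditional_vs_implication(
    flying_birds=95, flightless_birds=5, non_birds=9900
)
print(f"P(not flies | bird)  = {cond:.4f}")
print(f"P(bird -> not flies) = {impl_noflies:.4f}")
print(f"P(bird -> flies)     = {impl_flies:.4f}")
```

Both implications come out close to 1 even though they point in opposite directions, because the non-birds dominate. Swap "bird" for "null is false" and "flies" for "reject null" and you have the mistranslation described above.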
Moreover, when power is calculated after the fact from the observed effect (so-called observed or post hoc power), it is a function of the p-value for a given significance level α; that is to say, they have a 1:1 relationship. The relationship is inverse: the lower the p-value, the higher the observed power. Why? Because, all else being equal, a larger observed effect gives both a lower p-value and a higher estimate of the power to detect an effect of that size.
Bearing this in mind, it is contradictory to use high observed power as evidence for the null, since high observed power corresponds to a low p-value, and we (correctly) use a low p-value as evidence against the null. If you run two tests with non-significant results and one of them has a higher observed power than the other, it does not provide more evidence for the null, because it must also have a lower p-value, which is evidence against the null. This fact is explained in much greater detail by Hoenig and Heisey in “The Abuse of Power” (pdf).
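Here is a small sketch of that one-to-one relationship for power computed after the fact from the observed effect. It assumes a two-sided z-test, so the exact numbers are my simplification rather than Hoenig and Heisey's own calculation, and the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Power computed by plugging the observed effect back in as the true effect.

    For a two-sided z-test this depends only on the p-value and alpha.
    Expects 0 < p_value <= 1.
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)

for p in (0.001, 0.01, 0.05, 0.3, 0.99):
    print(f"p = {p:<5}  observed power = {observed_power(p):.3f}")
```

Under these assumptions, any non-significant result has an observed power of at most about one half, and the power falls as the p-value rises, so it carries no information beyond the p-value itself.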
Bottom line: a high statistical power is not evidence of the null.
Important thing #3. If you find yourself wanting to “prove” the null hypothesis when you are testing whether one variable affects another in a meaningful way, the proper way to do it is through equivalence testing.
It is often the case that the very thing you want to prove is the absence of an effect. In this situation, you cannot use any test that assumes the absence of an effect as its null hypothesis. As the core of null hypothesis significance testing is proof by contradiction, you need to use a test that assumes the presence of an effect, and then show that the observed data is very unlikely under that assumption.
I will not attempt to describe these techniques in detail in this post, except to say that they are generally referred to as “equivalence testing” (a one-sided version of this, which assumes the presence of an effect in a particular direction, is often referred to as a “noninferiority test”). A very common way of doing this is through what is known as the “two one-sided tests” or TOST procedure, explained very thoroughly by David Streiner in “Unicorns Do Exist” (pdf). It basically boils down to picking an “equivalence interval” such that our null hypothesis is “the difference between means lies outside this equivalence interval”. The alternative hypothesis then becomes “the difference between means lies within the equivalence interval”, i.e. the difference between means is sufficiently small that we consider them equivalent.
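To give a flavour of the idea, here is a minimal TOST sketch using a large-sample normal approximation. It is not Streiner's exact procedure (which uses t-tests), and the function name, the symmetric `margin`, and the simulated data are all my assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_p_value(a, b, margin):
    """Two one-sided tests for equivalence of two means (z-approximation).

    The equivalence interval is (-margin, +margin). A small return value is
    evidence that the true difference lies inside that interval.
    """
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    std_normal = NormalDist()
    # Null 1: true difference <= -margin. Reject if diff is well above -margin.
    p_lower = 1 - std_normal.cdf((diff + margin) / se)
    # Null 2: true difference >= +margin. Reject if diff is well below +margin.
    p_upper = std_normal.cdf((diff - margin) / se)
    # Equivalence is claimed only if BOTH nulls are rejected.
    return max(p_lower, p_upper)

# Simulated example: two groups drawn from the same hypothetical population.
rng = random.Random(1)
group_a = [rng.gauss(170, 7) for _ in range(200)]
group_b = [rng.gauss(170, 7) for _ in range(200)]
print("TOST p-value with a 2 cm margin:", tost_p_value(group_a, group_b, margin=2.0))
```

The choice of margin has to be justified on substantive grounds before looking at the data: it encodes what “small enough not to matter” means in your field.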
Bottom line: to show the absence of an effect, use a test where the null hypothesis is the presence of the effect.
That’s it!
Thanks for sticking around, and hopefully you have taken away 3 important things about “accepting” the null hypothesis. I urge you to look into equivalence testing in more detail and become comfortable and familiar with its techniques, and encourage your colleagues to be aware of these common fallacies. Additionally, I have to give credit to this wonderful series of blog posts that inspired this one. To conclude, remember:
- A high p-value does not mean the null hypothesis is true
- Neither does high power
- To show the absence of an effect, use equivalence testing
Best of luck!