Why you can’t accept the null hypothesis

In this post I’d like to describe an issue that is almost never addressed in statistics courses, but should be, because it causes a lot of mistaken inferences. It is an issue so pervasive that I routinely see papers published in refereed journals that make this mistake. So if you can’t be bothered to read through the rest of this post, here are the three things you must take away.

1. A non-significant result does not allow you to “accept” the null hypothesis.
2. A high statistical power does not allow you to “accept” the null hypothesis.
3. If you find yourself wanting to “prove” the null hypothesis when you are testing whether one variable affects another in a meaningful way, the proper way to do it is through equivalence testing.
Important thing #1. A non-significant result does not allow you to accept the null hypothesis.

This is a ridiculously common mistake. Suppose you’re comparing the heights of a group of men against the heights of a group of women using a t-test. The t-test spits out a p-value of 0.3, which is higher than your chosen significance level 0.05. Surely this means that the null hypothesis, which is that the group means are equal, is true, right?

Wrong!

Okay, what if the t-test spits out a p-value of 0.99? Surely that means there is a 99% chance that the group means are equal, right?

Wrong!

If your p-value is greater than your significance level, you cannot conclude that the group means are equal. You can only conclude that your data does not refute the hypothesis that your group means are equal. The p-value is the probability of observing a test statistic at least as extreme as the one you actually observed, assuming the group means are equal.
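One way to see this: when the null hypothesis is actually true, every p-value is equally likely, so a p of 0.99 is no more probable than a p of 0.01. Here is a quick simulation sketch that illustrates this (the heights and standard deviation are made up, and a large-sample z approximation stands in for the t-test):

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def two_sample_p(a, b):
    """Two-sided test of equal means using a large-sample z
    approximation (standing in for the t-test at this sample size)."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))

rng = random.Random(0)
# Both groups are drawn from the SAME height distribution,
# so the null hypothesis is true in every simulated experiment.
pvals = [
    two_sample_p([rng.gauss(175, 7) for _ in range(100)],
                 [rng.gauss(175, 7) for _ in range(100)])
    for _ in range(2000)
]
```

Roughly 5% of these p-values fall below 0.05 and roughly 10% fall above 0.9, exactly the uniform behaviour you would expect under a true null. A p-value of 0.99 is therefore no special comfort to anyone hoping to read it as evidence for the null.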

The logic we engage in with null hypothesis testing is this:

1. If the null were true, we would not observe this data. (N → ¬D)
2. We have observed this data. (D)
3. Therefore the null is not true. (∴¬N)

This logic is sound (it is the classic modus tollens): if N were true, D would be false. D is true, therefore N must not be true.

The faulty logic we engage in when we try “accepting” the null is this:

1. If the null were true, we would not observe this data. (N → ¬D)
2. We have not observed this data. (¬D)
3. Therefore the null is true. (∴N)

This logic is unsound because D being false does not allow us to draw any inferences about N: it may be true or false. If D is true, we know that N must not be true, but there may be many reasons for D being false other than N. This fallacy has a name: affirming the consequent. Consider the following analogy, which follows exactly the same faulty logic:

1. If the petrol tank is empty, the car will not move. (N → ¬D)
2. The car does not move. (¬D)
3. Therefore the petrol tank is empty. (∴N)

Do you see why this is wrong? There might be many other reasons that the car will not move; for example, the ignition may be broken, the wheels may be missing, or the car may have hit a wall. However, it is perfectly sound to say that if the car moves (D), then we reject the null hypothesis that the petrol tank is empty (∴¬N).

Bottom line: a non-significant p-value is not evidence of the null.

Important thing #2. A high statistical power does not allow you to “accept” the null hypothesis.

“Okay fine, I get that you can’t use a non-significant p-value to support the null hypothesis,” I hear you say. “But I recall from my stats course that the power of a statistical test is the probability of correctly rejecting the null hypothesis. Surely this means that if my statistical power is pretty high, say 0.95, and my t-test fails to reject the null hypothesis, then there is a 95% chance that there is really no difference between the groups?”

Wrong!

Let us look a little more carefully at the actual definition of the power of a statistical test, and what it is useful for. The power of a statistical test is defined as the probability of rejecting the null hypothesis, given that the null hypothesis is indeed false. Power nearly always depends on (a) the level of statistical significance α at which you wish to reject the null hypothesis, (b) the magnitude M of the effect size of interest, and (c) the size S of the sample. So the power of a t-test is the probability that you observe a difference large enough in your sample S to be significant at the level α, given that there exists a true difference M. This is useful if you want to calculate how large your sample should be to be reasonably confident of detecting a true difference of a certain magnitude (or alternatively, what magnitude of difference you will be reasonably confident of detecting for a certain sample size).
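To make (a), (b) and (c) concrete, here is a minimal sketch of the power calculation for a two-sided, two-sample test of means. It uses a normal approximation (an exact t-based calculation differs slightly for small samples), and the function name and parameters are my own:

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means:
    n subjects per group, true difference `effect`, common sd `sigma`.
    Normal approximation; a t-based calculation differs slightly."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # significance threshold
    delta = effect / (sigma * sqrt(2 / n))      # standardised true difference
    # Probability the observed difference lands beyond either threshold.
    return nd.cdf(delta - z_crit) + nd.cdf(-delta - z_crit)
```

Note how power rises with the sample size n and with the effect size, and falls as α is made stricter; and if the true effect is zero, the “power” collapses to α itself, which is exactly why power says nothing about a true null.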

“But that’s what I said!” I hear you say. “So if you’re 99% confident of detecting a true difference, and you fail to detect a true difference, this must mean that there is a high probability that there is no difference, right?”

Wrong!

A high statistical power is just as useless in conclusively telling you anything about whether the null hypothesis is true. Concretely, in the case of a t-test, a high statistical power combined with a non-significant p-value does not allow you to claim that the null hypothesis is true.

To see why this is the case, consider the reasoning that we think we follow when we do this:

1. If the null hypothesis were false, then we will reject it. (¬N → R)
2. We have not rejected it. (¬R)
3. Therefore the null hypothesis is true. (∴N)

This is actually logically sound, and is exactly the same logic we engage in when we do standard null hypothesis significance testing. The problem is that it is an incorrect translation of the problem into logic. In this case, it is not okay to go from the probability ℙ(reject null | null is false) to ℙ(null is false → reject null). The latter is the same as ℙ(¬(null is false) OR reject null), i.e. ℙ(null is true OR reject null). The former is the proportion of tests that detect true effects, whereas the latter is the proportion of tests that detect effects plus the proportion of tests where there genuinely was no effect.

Consider this analogy, courtesy of David Poole and Alan Mackworth:

“Suppose you have a domain where birds are relatively rare, and non-flying birds are a small proportion of the birds. Here P(¬flies | bird) would be the proportion of birds that do not fly, which would be low. P(bird →¬flies) is the same as P(¬bird ∨ ¬flies), which would be dominated by non-birds and so would be high. Similarly, P(bird →flies) would also be high, the probability also being dominated by the non-birds. It is difficult to imagine a situation where the probability of an implication is the kind of knowledge that is appropriate or useful.”

Moreover, since we calculate the power directly from the significance level α we wish to achieve, the observed power is a function of the p-value; that is to say, they have a 1:1 relationship. And it is an inverse relationship: the lower the p-value, the higher the power. Why? Because, all else being equal, a test with a higher power should be able to detect an effect at a higher significance level, and therefore a lower p-value.

Bearing this in mind, it is contradictory to use high power as evidence for the null, since a high power corresponds to a low p-value, and we (correctly) use a low p-value as evidence against the null. If you run two tests and one of them has a higher power than the other, it does not provide more evidence for the null because it must also simultaneously have a lower p-value, which is evidence against the null. This fact is explained in much greater detail by Hoenig and Heisey in “The Abuse of Power” (pdf).

Bottom line: a high statistical power is not evidence of the null.

Important thing #3. If you find yourself wanting to “prove” the null hypothesis when you are testing whether one variable affects another in a meaningful way, the proper way to do it is through equivalence testing.

It is often the case that the very thing you want to prove is the absence of an effect. In this situation, you cannot use any test that assumes the absence of an effect as its null hypothesis. As the core of null hypothesis significance testing is proof by contradiction, you need to use a test that assumes the presence of an effect, and then show that the observed data is very unlikely under that assumption.

I will not attempt to describe these techniques in detail in this post, except to say that they are generally referred to as “equivalence testing” (a one-sided version of this, which assumes the presence of an effect in a particular direction, is often referred to as a “noninferiority test”). A very common way of doing this is through what is known as a “two one-sided test” or TOST test, explained very thoroughly by David Streiner in “Unicorns Do Exist” (pdf). It basically boils down to picking an “equivalence interval” such that our null hypothesis is “the difference between means lies outside this equivalence interval”. The alternative hypothesis then becomes “the difference between means lies within the equivalence interval”, i.e. the difference between the means is sufficiently small that we consider them equivalent.
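To give a flavour of TOST, here is a minimal sketch using a large-sample normal approximation (a proper TOST would use t distributions; the function name and the choice of a symmetric interval of half-width `delta` are my own):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_equivalence(a, b, delta, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of two group means,
    via a large-sample normal approximation. `delta` is the half-width
    of the equivalence interval: we claim equivalence only if the data
    rule out a true difference outside (-delta, +delta)."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    nd = NormalDist()
    p_lower = 1 - nd.cdf((diff + delta) / se)   # H0: difference <= -delta
    p_upper = nd.cdf((diff - delta) / se)       # H0: difference >= +delta
    p = max(p_lower, p_upper)                   # both H0s must be rejected
    return p, p < alpha
```

Note that the null hypotheses here assume the presence of an effect (a difference at least as large as delta), so rejecting both of them is positive evidence of equivalence, which is exactly what a non-significant t-test cannot give you.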

Bottom line: to show the absence of an effect, use a test where the null hypothesis is the presence of the effect.

That’s it!

Thanks for sticking around, and hopefully you have taken away 3 important things about “accepting” the null hypothesis. I urge you to look into equivalence testing in more detail and become comfortable and familiar with its techniques, and encourage your colleagues to be aware of these common fallacies. Additionally, I have to give credit to this wonderful series of blog posts that inspired this one. To conclude, remember:

1. A high p-value does not mean the null hypothesis is true
2. Neither does high power
3. To show the absence of an effect, use equivalence testing

Best of luck!

How To Generate Any Probability Distribution, Part 2: The Metropolis-Hastings Algorithm

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF⁻¹(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at a point x’. If P(x’) is greater than or equal to P(x), accept the move. Otherwise, accept the move with probability P(x’)/P(x).
3. If you accepted the move, add x’ to the output sequence and repeat the process from step 2 starting at x’ (i.e. jump away from x’ to some x”, and so on). If you rejected the move, add the current point x to the output sequence again, and try another jump away from x. (The repeats matter: recording the current point on every rejection is what makes the sequence spend the right proportion of time at each point.)

The distribution of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from x is achieved by adding some random noise to it; the new point is usually drawn from a normal distribution centred at x.
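The steps above can be sketched in a few lines. This is a deliberately minimal one-dimensional version (the step size, seed and target density are my own choices for illustration):

```python
import math
import random

def metropolis(pdf, x0, n_samples, step=1.0, seed=0):
    """Minimal Metropolis sampler for a one-dimensional target.
    `pdf` only needs to be proportional to the true density."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)        # random jump
        # Accept with probability min(1, P(x') / P(x)).
        if rng.random() < pdf(proposal) / pdf(x):
            x = proposal
        samples.append(x)  # a rejection repeats the current point
    return samples

# Target: standard normal, deliberately left unnormalised.
draws = metropolis(lambda x: math.exp(-x * x / 2), x0=0.0, n_samples=20000)
```

With enough samples, the mean and variance of `draws` settle near 0 and 1, even though the target density was never normalised.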

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and depending on the result, either choose to stay at the new point or return to the old point (it turns out that in practice, this means choosing the lower point with probability P(x’)/P(x), and there is a proof of this which I am omitting). If you do this for an infinitely long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions much more than the lower regions, simply because you always accept upwards steps, whereas you only accept downwards steps a certain amount of the time.

A nifty trick is not to use the desired PDF to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x) (the acceptance probability comes out the same either way, since the normalising constants cancel in the ratio). Such proportional approximations are often easier to compute and can speed up the operation of the algorithm dramatically.

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte Carlo algorithm. There are two parts to this; the first is “Markov chain” — this is simply referring to the fact that at each step of the algorithm we only consider the point we visited immediately previously; we do not remember more than just the last step we took in order to compute the next step. The second is “Monte Carlo” — this simply means that we are using randomness in the algorithm, and that the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence converges to the desired distribution as we draw more and more samples; a very small sequence may not look like it follows the desired probability distribution at all.

There is one snag with Metropolis-Hastings: it might be too slow for some applications, because it can need quite a lot of samples before the generated distribution starts to match the desired distribution. One improvement is called Hamiltonian Monte Carlo. Instead of jumping in a random direction according to a normal distribution, think of being a ball rolling around the hilly area — as it goes down slopes, it rolls faster and gathers momentum, which it loses when it climbs up slopes. In practice, Hamiltonian Monte Carlo achieves a better approximation of the desired distribution in many fewer samples than Metropolis-Hastings.

The Kolmogorov-Smirnov Test: an Intuition

The Kolmogorov–Smirnov test (K–S test) tests if two probability distributions are equal. Therefore, you can compare an empirically observed distribution with a known reference distribution, or you can compare two observed distributions, to test whether they match.

It works in really quite a simple manner. Let the cumulative distribution functions of the two distributions be CDF_A and CDF_B respectively. We simply measure the maximum difference between these two functions for any given argument. This maximum difference is known as the Kolmogorov-Smirnov statistic, D, and is given by:

$D = \max_x{(| CDF_A(x) - CDF_B(x) |)}$

You can think about it this way: if you plotted CDF_A and CDF_B together on the same set of axes, D is the length of the longest vertical line you could draw between the two plots.
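For two empirical samples, D is straightforward to compute directly from that picture. A minimal sketch (quadratic-time for clarity; real implementations sort and merge):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    gap between the two empirical CDFs (O(n^2) for clarity)."""
    d = 0.0
    for x in a + b:  # the gap can only peak at an observed point
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Identical samples give D = 0, completely disjoint samples give D = 1, and everything else lands in between.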

To perform the Kolmogorov-Smirnov test, one simply compares D to a table of thresholds for statistical significance. The thresholds are calculated under the null hypothesis that the distributions are equal. If D is too big, the null hypothesis is rejected. The threshold for significance depends on the size of your sample (as your sample gets smaller, your D needs to get larger to show that the two distributions are different) and, of course, on the desired significance level.

The test is non-parametric or distribution-free, which means it makes no assumptions about the underlying distributions of the data. It is useful for one-dimensional distributions, but does not generalise easily to multivariate distributions.

How To Generate Any Probability Distribution, Part 1: Inverse Transform Sampling

In this post I’d like to briefly describe one of my favourite algorithmic techniques: inverse transform sampling. Despite its scary-sounding name, it is actually quite a simple and very useful procedure for generating random numbers from an arbitrary, known probability distribution — given random numbers drawn from a uniform distribution. For example, if you had empirically observed from a database that a variable took on some probability distribution, and you wanted to simulate similar conditions, you would need to draw a random variable with that same distribution. How would you go about doing this?

In essence, it is simply a two-step process:

1. Generate a random value x from the uniform distribution between 0 and 1.
2. Return the value y such that x = CDF(y), where CDF is the cumulative distribution function of the probability distribution you wish to achieve.

Why does this work? Step 1 picks a uniformly random value between 0 and 1, so you can interpret this as a probability. Step 2 inverts the desired cumulative distribution function; you are calculating y = CDF⁻¹(x), and therefore the returned value y is such that a random variable drawn from that distribution is less than or equal to y with probability x.

Thinking in terms of the original probability density function, we are uniformly randomly choosing a proportion of the area under the curve of the PDF and returning the number in the domain such that exactly this proportion of the area occurs to the left of that number. So numbers in the regions of the PDF with greater areas are more likely to occur. The uniform distribution is thereby projected onto this desired PDF.

This is a really neat algorithm. But what do you do if you don’t know the CDF of the distribution you want to sample from? I discuss a solution in Part 2 of this series: How To Generate Any Probability Distribution, Part 2: The Metropolis-Hastings Algorithm

The Shortest Bayes Classifier Tutorial You’ll Ever Read

The Bayes classifier is one of the simplest machine learning techniques. Yet despite its simplicity, it is one of the most powerful and flexible.

Being a classifier, its job is to assign a class to some input. It chooses the most likely class given the input. That is, it chooses the class that maximises $P(class | input)$.

Being a Bayes classifier, it uses Bayes’ rule to express this as the class that maximises $P(input | class)*P(class)$.

All you need to build a Bayes classifier is a dataset that allows you to empirically measure $P(class)$ and $P(input | class)$ for all combinations of input and class. You can then store these values and reuse them to calculate the most likely class for an unseen input. It’s as simple as that.
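That description translates almost line for line into code. A minimal sketch with made-up function names, assuming hashable inputs and a dataset small enough to count directly:

```python
from collections import Counter

def train_bayes(data):
    """data: a list of (input, class) pairs with hashable inputs.
    Returns a classifier built from the empirical counts."""
    class_counts = Counter(c for _, c in data)
    joint_counts = Counter(data)          # counts of (input, class) pairs
    n = len(data)

    def classify(x):
        # Choose the class maximising P(input | class) * P(class).
        return max(class_counts, key=lambda c:
                   (joint_counts[(x, c)] / class_counts[c])   # P(input | class)
                   * (class_counts[c] / n))                   # P(class)

    return classify
```

Training is just counting, and classification is just a lookup followed by a multiplication, which is why the tutorial above really is complete.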

This concludes the shortest Bayes classifier tutorial you’ll ever read.

Appendix: what happened to the denominator in Bayes’ rule?

Okay, so I cheated a little bit by adding an appendix. Even so, the tutorial above is a complete description of the Bayes classifier. Those familiar with Bayes’ rule would complain that when I rephrased $P(class | input)$ as $P(input | class)*P(class)$, the denominator $P(input)$ is missing. This is correct; but since this denominator is independent of the value of class, it can safely be removed from the expression with the guarantee that the class that maximises it is the same as the class that would have maximised it if the denominator was still present. Look at it this way: say you want to find the value $x$ that maximises the function $f(x) = -x*x$. This is the same value of $x$ that maximises the function $g(x) = f(x)/5$, simply because the denominator, 5, is independent of the value of $x$. We are not interested in the actual output of $f(x)$ or $g(x)$, merely the value of $x$ that maximises either.

Appendix: the naïve Bayes classifier

The Bayes classifier above comes with a caveat, though: if you have even reasonably complicated input, procuring a dataset that allows you to reliably measure $P(input | class)$ for all unique combinations of input and class isn’t easy! For example, if you are building a binary classifier and your input consists of four features that can take on ten values each, that’s already 20,000 combinations of features and classes! A common way to remedy this problem is to regard each feature as independent of the others, given the class. That way you only need to empirically measure the likelihood of each value of each feature occurring given a certain class. You then estimate the likelihood of an entire set of features by multiplying together the likelihood of occurrence of each of its constituent feature values. This is a naïve assumption, and so results in the creation of a naïve Bayes classifier. This is also a purposely vague summary of the workings of a naïve Bayes classifier. I would recommend an Internet search for a more in-depth treatment.
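The naïve variant changes only the counting: instead of one table over whole inputs, you keep one small table per feature. A sketch in the same spirit as before (again with made-up names):

```python
from collections import Counter, defaultdict

def train_naive_bayes(data):
    """data: a list of (features, class) pairs, features being a tuple.
    Assumes the features are independent of one another given the
    class (the 'naive' assumption)."""
    class_counts = Counter(c for _, c in data)
    # (feature position, class) -> counts of observed values
    value_counts = defaultdict(Counter)
    for features, c in data:
        for i, v in enumerate(features):
            value_counts[(i, c)][v] += 1
    n = len(data)

    def classify(features):
        def score(c):
            p = class_counts[c] / n                    # P(class)
            for i, v in enumerate(features):           # product of P(feature_i | class)
                p *= value_counts[(i, c)][v] / class_counts[c]
            return p
        return max(class_counts, key=score)

    return classify
```

With four ten-valued features and two classes, this needs counts for only 4 × 10 × 2 = 80 feature-value combinations instead of 20,000 whole inputs, which is the entire point of the naïve assumption.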

Data Science vs Data Analysis vs Data Mining: What’s the Difference?

This is a question that I often get asked by people new to data science. Because these are subjective, evolving terms, this question will never have a definitive answer. However, I think of it like this:

Data analysis is literally just the act of drawing an inference from some data. Something as simple as looking at a set of 10 numbers and calculating their average can constitute data analysis.

Data mining is, most generally, when the act of data analysis is partially or fully automated. Data mining is strongly associated with large datasets, which you would expect, given that the ability to automate analysis is particularly useful with large datasets.

Data science is the most nebulous and vague term of the three. It’s better to think of data science as a craft, rather than a specific activity. The ultimate aim of a data scientist is simply to draw inferences from data; in that sense they are simply data analysts. But a data scientist is also equipped with the knowledge and skills to manage this process from end to end:

1. to gather the data, and store and process it until it is in a form suitable for analysis,
2. to perform the analysis, and
3. to present the results of the analysis in a manner useful to the person who needs it.

Much of the reason that data science has emerged as a separate entity is because of the transition of data analysis from data-poor to data-rich. The transition has been extremely swift. People who were trained extensively to perform steps 2 and 3, because they were trained to work in a world where those steps were the bottleneck, are now choked by their inability to do step 1 well, simply because of the sheer volume, variety, and velocity of the data. Conventional data processing methods simply do not scale to data-rich environments. It is common knowledge in the industry that in data analysis, 90% of the time is spent preparing the data, and 10% of the time is spent doing actual science. These figures are not exaggerated.

Data scientists can not only do all steps 1-3, but importantly should be able to do them in a way that scales, such that the human effort is redistributed more effectively between the steps. This is one of the best ways to tell whether you have hired a true data scientist, or merely a statistician pretender.

Why Certain Special Characters Reduce The SMS Character Limit To 70

I recently noticed that the character count of a text message I was drafting on my iPhone suddenly changed from “x/160” to “x/70” (here’s how to display a character count in Messages, if you didn’t already know).

Perplexed, I turned to the Internet for an answer, and found one quite quickly on this MacRumors thread.

It basically boils down to this: An SMS may contain up to 140 bytes (= 1120 bits) of data. UK mobile networks use the GSM standard. The basic GSM character set is encoded using 7 bits per character, which allows for a text message to consist of 1120/7 = 160 characters.

It is only possible to represent 128 different characters with 7 bits. This suffices to capture all common English characters. A few additional special characters (mostly punctuation) can be specified using the basic character set “extension”, which requires 14 bits for every character in the extended set.

However, support for the vast majority of foreign language characters comes in the form of the 16-bit UTF-16 alphabet. If you have a mix of English and foreign language characters in your text message, the entire message must be sent in UTF-16, which reduces the number of available characters to 1120/16 = 70 characters. This explains the phenomenon I was experiencing.

I know what you’re thinking: this sucks for those who text in languages other than English. Thankfully, the GSM standard has a solution called “national language shift tables”. In this scheme, several 7-bit character sets are recognised, each corresponding to the most commonly used characters of a particular language. The first four bytes of the text message indicate the specific character set to use, and the rest of the message (136 bytes, or 1088 bits) can be used for the actual content of the message, allowing for a respectable compromise of 1088/7 ≅ 155 characters.
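The arithmetic behind all three limits fits in a few lines:

```python
SMS_BYTES = 140
PAYLOAD_BITS = SMS_BYTES * 8                 # 1120 bits per single SMS

gsm7 = PAYLOAD_BITS // 7                     # basic 7-bit GSM alphabet
utf16 = PAYLOAD_BITS // 16                   # UTF-16 fallback
shifted = (PAYLOAD_BITS - 4 * 8) // 7        # 4-byte header selects the shift table
```

These come out to 160, 70 and 155 characters respectively, matching the limits discussed above.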

Using characters that belong to multiple shift tables in the same text triggers a fallback to UTF-16, but the idea is to capture the vast majority of text communication.

If this has piqued your interest, Wikipedia has a fairly comprehensive article about GSM 03.38.