# The Metropolis-Hastings Algorithm: a Tutorial

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF-1(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at point x’. If P(x’) is greater than P(x), add x’ to the output sequence. Otherwise, if it is less, decide to add it to the output sequence with probability P(x’)/P(x).
3. If you have decided to add x’ to the output sequence, move to the new point and repeat the process from step 2 onwards (i.e. jump away from x’ to some x”, and if you add x”, then jump away from it to x”’ etc). If you did not add x’ to the sequence, then return to x and try to generate another x’ by jumping away again by a random amount in a random direction.

The PDF of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from x is achieved by adding some random noise to it, this is usually chosen to be a random number from a normal distribution centred at x.

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and depending on the result, either choose to stay at the new point or return to the old point (it turns out that in practice, this means choosing the lower point with probability P(x’)/P(x), but I will not bother proving why here). If you do this for an infinitely long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions much more than the lower regions, simply because you always accept upwards steps, whereas you only accept downwards steps a certain amount of the time.

A nifty trick is to not use the desired PDF  to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x) (this results in the same probability for deciding whether to accept a new point or not). Such proportional approximations are often easier to compute and can speed up the operation of the algorithm dramatically.

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte-Carlo algorithm. There are two parts to this; the first is “Markov chain” — this is simply referring to the fact that at each step of the algorithm we only consider the point we visited immediately previously; we do not remember more than just the last step we took in order to compute the next step. The second is “Monte Carlo” — this simply means that we are using randomness in the algorithm, and that the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence converges to the desired distribution as we draw more and more samples; a very small sequence may not look like it follows the desired probability distribution at all.

# The Kolmogorov-Smirnov Test: an Intuition

The Kolmogorov–Smirnov test (K–S test) tests if two probability distributions are equal.  Therefore, you can compare an empirically observed distribution with a known reference distribution, or you can compare two observed distributions, to test whether they match.

It works in really quite a simple manner. Let the cumulative distribution functions of the two distributions be CDFA and CDFB respectively. We simply measure the maximum difference between these two functions for any given argument. This maximum difference is known as the Kolmogorov-Smirnov statistic, D, and is given by:

$D = \max_x{(| CDF_A(x) - CDF_B(x) |)}$

You can think about it this way: if you plotted of CDFA and CDFB together on the same set of axes, D is the length of the largest vertical line you could draw between the two plots.

To perform the Kolmogorov-Smirnov test, one simply compares D to a table of thresholds for statistical significance. The thresholds are calculated under the null hypothesis that the distributions are equal. If D is too big, the null hypothesis is rejected. The threshold for significance depends on the size of your sample (as your sample gets smaller, your D needs to get larger to show that the two distributions are different) and, of course, on the desired significance level.

The test is non-parametric or distribution-free, which means it makes no assumptions about the underlying distributions of the data. It is useful for one-dimensional distributions, but does not generalise easily to multivariate distributions.

# How To Generate Any Probability Distribution

In this post I’d like to briefly describe one of my favourite algorithmic techniques: inverse transform sampling. Despite its scary-sounding name, it is actually quite a simple and very useful procedure for generating random numbers from an arbitrary, known probability distribution — given random numbers drawn from a uniform distribution. For example, if you had empirically observed from a database that a variable took on some probability distribution, and you wanted to simulate similar conditions, you would need to draw a random variable with that same distribution. How would you go about doing this?

In essence, it is simply a two-step process:

1. Generate a random value x from the uniform distribution between 0 and 1.
2. Return the value y such that = CDF(y), where CDF is the cumulative distribution function of the probability distribution you wish to achieve.

Why does this work? Step 1 picks a uniformly random value between 0 and 1, so you can interpret this as a probability. Step 2 inverts the desired cumulative distribution function; you are calculating y = CDF-1(x), and therefore the returned value y is such that a random variable drawn from that distribution is less than or equal to y with probability x.

Thinking in terms of the original probability density function, we are uniformly randomly choosing a proportion of the area under the curve of the PDF and returning the number in the domain such that exactly this proportion of the area occurs to the left of that number. So numbers in the regions of the PDF with greater areas are more likely to occur. The uniform distribution is thereby projected onto this desired PDF.