# The Metropolis-Hastings Algorithm: a Tutorial

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF-1(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at point x’. If P(x’) is greater than P(x), add x’ to the output sequence. Otherwise, if it is less, decide to add it to the output sequence with probability P(x’)/P(x).
3. If you have decided to add x’ to the output sequence, move to the new point and repeat the process from step 2 onwards (i.e. jump away from x’ to some x”, and if you add x”, then jump away from it to x”’ etc). If you did not add x’ to the sequence, then return to x and try to generate another x’ by jumping away again by a random amount in a random direction.

The PDF of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from x is achieved by adding some random noise to it, this is usually chosen to be a random number from a normal distribution centred at x.

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and depending on the result, either choose to stay at the new point or return to the old point (it turns out that in practice, this means choosing the lower point with probability P(x’)/P(x), but I will not bother proving why here). If you do this for an infinitely long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions much more than the lower regions, simply because you always accept upwards steps, whereas you only accept downwards steps a certain amount of the time.

A nifty trick is to not use the desired PDF  to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x) (this results in the same probability for deciding whether to accept a new point or not). Such proportional approximations are often easier to compute and can speed up the operation of the algorithm dramatically.

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte-Carlo algorithm. There are two parts to this; the first is “Markov chain” — this is simply referring to the fact that at each step of the algorithm we only consider the point we visited immediately previously; we do not remember more than just the last step we took in order to compute the next step. The second is “Monte Carlo” — this simply means that we are using randomness in the algorithm, and that the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence converges to the desired distribution as we draw more and more samples; a very small sequence may not look like it follows the desired probability distribution at all.

# The Shortest Bayes Classifier Tutorial You’ll Ever Read

The Bayes classifier is one of the simplest machine learning techniques. Yet despite its simplicity, it is one of the most powerful and flexible.

Being a classifier, its job is to assign a class to some input. It chooses the most likely class given the input. That is, it chooses the class that maximises $P(class | input)$.

Being a Bayes classifier, it uses Bayes’ rule to express this as the class that maximises $P(input | class)*P(class)$.

All you need to build a Bayes classifier is a dataset that allows you to empirically measure $P(class)$ and $P(input | class)$ for all combinations of input and class. You can then store these values and reuse them to calculate the most likely class for an unseen input. It’s as simple as that.

This concludes the shortest Bayes classifier tutorial you’ll ever read.

Appendix: what happened to the denominator in Bayes’ rule?

Okay, so I cheated a little bit by adding an appendix. Even so, the tutorial above is a complete description of the Bayes classifier. Those familiar with Bayes’ rule would complain that when I rephrased $P(class | input)$ as $P(input | class)*P(class)$, the denominator $P(input)$ is missing. This is correct; but since this denominator is independent of the value of class, it can safely be removed from the expression with the guarantee that the class that maximises it is the same as the class that would have maximised it if the denominator was still present. Look at it this way: say you want to find the value $x$ that maximises the function $f(x) = -x*x$. This is the same value of $x$ that maximises the function $g(x) = f(x)/5$, simply because the denominator, 5, is independent of the value of $x$. We are not interested in the actual output of $f(x)$ or $g(x)$, merely the value of $x$ that maximises either.

Appendix: the naïve Bayes classifier

The Bayes classifier above comes with a caveat, though: if you have even reasonably complicated input, procuring a dataset that allows you to reliably measure $P(input | class)$ for all unique combinations of input and class isn’t easy! For example, if you are building a binary classifier and your input consists of four features that can take on ten values each, that’s already 20,000 combinations of features and classes! A common way to remedy this problem is to regard each feature as independent of each other. That way you only need to empirically measure the likelihood of each value of each feature occurring given a certain class. You then estimate the likelihood of an entire set of features by multiplying together the likelihood of occurrence of each of its constituent feature values. This is a naïve assumption, and so results in the creation of a naïve Bayes classifier. This is also a purposely vague summary of the workings of a naïve Bayes classifier. I would recommend an Internet search for a more in-depth treatment.