# The Metropolis-Hastings Algorithm: a Tutorial

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF⁻¹(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.
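To make this concrete, here is a minimal sketch of inverse transform sampling (the exponential target and the function name are my own illustrative choices, not from the earlier post). The exponential distribution is a convenient example because its CDF, F(x) = 1 − e^(−λx), inverts in closed form to F⁻¹(u) = −ln(1 − u)/λ:

```python
import numpy as np

def inverse_transform_sample(n, lam=1.0, rng=None):
    """Draw n samples from an Exponential(lam) distribution
    via inverse transform sampling."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(0.0, 1.0, size=n)  # draw from Uniform(0, 1)
    return -np.log(1.0 - u) / lam      # return CDF^-1(u)

samples = inverse_transform_sample(10_000, lam=2.0)
```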

Calculating the CDF requires integrating the PDF. This method therefore only works when our known PDF is simple enough to integrate easily, which is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. So what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at point x’. If P(x’) is greater than or equal to P(x), accept x’ and add it to the output sequence. Otherwise, accept it with probability P(x’)/P(x).
3. If you accepted x’, move to the new point and repeat the process from step 2 onwards (i.e. jump away from x’ to some x”, and if you accept x”, then jump away from it to x”’, and so on). If you rejected x’, stay at x, record x in the output sequence once more, and try to generate another x’ by jumping away again by a random amount in a random direction.

The distribution of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from x is achieved by adding some random noise to it; this is usually done by drawing the proposed point from a normal distribution centred at x.
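Here is a minimal sketch of the procedure above for a one-dimensional target (the function names, the Gaussian proposal width, and the example target are my own illustrative choices):

```python
import numpy as np

def metropolis(pdf, x0, n_samples, proposal_scale=1.0, rng=None):
    """Sample from a (possibly unnormalised) pdf with the Metropolis
    algorithm, using a Gaussian proposal centred at the current point."""
    rng = rng or np.random.default_rng()
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.normal(0.0, proposal_scale)  # jump away from x
        # Always accept an uphill move; accept a downhill move
        # with probability pdf(x_new) / pdf(x).
        if rng.uniform() < pdf(x_new) / pdf(x):
            x = x_new
        samples.append(x)  # on rejection, the current x is recorded again
    return np.array(samples)

target = lambda x: np.exp(-0.5 * x**2)  # standard normal, up to a constant
chain = metropolis(target, x0=0.0, n_samples=50_000)
```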

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and, depending on the result, either stay at the new point or return to the old point (the coin is biased so that you stay at the lower point with probability P(x’)/P(x), though I will not bother proving why that is the right bias here). If you do this for a very long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions far more than the lower ones, simply because you always accept upwards steps, whereas you only accept downwards steps some of the time.

A nifty trick is to not use the desired PDF to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x). This yields exactly the same acceptance probability, because the decision depends only on a ratio: if f(x) = c·P(x), then f(x’)/f(x) = P(x’)/P(x), and the constant c cancels. Such unnormalised densities are often much easier to compute, and can speed up the operation of the algorithm dramatically.
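The sketch above already exploits this: the target was written without its 1/√(2π) normalising factor. Any density known only up to a constant works the same way. For example (again assuming the illustrative metropolis function from above):

```python
# The normalising constant of exp(-x**4 + x**2) has no elementary
# closed form, but Metropolis never needs it: the constant cancels
# in the ratio f(x_new) / f(x).
f = lambda x: np.exp(-x**4 + x**2)
chain = metropolis(f, x0=0.0, n_samples=50_000)
```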

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte Carlo (MCMC) algorithm. There are two parts to this. The first is “Markov chain”: at each step of the algorithm we only consider the point we visited immediately before; no earlier history is needed to compute the next step. The second is “Monte Carlo”: we use randomness in the algorithm, and the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence only converges to the desired distribution as we draw more and more samples; a very short sequence may not look like it follows the desired probability distribution at all.
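As a quick, purely illustrative check of that convergence claim, we can compare short and long chains from the sampler sketched earlier. The symmetric target above has mean 0, which only the longer chain should reliably reflect:

```python
short = metropolis(target, x0=3.0, n_samples=100)
long_run = metropolis(target, x0=3.0, n_samples=100_000)
# The short chain starts far from the bulk of the distribution and has
# typically not converged; the long chain's mean should be near 0.
print(short.mean(), long_run.mean())
```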

# Data Science vs Data Analysis vs Data Mining: What’s the Difference?

This is a question that I often get asked by people new to data science. Because these are subjective, evolving terms, this question will never have a definitive answer. However, I think of it like this:

Data analysis is, at its most basic, the act of drawing an inference from some data. Something as simple as looking at a set of 10 numbers and calculating their average constitutes data analysis.

Data mining is, most generally, data analysis that has been partially or fully automated. Data mining is strongly associated with large datasets, which is no surprise: the ability to automate analysis is most valuable precisely where there is too much data to analyse by hand.

Data science is the most nebulous term of the three. It’s better to think of data science as a craft, rather than a specific activity. The ultimate aim of a data scientist is simply to draw inferences from data; in that sense, they are data analysts. But a data scientist is also equipped with the knowledge and skills to manage this process from end to end:

1. to gather the data, and store and process it until it is in a form suitable for analysis,
2. to perform the analysis, and
3. to present the results of the analysis in a manner useful to the person who needs it.

Data science has emerged as a separate discipline largely because data analysis has shifted from data-poor to data-rich, and the transition has been extremely swift. People who were trained extensively to perform steps 2 and 3, because they were trained for a world where those steps were the bottleneck, now find themselves choked by their inability to do step 1 well, simply because of the sheer volume, variety, and velocity of the data. Conventional data processing methods simply do not scale to data-rich environments. It is common knowledge in the industry that in data analysis, 90% of the time is spent preparing the data, and 10% of the time is spent doing actual science. These figures are not exaggerated.

Data scientists can not only perform all of steps 1–3, but, importantly, can do so in a way that scales, so that human effort is redistributed more effectively between the steps. This is one of the best ways to tell whether you have hired a true data scientist, or merely a statistician pretending to the title.