# Setwise Comparison: a faster, more consistent way to make judgements

I originally wrote this post in 2016 for the Sparrho blog.

Have you ever wondered whether doctors are consistent in their judgements? In some cases, they really aren’t. When asked to rate videos of patients with multiple sclerosis (a disease that causes impaired movement) on a numeric scale from 0 being completely healthy to 4 being severely impaired, clinicians struggled to be consistent, often giving the same patient different scores at different times, and disagreeing amongst themselves. This difficulty is quite common, and not unique to doctors — people often have to assign scores to difficult, abstract concepts, such as “How good was a musical performance?” or “How much do you agree or disagree with this statement?” Time and time again, it has been shown through research that people are fundamentally inconsistent at this type of activity, no matter the setting or level of expertise.

The field of ‘machine learning’, which can help to automate such scoring (e.g. automatically rating patients according to their disability), is based on the method that we can give the computer a set of examples for which the score is known, in the hope that the computer can use these to ‘learn’ how to assign scores to new, unseen examples. But if the computer is taught from examples where the score is inconsistently assigned, the result is that the computer learns to assign inconsistent, unusable scores to new, unseen examples.

To solve this problem, we brought together an understanding of how humans work with some mathematical tricks. The fundamental insight is that it is easier and more consistent for humans to provide preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”). The problem is, even if you have as few as 50 items to assign scores, you already have 50 x 49 = 2450 ways of pairing them together. This balloons to nearly 10,000 comparisons when you have 100 items. Clearly, this doesn’t scale. So we scale this using a mathematical insight: namely, that if you’ve compared A to B, and B to C, you can guess with reasonably high accuracy what the relationship is between A and C. This ‘guessing’ is done with a computer algorithm called TrueSkill, which was originally invented to help rank people playing multiplayer games by their skill, so that they could be better matched to online opponents. Using TrueSkill, we can reduce the number of comparisons required by a significant amount, so that increasing the number of items no longer results in a huge increase in comparisons. This study has advanced our understanding of how people quantify difficult concepts, and has presented a new method which balances the strengths of people and computers to help people efficiently and consistently provide scores to many items.

### Why is this important for researchers in fields other than computer vision?

This study shows a new way to quickly and consistently have humans rate items on a continuous scale (e.g. “rate the happiness of the individual in this picture on a scale of 1 to 5”). It works through the use of preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”), combined with an algorithmic ranking system which can reduce the need to compare every item with every other item. This was initially motivated by the need to have higher-quality labels for machine learning systems, but can be applied in any domain where humans have difficulty placing items along a scale. In our study we showed that clinicians can use our method to achieve far higher consistency than was previously thought possible in their assessment of motor illness.

We built a nifty tool to help clinicians perform Setwise Comparison, which you can see in the video below: https://www.youtube.com/watch?v=Q1hW-UXU3YE

### Why is this important for researchers in the same field?

This study describes a novel method for efficiently eliciting high-consistency continuous labels, which can be used as training data for machine learning systems, when the concept being labelled has unclear boundaries — a common scenario in several machine learning domains, such as affect recognition, automated sports coaching, and automated disease assessment. Label consistency is improved through the use of preference judgements, that is, labellers sort training data on a continuum, rather than providing absolute value judgements. Efficiency is improved through the use of comparison in sets (as opposed to pairwise comparison), and leveraging probabilistic inference through the TrueSkill algorithm to infer the relationship between data which have not explicitly been compared. The system was evaluated on the real-world case study of clinicians assessing motor degeneration in multiple sclerosis (MS) patients, and was shown to have an unprecedented level of consistency, exceeding widely-accepted clinical ‘gold standards’.

If you’re interested in learning more, we reported this research in detail in the following publications:

Setwise Comparison: Consistent, Scalable, Continuum Labels for Machine Learning
Advait Sarkar, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D’Souza, Peter Kontschieder, Samuel Rota Bulò, Lorcan Walsh, Christian P. Kamm, Yordan Zaykov, Abigail Sellen, Siân E. Lindley
Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI 2016) (pp. 261–271)

Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support – a proof of concept study
Saskia Steinheimer, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi, Jessica Burggraaff, Peter Kontschieder, Frank Dahlke, Abigail Sellen, Bernard M. J. Uitdehaag, Ludwig Kappos, Christian P. Kamm
Disability and Rehabilitation, 2019
(This was a writeup of our 2016 CHI paper for a medical audience)

# How To Generate Any Probability Distribution, Part 2: The Metropolis-Hastings Algorithm

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF-1(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at point x’. If P(x’) is greater than P(x), add x’ to the output sequence. Otherwise, if it is less, decide to add it to the output sequence with probability P(x’)/P(x).
3. If you have decided to add x’ to the output sequence, move to the new point and repeat the process from step 2 onwards (i.e. jump away from x’ to some x”, and if you add x”, then jump away from it to x”’ etc). If you did not add x’ to the sequence, then return to x and try to generate another x’ by jumping away again by a random amount in a random direction.

The PDF of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from x is achieved by adding some random noise to it, this is usually chosen to be a random number from a normal distribution centred at x.

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and depending on the result, either choose to stay at the new point or return to the old point (it turns out that in practice, this means choosing the lower point with probability P(x’)/P(x), and there is a proof of this which I am omitting). If you do this for an infinitely long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions much more than the lower regions, simply because you always accept upwards steps, whereas you only accept downwards steps a certain amount of the time.

A nifty trick is to not use the desired PDF  to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x) (this results in the same probability for deciding whether to accept a new point or not). Such proportional approximations are often easier to compute and can speed up the operation of the algorithm dramatically.

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte-Carlo algorithm. There are two parts to this; the first is “Markov chain” — this is simply referring to the fact that at each step of the algorithm we only consider the point we visited immediately previously; we do not remember more than just the last step we took in order to compute the next step. The second is “Monte Carlo” — this simply means that we are using randomness in the algorithm, and that the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence converges to the desired distribution as we draw more and more samples; a very small sequence may not look like it follows the desired probability distribution at all.

There is one snag with Metropolis-Hastings: it might be too slow for some applications, because it can need quite a lot of samples before the generated distribution starts to match the desired distribution. One improvement is called Hamiltonian Monte Carlo. Instead of jumping in a random direction according to a normal distribution, think of being a ball rolling around the hilly area — as it goes down slopes, it rolls faster and gathers momentum, which it loses when it climbs up slopes. In practice, Hamiltonian Monte Carlo achieves a better approximation of the desired distribution in many fewer samples than Metropolis-Hastings.

# The Shortest Bayes Classifier Tutorial You’ll Ever Read

The Bayes classifier is one of the simplest machine learning techniques. Yet despite its simplicity, it is one of the most powerful and flexible.

Being a classifier, its job is to assign a class to some input. It chooses the most likely class given the input. That is, it chooses the class that maximises $P(class | input)$.

Being a Bayes classifier, it uses Bayes’ rule to express this as the class that maximises $P(input | class)*P(class)$.

All you need to build a Bayes classifier is a dataset that allows you to empirically measure $P(class)$ and $P(input | class)$ for all combinations of input and class. You can then store these values and reuse them to calculate the most likely class for an unseen input. It’s as simple as that.

This concludes the shortest Bayes classifier tutorial you’ll ever read.

Appendix: what happened to the denominator in Bayes’ rule?

Okay, so I cheated a little bit by adding an appendix. Even so, the tutorial above is a complete description of the Bayes classifier. Those familiar with Bayes’ rule would complain that when I rephrased $P(class | input)$ as $P(input | class)*P(class)$, the denominator $P(input)$ is missing. This is correct; but since this denominator is independent of the value of class, it can safely be removed from the expression with the guarantee that the class that maximises it is the same as the class that would have maximised it if the denominator was still present. Look at it this way: say you want to find the value $x$ that maximises the function $f(x) = -x*x$. This is the same value of $x$ that maximises the function $g(x) = f(x)/5$, simply because the denominator, 5, is independent of the value of $x$. We are not interested in the actual output of $f(x)$ or $g(x)$, merely the value of $x$ that maximises either.

Appendix: the naïve Bayes classifier

The Bayes classifier above comes with a caveat, though: if you have even reasonably complicated input, procuring a dataset that allows you to reliably measure $P(input | class)$ for all unique combinations of input and class isn’t easy! For example, if you are building a binary classifier and your input consists of four features that can take on ten values each, that’s already 20,000 combinations of features and classes! A common way to remedy this problem is to regard each feature as independent of each other. That way you only need to empirically measure the likelihood of each value of each feature occurring given a certain class. You then estimate the likelihood of an entire set of features by multiplying together the likelihood of occurrence of each of its constituent feature values. This is a naïve assumption, and so results in the creation of a naïve Bayes classifier. This is also a purposely vague summary of the workings of a naïve Bayes classifier. I would recommend an Internet search for a more in-depth treatment.