# Tell, don’t show: how to teach AI

Should we teach good behaviour to Artificial Intelligence (AI) through our feedback, or should we try to tell it a set of rules explaining what good behaviour is? Both approaches have advantages and limitations, but when we tested them in a complex scenario, one of them emerged as the winner.

## If AI is the future, how will we tell it what we want it to do?

Artificial intelligence is capable of crunching through enormous datasets and providing us assistance in many facets of our lives. Indeed, it seems this is our future. An AI assistant may help you decide what gifts to buy for a friend, or what books to read, or who to meet, or what to do on the weekend. In the worst case, of course, this could be dystopian – AI controls us, and not the other way around, we’ve all heard that story – but in the best case, it could be incredibly stimulating, deeply satisfying, and profoundly liberating.

But an important and unsolved problem is that of specifying our intent, our goals, and our desires, for the AI system. Assuming we know what we want from the AI system (this is not always the case, as we’ll see later), how do we teach the system? How do we help the system learn what gifts might be good for a friend, what books we might like to read, the people we might like to meet, and the weekend activities we care about?

There are many parts to this problem, and many solutions. The solution ultimately depends on the context in which we’re teaching the AI, and the task we’re recruiting it to do for us. So in order to study this, we need a concrete problem. Luckily for me, Ruixue Liu decided to join us at Microsoft for an internship in which she explored a unique and interesting problem indeed. The problem we studied was how to teach an AI system to give us information about a meeting, where for some reason, we can’t see the meeting room.

## Our problem: eyes-free meeting participation

When people enter a meeting room, they can typically pick up several cues: Who is in the meeting? Where in the room are they? Are they seated or standing? Who is speaking? What are they doing? Research shows that not having this information can be very detrimental to meeting participation.

Unfortunately, in many modern meeting scenarios, this is exactly the situation we find ourselves in. People often join online meetings remotely without access to video, due to device limitations, poor Internet connections, or because they are engaged in parallel “eyes-busy” tasks such as driving, cooking, or going to the gym. People who are blind or low vision also describe this lack of information as a major hurdle in meetings, whether in-person or online.

We think an AI system could use cameras in meeting rooms to present this information to people who, for whatever reason, cannot see the meeting room. This information could be relayed via computer-generated speech, or special sound signals, or even through haptics. Given that the participant only has a few moments to understand this information as they join a meeting, it’s important that only the most useful information is given to the user. Does the user want to know about people’s locations? Their pose? Their clothes? What information would be useful and helpful for meeting participation?

However, what counts as ‘most useful’ varies from user to user, and context to context. One goal of the AI system is to learn this, but it can’t do so without help from the user. Here is the problem: should the user tell the system what information is most useful, by specifying a set of rules about what information they want in each scenario, or should the user give feedback to the system, saying whether or not it did a good job over the course of many meetings, with the aim of teaching it correct behaviour in the long term?

## Our study, in which we made people attend over 100 meetings

Don’t worry – luckily for the sanity of our participants, these weren’t real meetings. We created a meeting simulator which could randomly generate meeting scenarios. Each simulated meeting had a set of people – we generated names, locations (within the room), poses, whether they were speaking or not, and several other pieces of information. Because we were testing eyes-free meeting participation, we didn’t visualise this information – the objective was for the user to train the system to present a useful summary of this information in audio form.

We conducted a study in which 15 participants used two approaches to ‘train’ the system to relay the information they wanted. One approach was a rule-based programming system, where the participant could specify “if this, then that”-style rules. For example, “if the number of people in the meeting is less than 5, then tell me the names of the people in the meeting”.
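As a rough illustration of what such rules look like in program form (the `Meeting` fields and rule format below are our invention, not the study's actual interface), each rule can be modelled as a condition/action pair:

```python
from dataclasses import dataclass

@dataclass
class Meeting:
    names: list     # attendee names
    speaking: list  # names of people currently talking

# Each rule pairs a condition on the meeting with the information to relay.
rules = [
    # "if the number of people in the meeting is less than 5, tell me their names"
    (lambda m: len(m.names) < 5,
     lambda m: "Present: " + ", ".join(m.names)),
    # "if anyone is speaking, tell me who"
    (lambda m: len(m.speaking) > 0,
     lambda m: "Speaking: " + ", ".join(m.speaking)),
]

def summarise(meeting):
    """Apply every rule whose condition holds, in order, and join the results."""
    return "; ".join(render(meeting) for cond, render in rules if cond(meeting))

meeting = Meeting(names=["Ada", "Grace", "Alan"], speaking=["Grace"])
print(summarise(meeting))  # Present: Ada, Grace, Alan; Speaking: Grace
```

Note that in this simple sketch, rules are applied in list order with no precedence between conflicting rules, which is exactly the kind of ambiguity our participants ran into.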

The other approach was a feedback-based training system (our technical approach was to use a kind of machine learning called deep reinforcement learning). In the feedback-based training system, the user couldn’t say what they wanted directly, but instead, as they went to various (simulated) meetings, the system would do its best to summarise the information. After each summary, the user provided simple positive/negative feedback, answering “yes” or “no” to the question of whether they were satisfied with the summary.
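Our system used deep reinforcement learning; as a much simpler stand-in, the following bandit-style sketch shows the general shape of learning from yes/no feedback (the summary options, update rule, and parameters here are hypothetical, not our actual model):

```python
import random

random.seed(0)  # make the sketch reproducible

# The system chooses between a few fixed summary types.
options = ["names", "locations", "speaker only"]
scores = {o: 0.0 for o in options}  # learned estimate of user satisfaction

def choose(eps=0.2):
    """Mostly pick the best-scoring summary, occasionally explore."""
    if random.random() < eps:
        return random.choice(options)
    return max(options, key=lambda o: scores[o])

def update(option, satisfied, lr=0.5):
    """Nudge the chosen option's score towards +1 (yes) or -1 (no)."""
    target = 1.0 if satisfied else -1.0
    scores[option] += lr * (target - scores[option])

# Simulate a user who (secretly) only wants to hear who is speaking.
for _ in range(50):
    chosen = choose()
    update(chosen, satisfied=(chosen == "speaker only"))

print(max(scores, key=lambda o: scores[o]))  # the system converges on "speaker only"
```

Even in this toy version, the key limitation our participants described is visible: the user can only push the whole chosen summary up or down, with no way to say *which part* was wrong.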

Each participant tried both systems, one after the other in randomised order. We let participants play around, test and tweak and teach the AI as much as they liked, and try out the system’s behaviour on as many simulated meetings as they liked. Many participants “attended” well over 100 meetings, with two participants choosing to attend nearly 160 meetings over the course of the experiment! Who knew meetings could be such fun!

We asked participants to fill out a few questionnaires about their experience of interacting with both systems, and we conducted follow-up interviews to talk about their experience, too.

## Results

Participants reported significantly lower cognitive load and higher satisfaction when giving the system rules than when giving feedback. Thus, it was easier and more satisfying to tell the AI how to behave than to show it how to behave through feedback.

Rule-based programming gave participants a greater feeling of control and flexibility, but some participants found it hard at the beginning of the experiment to formulate rules from scratch. Participants also found it hard to understand how different rules worked together, and whether conflicting rules had an order of precedence (they did not).

Feedback-based teaching was seen by participants as easier, but much more imprecise. There were instances where the system did something almost correct, but because the user could only say whether the behaviour was good or bad, they did not have the tools to give more nuanced feedback to the system. Moreover, people don’t just know their preferences, they figure them out over time. With feedback-based teaching, participants worried that they were ‘misleading’ the system with poor feedback at the early stages of training, while they were still figuring out what their preferences were.

## Conclusion

Based on our results, we would recommend a rule-based programming interface. But as explained, we found several advantages and limitations to both approaches. In both cases, we found that the first step was for the human to figure out what they wanted from the system! This is hard if the user doesn’t have a clear idea of what the system can and can’t do; our first recommendation is for system designers to make this clear.

Our participants also had a hard time in both cases expressing their preferences exactly: with rules, it was because the rule-based programming language was complex, and with feedback-based teaching, it was because yes/no feedback isn’t precise enough. Our second recommendation is to make clear to users what actions they need to take to specify certain preferences.

Finally, it was difficult for participants to understand the system they finally trained; it was difficult to know what rules would apply in certain scenarios, and they also found the feedback-trained system to be unpredictable. Our third recommendation is to provide more information as to why the system does what it does in certain scenarios.

In the future, we should consider blending the two approaches, to get the best of both worlds. For example, the feedback-based system could be used to generate candidate rules, to help users form a better idea of their preferences, or detect hard-to-specify contexts. Rule-based systems could help define context, explain behaviour learnt by the system, and provide a way for specifying and editing information not captured by the feedback-trained system. We aren’t sure what this might look like, but we’re working on it. Until then, let’s aim to tell, and not show, what we want our AI to do.

Here’s a summary poem:

Yes, no, a little more
What do you want?
I can do this, this, and this
But that I can’t

Tell me and I’ll show you
What you can’t see
I’ll do my best to learn from
What you tell me

Liu, Ruixue, Advait Sarkar, Erin Solovey, and Sebastian Tschiatschek. “Evaluating Rule-based Programming and Reinforcement Learning for Personalising an Intelligent System.” In IUI Workshops. 2019. http://ceur-ws.org/Vol-2327/#ExSS

# People reluctant to use self-driving cars, survey shows

Autonomous vehicles are going to save us from traffic, emissions, and inefficient models of car ownership. But while songs of praise for self-driving cars are regularly sung in Silicon Valley, does the public really want them?

That’s what my student Charlie Hewitt, and collaborators Ioannis Politis and Theocharis Amanatidis set out to study. We decided to conduct a public opinion survey to find out.

However, we first had to solve two problems.

1. When Charlie started his work, there were no existing surveys designed specifically around autonomous vehicles. We had some surveys for technology acceptance in general, and some for cars, which were a good start. So we combined those and introduced some additional information. This resulted in a new survey designed specifically for autonomous vehicles. We called it the Autonomous Vehicle Acceptance Model, or AVAM for short.
2. When people think of self-driving cars, they generally picture a futuristic pod with no steering wheel or controls, that they just step into and get magically transported to their destination. However, the auto industry differentiates between six levels of autonomy. Previous studies had attempted to get people’s attitudes to each of these levels, but it turns out people can’t picture these different levels of autonomy very well, and don’t understand how they differ. So, Charlie created short descriptions to explain the differences between them. These vignettes are a key part of the AVAM, because they help the general public understand the implications of different levels of autonomy.

Here are the six levels of autonomous vehicles as described in our survey:

• Level 0: No Driving Automation. Your car requires you to fully control steering, acceleration/deceleration and gear changes at all times while driving. No autonomous functionality is present.
• Level 1: Driver Assistance. Your car requires you to control steering and acceleration/deceleration on most roads. On large, multi-lane highways the vehicle is equipped with cruise-control which can maintain your desired speed, or match the speed of the vehicle to that of the vehicle in front, autonomously. You are required to maintain control of the steering at all times.
• Level 2: Partial Driving Automation. Your car requires you to control steering and acceleration/deceleration on most roads. On large, multi-lane highways the vehicle is equipped with cruise-control which can maintain your desired speed, or match the speed of the vehicle to that of the vehicle in front, autonomously. The car can also follow the highway’s lane markings and change between lanes autonomously, but may require you to retake control with little or no warning in emergency situations.
• Level 3: Conditional Driving Automation. Your car can drive partially autonomously on large, multi-lane highways. You must manually steer and accelerate/decelerate when on minor roads, but upon entering a highway the car can take control and steer, accelerate/decelerate and switch lanes as appropriate. The car is aware of potential emergency situations, but if it encounters a confusing situation which it cannot handle autonomously then you will be alerted and must retake control within a few seconds. Upon reaching the exit of the highway the car indicates that you must retake control of the steering and speed control.
• Level 4: High Driving Automation. Your car can drive fully autonomously only on large, multi-lane highways. You must manually steer and accelerate/decelerate when on minor roads, but upon entering a highway the car can take full control and can steer, accelerate/decelerate and switch lanes as appropriate. The car does not rely on your input at all while on the highway. Upon reaching the exit of the highway the car indicates that you must retake control of the steering and speed control.
• Level 5: Full Driving Automation. Your car is fully autonomous. You are able to get into the car and instruct it where you would like to travel to, the car then carries out your desired route with no further interaction required from you. There are no steering or speed controls as driving occurs without any interaction from you.

Before you read on, think about each of those levels. What do you think are the advantages and disadvantages of each? Which would you be comfortable with and why?

We sent our survey to 187 drivers recruited from across the USA, and here’s what we found:

## Result 1: our respondents were not ready to accept autonomous vehicles.

We found that on many measures, people report a lower acceptance of higher automation levels. People perceive higher autonomy levels as less safe, report lower intent to use them, and report greater anxiety about them.

We compared some of the results with those from an earlier study, conducted in 2014. We had to make some simplifying assumptions, as the 2014 study wasn’t conducted with the AVAM. However, we still found that our results were mostly similar: both studies found that people (unsurprisingly) expected to have to do less as the level of autonomy increased. Both studies also found that people showed lower intent to use higher autonomy vehicles, and poorer general attitude towards higher autonomy. Self-driving cars seem to be suffering in public opinion!

## Result 2: the biggest leap in user perception comes with full autonomy.

We asked people how much they would expect to have to use their hands, feet and eyes while using a vehicle at each level of autonomy. Even though vehicles at the intermediate levels of autonomy (3 and 4) can do significantly more than levels 1 and 2, people did not perceive the higher levels as requiring significantly less engagement. However, at level 5 (full autonomy), there was a dramatic drop in expected engagement. This was an interesting and new finding (albeit not entirely surprising). One explanation for this is that people only really perceive two levels of autonomy: partial and full, and don’t really care about the minor differences in experience with different levels of partial autonomy.

All in all, we were fascinated to learn about people’s attitudes to self-driving cars. Despite the enthusiasm displayed by the tech media, there seems to be a consistent concern around their safety and reluctance to adopt amongst the general public. Even if self-driving cars really do end up being safer and better in many other ways than regular cars, automakers will still face this challenge of public perception.

And now, a summary poem:

The iron beast has come alive,
We do not want it, do not want it
Its promises we do not prize
It does not do as we see fit

Only when we can rely
On iron beast with its own eye
Only then will we concede
And disaffection yield to need

If you’re interested in using our questionnaire or our data, please reach out! I’d love to help you build on our research.

Charlie Hewitt, Ioannis Politis, Theocharis Amanatidis, and Advait Sarkar. 2019. Assessing public perception of self-driving cars: the autonomous vehicle acceptance model. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI ’19). ACM, New York, NY, USA, 518-527. DOI: https://doi.org/10.1145/3301275.3302268

# Human language isn’t the best way to chat with Siri or Alexa, probably

The year is 2019. Voice-controlled digital assistants are great at simple commands such as “set a timer…” and “what’s the weather?”, but frustratingly little else.

Human language seems to be an ideal interface for computer systems; it is infinitely flexible and the user already knows how to use it! But there are drawbacks. Computer systems that aim to understand arbitrary language are really hard to build, and they also create unrealistic expectations of what the system can do, resulting in user confusion and disappointment.

The next frontier for voice assistants is complex dialogue in challenging domains such as managing schedules, analysing data, and controlling robots. The next generation of systems must learn to map ambiguous human language to precise computer instructions. The mismatch between user expectations and system capabilities is only worsened in these scenarios.

What if we could preserve the familiarity of natural language, while better managing user expectations and simplifying the system to boot? That’s exactly what my student Jesse Mu set out to study. The idea was to use what we called a restricted language interface, one that is a well-chosen subset of full natural language.

Jesse designed an experiment where participants played an interactive computer game called SHRDLURN. In this game, the player is given a set of blocks of different colours, and a “goal”, which is the winning arrangement of blocks. The player types instructions to the computer such as “remove the red blocks” and the computer tries to execute the instruction. The interesting bit is that the computer doesn’t understand language to begin with. In response to a player instruction, it presents the player with a list of block arrangements, and the player picks the arrangement that fits their instructions. Over time, the computer learns to associate instructions with the correct moves, and the correct configuration starts appearing higher up in the list. The system is perfectly trained when the first guess on its list is always the one the player intended.

Sixteen participants took part in our experiment. Half of them played the game with no restriction, but the other half were given specific instructions: they were only allowed to use the following 11 words: all, cyan, red, brown, orange, except, leftmost, rightmost, add, remove, to.
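Checking an instruction against that restricted vocabulary is trivial to implement, which is part of the appeal; here is a minimal sketch (the function name is ours, not from the study):

```python
# The 11-word vocabulary given to the restricted-language participants.
ALLOWED = {"all", "cyan", "red", "brown", "orange", "except",
           "leftmost", "rightmost", "add", "remove", "to"}

def is_restricted(instruction):
    """True iff every word of the instruction is in the allowed vocabulary."""
    return all(word in ALLOWED for word in instruction.lower().split())

print(is_restricted("remove leftmost red"))    # True
print(is_restricted("remove the red blocks"))  # False: "the" and "blocks" are not allowed
```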

We measured the quality of the final system (i.e., how successfully the computer learnt to map language to instructions) as well as the cognitive load on participants. We found, unsurprisingly, that in the non-restricted setting people used a much wider variety of words, and much longer sentences. However, the restricted language participants seemed to be able to train their systems more effectively. Participants in the restricted language setting also reported needing less effort, and perceived their performance to be higher.

By imposing restrictions, we achieved the same or better system performance, without detriment to the user experience – indeed, participants reported lower effort and higher performance. We think that a guided, consistent language helps users understand the limitations of a system. That’s not to say we’ll never desire a system that understands arbitrary human language. But given the current capabilities of AI systems, we will see diminishing returns in user experience and performance by attempting to accommodate arbitrary natural language input. Rather than considering one of two extremes – a specialised graphical user interface vs a completely natural language interface – designers should consider restricted language interfaces which trade off full expressiveness for simplicity, learnability and consistency.

Here’s a summary in the form of a poem:

It was not meant to be this way
You cannot understand

This human dance of veiled intent
The spoken word and written hand

But let us meet at halfway point
And share our thoughts with less

To know each other’s will and wish
— not guess

Mu, Jesse, and Advait Sarkar. “Do We Need Natural Language?: Exploring Restricted Language Interfaces for Complex Domains.” In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, p. LBW2822. ACM, 2019. https://dl.acm.org/citation.cfm?doid=3290607.3312975

# Talking to a bot might help with depression, but you won’t enjoy the conversation

Mental illness is a significant contributor to the global health burden. Cognitive Behavioural Therapy (CBT) provided by a trained therapist is effective. But CBT is not an option for many people who cannot travel long distances, or take the time away from work, or simply cannot afford to visit a therapist.

To provide more scalable and accessible treatment, we could use Artificial Intelligence-driven chatbots to provide a therapy session. It might not (currently) be as effective as a human therapist, but it is likely to be better than no treatment at all. At least one study of a chatbot therapist has shown limited but positive clinical outcomes.

My student Samuel Bell and I were interested in finding out whether chatbot-based therapy could be effective not just clinically, but also in terms of how patients felt during the sessions. Clinical efficacy is only one marker of a good therapy session. Others include sharing ease (i.e., does the patient feel able to confide in the therapist), smoothness of conversation, perceived usefulness, and enjoyment.

To find out, we conducted a study. Ten participants with sub-clinical stress symptoms took part in two 30-minute therapy sessions. Five participants had their sessions with a human therapist, conducted via chat through an internet-based CBT interface. The other five had therapy sessions with a simulated chatbot, through the same interface. At the end of the study, all participants completed a questionnaire about their experience.

We found that in terms of sharing ease and perceived usefulness, neither the human nor the simulated chatbot emerged the clear winner, although participants’ remarks suggested that they found the chatbot less useful. In terms of smoothness of conversation and enjoyment, the chatbot was clearly worse.

Participants felt that the chatbot had a poor ability to “read between the lines”, and they felt that their comments were often ignored. One participant explained their dissatisfaction:

“It was a repetition of what I said, not an expansion of what I said.”

Another participant commented on the lack of shared experience:

“When you tell something to someone, it’s better, because they might have gone through something similar… there’s no sense that the robot cares or understands or empathises.”

Our study has a small sample size, but nonetheless points to clear deficiencies in chatbot-based therapy. We suggest that future research into chatbot CBT acknowledges and explores these areas of conversational recall, empathy, and the challenge of shared experience, in the hope that we may benefit from scalable, accessible therapy where needed.

Bell, Samuel, Clara Wood, and Advait Sarkar. “Perceptions of Chatbots in Therapy.” In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, p. LBW1712. ACM, 2019. https://dl.acm.org/citation.cfm?id=3313072

# Setwise Comparison: a faster, more consistent way to make judgements

I originally wrote this post in 2016 for the Sparrho blog.

Have you ever wondered whether doctors are consistent in their judgements? In some cases, they really aren’t. When asked to rate videos of patients with multiple sclerosis (a disease that causes impaired movement) on a numeric scale from 0 being completely healthy to 4 being severely impaired, clinicians struggled to be consistent, often giving the same patient different scores at different times, and disagreeing amongst themselves. This difficulty is quite common, and not unique to doctors — people often have to assign scores to difficult, abstract concepts, such as “How good was a musical performance?” or “How much do you agree or disagree with this statement?” Time and time again, it has been shown through research that people are fundamentally inconsistent at this type of activity, no matter the setting or level of expertise.

The field of ‘machine learning’, which can help to automate such scoring (e.g. automatically rating patients according to their disability), is based on the premise that we can give the computer a set of examples for which the score is known, in the hope that the computer can use these to ‘learn’ how to assign scores to new, unseen examples. But if the computer is taught from examples where the score is inconsistently assigned, it learns to assign inconsistent, unusable scores to new, unseen examples.

To solve this problem, we brought together an understanding of how humans work with some mathematical tricks. The fundamental insight is that it is easier and more consistent for humans to provide preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”).

The problem is, even if you have as few as 50 items to assign scores, you already have 50 x 49 = 2450 ways of pairing them together. This balloons to nearly 10,000 comparisons when you have 100 items. Clearly, this doesn’t scale.

So we scale this using a mathematical insight: namely, that if you’ve compared A to B, and B to C, you can guess with reasonably high accuracy what the relationship is between A and C. This ‘guessing’ is done with a computer algorithm called TrueSkill, which was originally invented to help rank people playing multiplayer games by their skill, so that they could be better matched to online opponents. Using TrueSkill, we can reduce the number of comparisons required by a significant amount, so that increasing the number of items no longer results in a huge increase in comparisons.

This study has advanced our understanding of how people quantify difficult concepts, and has presented a new method which balances the strengths of people and computers to help people efficiently and consistently provide scores to many items.
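TrueSkill itself maintains a Gaussian belief over each item's score; as a simplified illustration of the underlying idea (inferring a comparison you never made, by transitivity), here is an Elo-style sketch. The update rule below is a stand-in of our own, not TrueSkill's actual inference:

```python
import math

# One rating per item; TrueSkill would track a mean and an uncertainty instead.
ratings = {"A": 0.0, "B": 0.0, "C": 0.0}

def observe(winner, loser, k=1.0):
    """Update ratings after a single preference judgement: winner > loser."""
    # Expected probability that `winner` beats `loser` under current ratings.
    p = 1.0 / (1.0 + math.exp(ratings[loser] - ratings[winner]))
    ratings[winner] += k * (1.0 - p)
    ratings[loser]  -= k * (1.0 - p)

# A few repeated judgements of just two pairs: A > B, and B > C.
for _ in range(20):
    observe("A", "B")
    observe("B", "C")

# A and C were never compared directly, yet the model ranks A above C.
print(ratings["A"] > ratings["C"])  # True
```

This is why the number of judgements people must actually provide can grow far more slowly than the number of possible pairs.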

### Why is this important for researchers in fields other than computer vision?

This study shows a new way to quickly and consistently have humans rate items on a continuous scale (e.g. “rate the happiness of the individual in this picture on a scale of 1 to 5”). It works through the use of preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”), combined with an algorithmic ranking system which can reduce the need to compare every item with every other item. This was initially motivated by the need to have higher-quality labels for machine learning systems, but can be applied in any domain where humans have difficulty placing items along a scale. In our study we showed that clinicians can use our method to achieve far higher consistency than was previously thought possible in their assessment of motor illness.

We built a nifty tool to help clinicians perform Setwise Comparison, which you can see in the video below: https://www.youtube.com/watch?v=Q1hW-UXU3YE

### Why is this important for researchers in the same field?

This study describes a novel method for efficiently eliciting high-consistency continuous labels, which can be used as training data for machine learning systems, when the concept being labelled has unclear boundaries — a common scenario in several machine learning domains, such as affect recognition, automated sports coaching, and automated disease assessment. Label consistency is improved through the use of preference judgements, that is, labellers sort training data on a continuum, rather than providing absolute value judgements. Efficiency is improved through the use of comparison in sets (as opposed to pairwise comparison), and leveraging probabilistic inference through the TrueSkill algorithm to infer the relationship between data which have not explicitly been compared. The system was evaluated on the real-world case study of clinicians assessing motor degeneration in multiple sclerosis (MS) patients, and was shown to have an unprecedented level of consistency, exceeding widely-accepted clinical ‘gold standards’.

If you’re interested in learning more, we reported this research in detail in the following publications:

Setwise Comparison: Consistent, Scalable, Continuum Labels for Machine Learning
Advait Sarkar, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D’Souza, Peter Kontschieder, Samuel Rota Bulò, Lorcan Walsh, Christian P. Kamm, Yordan Zaykov, Abigail Sellen, Siân E. Lindley
Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI 2016) (pp. 261–271)

Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support – a proof of concept study
Saskia Steinheimer, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi, Jessica Burggraaff, Peter Kontschieder, Frank Dahlke, Abigail Sellen, Bernard M. J. Uitdehaag, Ludwig Kappos, Christian P. Kamm
Disability and Rehabilitation, 2019
(This was a writeup of our 2016 CHI paper for a medical audience)

# How To Generate Any Probability Distribution, Part 2: The Metropolis-Hastings Algorithm

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF⁻¹(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.
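For a concrete example of inverse transform sampling: the exponential distribution with rate lam has CDF F(x) = 1 - exp(-lam·x), which inverts to F⁻¹(u) = -ln(1 - u)/lam, so we can sample it from uniform draws alone:

```python
import math
import random

def sample_exponential(lam, n, rng=random.Random(42)):
    """Draw n samples from Exp(lam) via inverse transform sampling."""
    # u ~ Uniform(0, 1); applying the inverse CDF maps it to Exp(lam).
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=100_000)
print(sum(samples) / len(samples))  # close to the true mean, 1/lam = 0.5
```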

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at point x’. If P(x’) is greater than P(x), accept x’ and add it to the output sequence. Otherwise, if it is less, accept it with probability P(x’)/P(x).
3. If you accepted x’, move to the new point and repeat the process from step 2 onwards (i.e. jump away from x’ to some x”, and if you accept x”, then jump away from it to x”’ etc). If you rejected x’, record the current point x in the output sequence again (this repetition is what gives high-probability points their extra weight), and try another random jump away from x.

The PDF of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. “Jumping away” from x is achieved by adding random noise to it; the noise is usually drawn from a normal distribution centred at zero, so that the proposed point is normally distributed around x.

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower, you flip a biased coin and, depending on the result, either stay at the new point or return to the old one; it turns out that this corresponds to accepting the lower point with probability P(x’)/P(x) (there is a proof of this, which I am omitting). Either way, wherever you end up counts as another visit. If you do this for an infinitely long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions much more than the lower ones, simply because you always accept upwards steps, whereas you only accept downwards steps a certain amount of the time.

A nifty trick is not to use the desired PDF to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x). Because the acceptance rule only ever uses the ratio P(x’)/P(x), which equals f(x’)/f(x), this results in exactly the same decisions about whether to accept a new point. Such proportional approximations are often easier to compute and can speed up the operation of the algorithm dramatically.

You may have heard the Metropolis algorithm referred to as a Markov chain Monte Carlo (MCMC) algorithm. There are two parts to this name. The first is “Markov chain”: at each step of the algorithm we only consider the point we visited immediately previously; we do not remember anything earlier than the last step in order to compute the next one. The second is “Monte Carlo”: we are using randomness in the algorithm, and the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence converges to the desired distribution as we draw more and more samples; a very short sequence may not look like it follows the desired probability distribution at all.

There is one snag with Metropolis-Hastings: it might be too slow for some applications, because it can need a great many samples before the generated distribution starts to match the desired one. One improvement is called Hamiltonian Monte Carlo. Instead of jumping in a random direction according to a normal distribution, imagine being a ball rolling around the hilly region: as it rolls down slopes it speeds up and gathers momentum, which it loses again as it climbs. In practice, Hamiltonian Monte Carlo achieves a good approximation of the desired distribution in far fewer samples than Metropolis-Hastings.
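To make the ball analogy concrete, here is a minimal one-dimensional Hamiltonian Monte Carlo sketch (my own; the step size and leapfrog count are arbitrary choices, not values from the post). The target is supplied as a log-density and its gradient, which is what lets the “ball” feel the slope:

```python
import math
import random

def hamiltonian_mc(logp, grad_logp, x0, n_samples, step=0.1, n_leapfrog=20, seed=0):
    """One-dimensional Hamiltonian Monte Carlo sketch."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Give the ball a random push (momentum), then let it roll.
        p = rng.gauss(0.0, 1.0)
        x_new, p_new = x, p
        # Leapfrog integration: half momentum step, alternating full
        # position/momentum steps, closing half momentum step.
        p_new += 0.5 * step * grad_logp(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new += step * grad_logp(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_logp(x_new)
        # A Metropolis-style acceptance test corrects the integration error.
        h_old = -logp(x) + 0.5 * p * p
        h_new = -logp(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples

# Target: standard normal, via its log-density and gradient.
draws = hamiltonian_mc(lambda x: -x * x / 2.0, lambda x: -x, 0.0, 10_000)
```

Because each proposal travels a whole trajectory rather than one noisy hop, successive samples are far less correlated than in plain Metropolis-Hastings.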

# The Shortest Bayes Classifier Tutorial You’ll Ever Read

The Bayes classifier is one of the simplest machine learning techniques. Yet despite its simplicity, it is one of the most powerful and flexible.

Being a classifier, its job is to assign a class to some input. It chooses the most likely class given the input. That is, it chooses the class that maximises $P(class | input)$.

Being a Bayes classifier, it uses Bayes’ rule to express this as the class that maximises $P(input | class)*P(class)$.

All you need to build a Bayes classifier is a dataset that allows you to empirically measure $P(class)$ and $P(input | class)$ for all combinations of input and class. You can then store these values and reuse them to calculate the most likely class for an unseen input. It’s as simple as that.
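A minimal sketch of this in Python (the toy weather data and names are mine, for illustration): the probabilities are measured by counting over the dataset, then reused at prediction time.

```python
from collections import Counter

def train_bayes(examples):
    """Build a Bayes classifier from (input, class) pairs by counting."""
    n = len(examples)
    class_counts = Counter(c for _, c in examples)
    joint_counts = Counter(examples)
    # Empirical P(class) and P(input | class).
    p_class = {c: class_counts[c] / n for c in class_counts}
    p_input_given_class = {
        (x, c): joint_counts[x, c] / class_counts[c] for x, c in joint_counts
    }

    def classify(x):
        # Choose the class maximising P(input | class) * P(class).
        return max(
            p_class,
            key=lambda c: p_input_given_class.get((x, c), 0.0) * p_class[c],
        )

    return classify

clf = train_bayes([
    ("sunny", "play"), ("sunny", "play"), ("sunny", "stay"),
    ("rainy", "stay"), ("rainy", "stay"),
])
```

Training is nothing more than building two count tables; classification is a lookup and a multiplication per class.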

This concludes the shortest Bayes classifier tutorial you’ll ever read.

## Appendix: what happened to the denominator in Bayes’ rule?

Okay, so I cheated a little bit by adding an appendix. Even so, the tutorial above is a complete description of the Bayes classifier. Those familiar with Bayes’ rule might complain that when I rephrased $P(class | input)$ as $P(input | class)*P(class)$, the denominator $P(input)$ went missing. This is correct; but since the denominator is independent of the value of class, it can safely be dropped, with the guarantee that the class maximising the shortened expression is the same class that would have maximised the full one. Look at it this way: say you want to find the value $x$ that maximises the function $f(x) = -x*x$. This is the same value of $x$ that maximises the function $g(x) = f(x)/5$, simply because the denominator, 5, is independent of the value of $x$. We are not interested in the actual output of $f(x)$ or $g(x)$, merely the value of $x$ that maximises either.

## Appendix: the naïve Bayes classifier

The Bayes classifier above comes with a caveat, though: if you have even reasonably complicated input, procuring a dataset that allows you to reliably measure $P(input | class)$ for every unique combination of input and class isn’t easy! For example, if you are building a binary classifier and your input consists of four features that can take on ten values each, that’s already 20,000 combinations of features and classes. A common way to remedy this problem is to regard the features as conditionally independent of one another given the class. That way, you only need to empirically measure the likelihood of each value of each feature occurring given a certain class; you then estimate the likelihood of an entire set of features by multiplying together the likelihoods of its constituent feature values. This is a naïve assumption, and so it results in a naïve Bayes classifier. This is also a purposely vague summary of its workings; I would recommend an Internet search for a more in-depth treatment.
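A minimal sketch of the naïve version (again with toy data of my own): per-feature likelihoods are counted separately, then multiplied together at prediction time.

```python
from collections import Counter

def train_naive_bayes(examples):
    """examples: (feature_tuple, class) pairs. Features are treated as
    conditionally independent given the class -- the naive assumption."""
    n = len(examples)
    class_counts = Counter(c for _, c in examples)
    # feature_counts[i, value, c]: times feature i equalled value in class c.
    feature_counts = Counter()
    for features, c in examples:
        for i, value in enumerate(features):
            feature_counts[i, value, c] += 1

    def classify(features):
        def score(c):
            p = class_counts[c] / n  # P(class)
            for i, value in enumerate(features):
                # P(feature_i = value | class), measured per feature.
                p *= feature_counts[i, value, c] / class_counts[c]
            return p
        return max(class_counts, key=score)

    return classify

clf = train_naive_bayes([
    (("sunny", "mild"), "play"), (("sunny", "mild"), "play"),
    (("sunny", "hot"), "stay"), (("rainy", "mild"), "stay"),
])
```

Note the table sizes: instead of one entry per combination of feature values, we store one entry per individual feature value per class, which is what makes the naïve version feasible on realistic data.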