Tell, don’t show: how to teach AI

Should we teach good behaviour to Artificial Intelligence (AI) through our feedback, or should we try to tell it a set of rules explaining what good behaviour is? Both approaches have advantages and limitations, but when we tested them in a complex scenario, one of them emerged as the winner.

If AI is the future, how will we tell it what we want it to do?

Artificial intelligence is capable of crunching through enormous datasets and providing us assistance in many facets of our lives. Indeed, it seems this is our future. An AI assistant may help you decide what gifts to buy for a friend, or what books to read, or who to meet, or what to do on the weekend. In the worst case, of course, this could be dystopian – AI controls us, and not the other way around, we’ve all heard that story – but in the best case, it could be incredibly stimulating, deeply satisfying, and profoundly liberating.

But an important and unsolved problem is that of specifying our intent, our goals, and our desires to the AI system. Assuming we know what we want from the AI system (this is not always the case, as we’ll see later), how do we teach the system? How do we help the system learn what gifts might be good for a friend, what books we might like to read, the people we might like to meet, and the weekend activities we care about?

There are many parts to this problem, and many solutions. The solution ultimately depends on the context in which we’re teaching the AI, and the task we’re recruiting it to do for us. So in order to study this, we need a concrete problem. Luckily for me, Ruixue Liu decided to join us at Microsoft for an internship in which she explored a unique and interesting problem indeed. The problem we studied was how to teach an AI system to give us information about a meeting when, for some reason, we can’t see the meeting room.

Our problem: eyes-free meeting participation

When people enter a meeting room, they can typically pick up several cues: Who is in the meeting? Where in the room are they? Are they seated or standing? Who is speaking? What are they doing? Research shows that not having this information can be very detrimental to meeting participation.

Unfortunately, in many modern meeting scenarios, this is exactly the situation we find ourselves in. People often join online meetings remotely without access to video, due to device limitations, poor Internet connections, or because they are engaged in parallel “eyes-busy” tasks such as driving, cooking, or going to the gym. People who are blind or low vision also describe this lack of information as a major hurdle in meetings, whether in-person or online.

We think an AI system could use cameras in meeting rooms to present this information to people who, for whatever reason, cannot see the meeting room. This information could be relayed via computer-generated speech, or special sound signals, or even through haptics. Given that the participant only has a few moments to understand this information as they join a meeting, it’s important that only the most useful information is given to the user. Does the user want to know about people’s locations? Their pose? Their clothes? What information would be useful and helpful for meeting participation?

However, what counts as ‘most useful’ varies from user to user, and context to context. One goal of the AI system is to learn this, but it can’t do so without help from the user. Here is the problem: should the user tell the system what information is most useful, by specifying a set of rules about what information they want in each scenario, or should the user give feedback to the system, saying whether or not it did a good job over the course of many meetings, with the aim of teaching it correct behaviour in the long term?

Our study, in which we made people attend over 100 meetings

Don’t worry – luckily for the sanity of our participants, these weren’t real meetings. We created a meeting simulator which could randomly generate meeting scenarios. Each simulated meeting had a set of people – we generated names, locations (within the room), poses, whether they were speaking or not, and several other pieces of information. Because we were testing eyes-free meeting participation, we didn’t visualise this information – the objective was for the user to train the system to present a useful summary of this information in audio form.
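
To give a flavour of what the simulator produced, here is a minimal sketch of a random meeting generator in Python; the attribute names and values are our own illustration of the kind of information generated, not the exact ones used in the study.

```python
import random

NAMES = ["Asha", "Ben", "Carla", "Dev", "Elena", "Farid", "Grace", "Hiro"]
LOCATIONS = ["near the door", "by the window", "at the table", "at the whiteboard"]
POSES = ["seated", "standing"]

def generate_meeting(max_people=6):
    """Randomly generate one simulated meeting: a list of attendees,
    each with a name, a location in the room, a pose, and a speaking flag."""
    attendees = []
    for name in random.sample(NAMES, k=random.randint(2, max_people)):
        attendees.append({
            "name": name,
            "location": random.choice(LOCATIONS),
            "pose": random.choice(POSES),
            "speaking": random.random() < 0.2,  # roughly one in five is talking
        })
    return attendees

meeting = generate_meeting()
```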

We conducted a study in which 15 participants used two approaches to ‘train’ the system to relay the information they wanted. One approach was a rule-based programming system, where the participant could specify “if this, then that”-style rules. For example, “if the number of people in the meeting is less than 5, then tell me the names of the people in the meeting”.
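
As a rough sketch of how such a rule might be represented (the structure and function names below are ours, not the study's actual rule language), each rule can be a condition paired with an action over the simulated meeting data:

```python
# A simulated meeting, as produced by a generator like the one sketched earlier.
meeting = [
    {"name": "Asha", "speaking": True},
    {"name": "Ben", "speaking": False},
    {"name": "Carla", "speaking": False},
]

def fewer_than_five_people(meeting):
    # Condition: "if the number of people in the meeting is less than 5..."
    return len(meeting) < 5

def tell_me_the_names(meeting):
    # Action: "...then tell me the names of the people in the meeting."
    return "In this meeting: " + ", ".join(person["name"] for person in meeting)

rules = [(fewer_than_five_people, tell_me_the_names)]

def summarise(meeting, rules):
    """Apply every rule whose condition holds, and join the resulting phrases."""
    return ". ".join(action(meeting) for condition, action in rules if condition(meeting))

print(summarise(meeting, rules))
```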

The other approach was a feedback-based training system (our technical approach was to use a kind of machine learning called deep reinforcement learning). In the feedback-based training system, the user couldn’t say what they wanted directly, but instead, as they went to various (simulated) meetings, the system would do its best to summarise the information. After each summary, the user provided simple positive/negative feedback, answering “yes” or “no” to the question of whether they were satisfied with the summary.
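
The system itself used deep reinforcement learning; purely to illustrate how yes/no answers become a training signal, here is a toy sketch in which the system keeps a score for each candidate summary style and nudges it up or down after every meeting. The summary styles and the update rule are our simplification, not the study's actual model.

```python
import random

# Candidate summary styles the system can choose between (simplified).
templates = ["names only", "names and locations", "speaker only", "full summary"]
scores = {t: 0.0 for t in templates}
EPSILON = 0.2        # how often to try a random style rather than the best one
LEARNING_RATE = 0.5

def choose_template():
    """Mostly pick the best-scoring style so far, but sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(templates)
    return max(scores, key=scores.get)

def update(template, satisfied):
    """Turn the user's yes/no answer into a +1/-1 reward and update the score."""
    reward = 1.0 if satisfied else -1.0
    scores[template] += LEARNING_RATE * (reward - scores[template])

# One simulated round: the system summarises a meeting, the user answers yes or no.
chosen = choose_template()
update(chosen, satisfied=True)
```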

Each participant tried both systems, one after the other in randomised order. We let participants play around, test and tweak and teach the AI as much as they liked, and try out the system’s behaviour on as many simulated meetings as they liked. Many participants “attended” well over 100 meetings, with two participants choosing to attend nearly 160 meetings over the course of the experiment! Who knew meetings could be such fun!

We asked participants to fill out a few questionnaires about their experience of interacting with both systems, and we conducted follow-up interviews to talk about their experience, too.

Results

Participants reported significantly lower cognitive load and higher satisfaction when giving the system rules than when giving feedback. Thus, it was easier and more satisfying to tell the AI how to behave than to show it how to behave through feedback.

Rule-based programming gave participants a greater feeling of control and flexibility, but some participants found it hard at the beginning of the experiment to formulate rules from scratch. Participants also found it hard to understand how different rules worked together, and whether conflicting rules had an order of precedence (they did not).

Feedback-based teaching was seen by participants as easier, but much less precise. There were instances where the system did something almost correct, but because the user could only say whether the behaviour was good or bad, they did not have the tools to give more nuanced feedback to the system. Moreover, people don’t simply know their preferences; they figure them out over time. With feedback-based teaching, participants worried that they were ‘misleading’ the system with poor feedback at the early stages of training, while they were still figuring out what their preferences were.

Conclusion

Based on our results, we would recommend a rule-based programming interface. But as explained, we found several advantages and limitations to both approaches. In both cases, we found that the first step was for the human to figure out what they wanted from the system! This is hard if the user doesn’t have a clear idea of what the system can and can’t do; our first recommendation is for system designers to make this clear.

Our participants also had a hard time in both cases expressing their preferences exactly: with rules, it was because the rule-based programming language was complex, and with feedback-based teaching, it was because yes/no feedback isn’t precise enough. Our second recommendation is to make clear to users what actions they need to take to specify certain preferences.

Finally, participants found it difficult to understand the system they had ultimately trained: with rules, it was hard to know which ones would apply in a given scenario, and the feedback-trained system was seen as unpredictable. Our third recommendation is to provide more information as to why the system does what it does in certain scenarios.

In the future, we should consider blending the two approaches, to get the best of both worlds. For example, the feedback-based system could be used to generate candidate rules, to help users form a better idea of their preferences, or to detect hard-to-specify contexts. Rule-based systems could help define context, explain behaviour learnt by the system, and provide a way to specify and edit information not captured by the feedback-trained system. We aren’t sure what this might look like, but we’re working on it. Until then, let’s aim to tell, and not show, what we want our AI to do.

Here’s a summary poem:

Yes, no, a little more
What do you want?
I can do this, this, and this
But that I can’t

Tell me and I’ll show you
What you can’t see
I’ll do my best to learn from
What you tell me

Want to learn more? Read our study here (click to download PDF), and see the publication details below:

Liu, Ruixue, Advait Sarkar, Erin Solovey, and Sebastian Tschiatschek. “Evaluating Rule-based Programming and Reinforcement Learning for Personalising an Intelligent System.” In IUI Workshops. 2019. http://ceur-ws.org/Vol-2327/#ExSS

Setwise Comparison: a faster, more consistent way to make judgements

I originally wrote this post in 2016 for the Sparrho blog.

Have you ever wondered whether doctors are consistent in their judgements? In some cases, they really aren’t. When asked to rate videos of patients with multiple sclerosis (a disease that causes impaired movement) on a numeric scale from 0 being completely healthy to 4 being severely impaired, clinicians struggled to be consistent, often giving the same patient different scores at different times, and disagreeing amongst themselves. This difficulty is quite common, and not unique to doctors — people often have to assign scores to difficult, abstract concepts, such as “How good was a musical performance?” or “How much do you agree or disagree with this statement?” Time and time again, it has been shown through research that people are fundamentally inconsistent at this type of activity, no matter the setting or level of expertise.

The field of ‘machine learning’, which can help to automate such scoring (e.g. automatically rating patients according to their disability), is based on the idea that we can give the computer a set of examples for which the score is known, in the hope that the computer can use these to ‘learn’ how to assign scores to new, unseen examples. But if the computer is taught from examples where the score is inconsistently assigned, the result is that the computer learns to assign inconsistent, unusable scores to new, unseen examples.

To solve this problem, we brought together an understanding of how humans work with some mathematical tricks. The fundamental insight is that it is easier and more consistent for humans to provide preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”). The problem is, even if you have as few as 50 items to score, you already have 50 × 49 = 2,450 ways of pairing them together. This balloons to nearly 10,000 comparisons when you have 100 items. Clearly, this doesn’t scale.

We address this with a mathematical insight: namely, that if you’ve compared A to B, and B to C, you can guess with reasonably high accuracy what the relationship is between A and C. This ‘guessing’ is done with a computer algorithm called TrueSkill, which was originally invented to help rank people playing multiplayer games by their skill, so that they could be better matched to online opponents. Using TrueSkill, we can reduce the number of comparisons required by a significant amount, so that increasing the number of items no longer results in a huge increase in comparisons. This study has advanced our understanding of how people quantify difficult concepts, and has presented a new method which balances the strengths of people and computers to help people efficiently and consistently provide scores to many items.
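
As a minimal sketch of the idea, the open-source trueskill Python package keeps a rating (a mean and an uncertainty) for each item and updates it after every judgement; items that were never directly compared can then be ordered by their inferred ratings. The items and judgements below are invented for illustration.

```python
from trueskill import Rating, rate_1vs1  # pip install trueskill

# One rating per item to be scored.
items = {"A": Rating(), "B": Rating(), "C": Rating()}

# The judge rated A above B, and B above C.
items["A"], items["B"] = rate_1vs1(items["A"], items["B"])
items["B"], items["C"] = rate_1vs1(items["B"], items["C"])

# A and C were never compared directly, but their inferred ratings
# already order all three items: A, then B, then C.
ranking = sorted(items, key=lambda name: items[name].mu, reverse=True)
print(ranking)
```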

Why is this important for researchers in fields other than computer vision?

This study shows a new way to quickly and consistently have humans rate items on a continuous scale (e.g. “rate the happiness of the individual in this picture on a scale of 1 to 5”). It works through the use of preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”), combined with an algorithmic ranking system which can reduce the need to compare every item with every other item. This was initially motivated by the need to have higher-quality labels for machine learning systems, but can be applied in any domain where humans have difficulty placing items along a scale. In our study we showed that clinicians can use our method to achieve far higher consistency than was previously thought possible in their assessment of motor illness.

We built a nifty tool to help clinicians perform Setwise Comparison, which you can see in the video below: https://www.youtube.com/watch?v=Q1hW-UXU3YE

Why is this important for researchers in the same field?

This study describes a novel method for efficiently eliciting high-consistency continuous labels, which can be used as training data for machine learning systems, when the concept being labelled has unclear boundaries — a common scenario in several machine learning domains, such as affect recognition, automated sports coaching, and automated disease assessment. Label consistency is improved through the use of preference judgements, that is, labellers sort training data on a continuum, rather than providing absolute value judgements. Efficiency is improved through the use of comparison in sets (as opposed to pairwise comparison), and leveraging probabilistic inference through the TrueSkill algorithm to infer the relationship between data which have not explicitly been compared. The system was evaluated on the real-world case study of clinicians assessing motor degeneration in multiple sclerosis (MS) patients, and was shown to have an unprecedented level of consistency, exceeding widely-accepted clinical ‘gold standards’.
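
To sketch the set-based variant: the same trueskill package accepts an entire ranked group in one call, so a labeller who sorts a small set of items in one pass contributes several comparisons' worth of information at once. The items and ordering below are invented for illustration.

```python
from trueskill import Rating, rate  # pip install trueskill

# Four videos drawn into one set; the labeller sorts them along the continuum.
ratings = {"v1": Rating(), "v2": Rating(), "v3": Rating(), "v4": Rating()}

# The labeller's ordering for this set, from highest (rank 0) to lowest (rank 3).
ordered = ["v3", "v1", "v4", "v2"]

# Each item is a one-member "team"; ranks follow the labeller's sort.
groups = [(ratings[name],) for name in ordered]
updated = rate(groups, ranks=list(range(len(ordered))))

# Store the updated ratings back against their items.
for name, (new_rating,) in zip(ordered, updated):
    ratings[name] = new_rating
```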

To learn more

If you’re interested in learning more, we reported this research in detail in the following publications:

Setwise Comparison: Consistent, Scalable, Continuum Labels for Machine Learning
Advait Sarkar, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D’Souza, Peter Kontschieder, Samuel Rota Bulò, Lorcan Walsh, Christian P. Kamm, Yordan Zaykov, Abigail Sellen, Siân E. Lindley
Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI 2016) (pp. 261–271)

Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support – a proof of concept study
Saskia Steinheimer, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi, Jessica Burggraaff, Peter Kontschieder, Frank Dahlke, Abigail Sellen, Bernard M. J. Uitdehaag, Ludwig Kappos, Christian P. Kamm
Disability and Rehabilitation, 2019
(This was a writeup of our 2016 CHI paper for a medical audience)