Ask people to order things, not score them

Ever graded an essay? Scored interview candidates? Rated an item on Amazon? Liked a video on YouTube?

We’re constantly asked to rate or score things on absolute scales. It’s convenient: you only have to look at each thing once to give it a score, and once you’ve got a set of things all reduced to a single number, you can compare them, group them into categories, and find the best one (and the worst).

However, a growing body of evidence shows that humans are simply not very good at giving absolute scores to things. By “not very good”, we mean there are two problems:

  • Different people give different scores to the same thing (low inter-rater reliability)
  • The same person can give different scores to the same thing, when asked to score it repeatedly (low intra-rater reliability)

But don’t worry! There’s a better way: ordering things, not scoring them. Let me illustrate with two case studies.

Making complex text easier to read

A cool modern application of artificial intelligence / machine learning is “lexical simplification”, which is an ironically fancy way of saying “making complex text easier to read by substituting complex words with simpler synonyms”. This is a great way to make text accessible to young readers and those not fluent in the language. Finding synonyms for words is easy, but detecting which words in a sentence are “complex” is hard.

To teach the AI system what counts as a complex word and what doesn’t, we need to give it a bunch of labelled training examples. That is, a list of words that have already been labelled by humans as being complex or not. Now traditionally, this dataset was generated by giving human labellers some text, and asking them to select the complex words in that text. This is a simple scoring system: every word is scored either 1 or 0, depending on whether the word is complex or not.

However, we knew from previous research that people are inconsistent in giving these absolute scores. So, my student Sian Gooding set out to see if we could do better. She conducted an experiment where half the participants used the old labelling system, and the other half used a sorting system. In the sorting system, participants were given some text, and asked to order the words in that text from least to most complex.
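
To make the two label formats concrete, here is a toy illustration (the sentence and labels are invented, not data from the study): the binary scheme assigns each word a 0 or 1, while the sorting scheme produces a ranking of the words in the sentence.

```python
# Invented example, for illustration only (not data from the study).
sentence = "She perused the perplexing document"

# Binary labelling: each word is scored 1 (complex) or 0 (not complex).
binary_labels = {"She": 0, "perused": 1, "the": 0, "perplexing": 1, "document": 0}

# Sorting: the same words, ordered from least to most complex.
ordering = ["the", "She", "document", "perplexing", "perused"]
```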

We found that with the sorting system, participants were far more consistent and created a far better labelled training set!

Helping clinicians assess multiple sclerosis

The Microsoft ASSESS-MS project aimed to use the Kinect camera (which captures depth information as well as regular video) to assess the progression of multiple sclerosis. The idea is that because MS causes degeneration of motor function that manifests in movements such as tremor, it should be possible to use computer vision to track and understand a patient’s movements with the Kinect camera, and assign them a score corresponding to the severity of their illness.

To train the system, we first needed a set of labelled training videos. That is, videos of patients for which neurologists had already provided the severity of illness scores. The problem was that the clinicians were giving scores on a standardised medical scale of 0 to 4, but their scores were suffering from poor consistency! With inconsistent scores, there was little hope that the computer vision system would learn anything.

The video illustrates our deck-sorting interface for clinicians.

Our solution was to ask clinicians to sort sets of patient videos. We found that giving clinicians “decks” of about 8 videos to sort in order of illness severity worked well – any more than that and the task became too challenging. But we had nearly 400 videos that needed labels. To go from orderings of 8 videos at a time to a full ordering of the entire dataset, we needed an additional step. For this, we used the TrueSkill algorithm, which can merge the results from many small orderings (how exactly we did this is detailed in our paper, which you can read here (PDF)).
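
For readers curious what merging orderings can look like in practice, here is a minimal sketch using the open-source Python trueskill package. The video IDs and decks are made up, and this is not the exact pipeline from our paper (see the PDF for that); it just shows the core idea of treating each sorted deck as a “match” and letting the rating algorithm aggregate the results into one global ordering.

```python
# Minimal sketch (not the paper's exact pipeline): aggregating many small
# clinician orderings into one global severity ordering with TrueSkill.
# Requires the open-source `trueskill` package; video IDs and decks are made up.
from collections import defaultdict
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # clinicians gave strict orderings
ratings = defaultdict(env.create_rating)         # one rating per video

# Each deck is a small set of videos a clinician sorted from least to most severe.
decks = [
    ["v1", "v7", "v3", "v9"],
    ["v3", "v2", "v9", "v5"],
    ["v5", "v1", "v7", "v2"],
]

for deck in decks:
    # TrueSkill treats rank 0 as the "winner", so feed the deck most-severe-first;
    # a higher skill estimate (mu) then means greater severity.
    ordered = list(reversed(deck))
    groups = [(ratings[v],) for v in ordered]
    updated = env.rate(groups, ranks=list(range(len(ordered))))
    for v, (new_rating,) in zip(ordered, updated):
        ratings[v] = new_rating

# Global ordering of all videos, from least to most severe.
print(sorted(ratings, key=lambda v: ratings[v].mu))
```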

To our amazement, we found that the resulting scores were significantly more consistent than anything we had previously measured, and handily exceeded clinical gold standards for consistency.

But why does it work?

It’s not yet clear why people are so much better at ordering than scoring. One hypothesis is that it requires people to provide less information. When you score something on a scale of 1-10, you have 10 choices for your answer. But when you compare two items A and B, you only have 3 choices: is A less than B, or is B less than A, or are they equal? However, this hypothesis doesn’t explain what Sian and I saw in the word complexity experiment, since in the scoring condition, users were only assigning scores of 0 or 1. Another hypothesis is that considering how multiple items relate to each other gives people multiple reference points, leading to better decisions. More research is required to test these hypotheses.
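
As an aside, here is the back-of-the-envelope arithmetic behind that first hypothesis, and why it falls short: assuming every answer is equally likely, a binary label actually carries less information than a three-way comparison, yet comparison still produced the more consistent labels.

```python
import math

# Information carried by a single judgement, assuming all answers equally likely.
print(math.log2(10))  # ~3.32 bits for a score on a 1-10 scale
print(math.log2(3))   # ~1.58 bits for a comparison (A < B, B < A, or equal)
print(math.log2(2))   # 1 bit for a binary complex / not-complex label
```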

In conclusion

People are asked to score things on absolute scales all the time, but they’re not very good at it. We’ve shown that people are significantly better at ordering things across a variety of domains, including identifying complex words and assessing the severity of multiple sclerosis, although we’re not quite sure why.

The next time you find yourself assigning absolute scores to things – try ordering them instead. You might be surprised at the clarity and consistency it brings!

And now, a summary poem:

I wished to know the truth about this choice
And with no guide I found myself adrift
No measure, no register, no voice
But when juxtaposed with others,
brought resolution swift.

Black and white, true and false, desire:
Nature makes a myriad form of each.
Context drives our understanding higher,
To compare things brings them well within our reach.

Want to learn more about our studies? See the publication details below:

Sarkar, Advait, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff et al. “Setwise comparison: Consistent, scalable, continuum labels for computer vision.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 261-271. ACM, 2016. https://doi.org/10.1145/2858036.2858199. Download PDF

Gooding, Sian, Ekaterina Kochmar, Alan Blackwell, and Advait Sarkar. “Comparative judgments are more consistent than binary classification for labelling word complexity.” In Proceedings of the 13th Linguistic Annotation Workshop, pp. 208-214. 2019. https://doi.org/10.18653/v1/W19-4024. Download PDF

Steinheimer, Saskia, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi et al. “Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support–a proof of concept study.” Disability and rehabilitation (2019): 1-7. https://doi.org/10.1080/09638288.2018.1563832

Human language isn’t the best way to chat with Siri or Alexa, probably

The year is 2019. Voice-controlled digital assistants are great at simple commands such as “set a timer…” and “what’s the weather?”, but frustratingly little else.

Human language seems to be an ideal interface for computer systems; it is infinitely flexible and the user already knows how to use it! But there are drawbacks. Computer systems that aim to understand arbitrary language are really hard to build, and they also create unrealistic expectations of what the system can do, resulting in user confusion and disappointment.

The next frontier for voice assistants is complex dialogue in challenging domains such as managing schedules, analysing data, and controlling robots. The next generation of systems must learn to map ambiguous human language to precise computer instructions. The mismatch between user expectations and system capabilities is only worsened in these scenarios.

What if we could preserve the familiarity of natural language, while better managing user expectations and simplifying the system to boot? That’s exactly what my student Jesse Mu set out to study. The idea was to use what we called a restricted language interface, one that is a well-chosen subset of full natural language.

Jesse designed an experiment where participants played an interactive computer game called SHRDLURN. In this game, the player is given a set of blocks of different colours, and a “goal”, which is the winning arrangement of blocks. The player types instructions to the computer such as “remove the red blocks” and the computer tries to execute the instruction. The interesting bit is that the computer doesn’t understand language to begin with. In response to a player instruction, it presents the player with a list of block arrangements, and the player picks the arrangement that fits their instructions. Over time, the computer learns to associate instructions with the correct moves, and the correct configuration starts appearing higher up in the list. The system is perfectly trained when the first guess on its list is always the one the player intended.
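
Here is a toy sketch of that learning loop (our own simplification for illustration, not the actual SHRDLURN implementation, which uses a full semantic parser): the system ranks candidate moves by word–move associations it has learned so far, and strengthens those associations whenever the player picks the move they actually meant.

```python
# Toy sketch of the learning loop described above (not the real SHRDLURN code):
# the system scores candidate moves by learned word-move associations, and
# updates those associations when the player picks the intended move.
from collections import defaultdict

weights = defaultdict(float)  # (word, move) -> learned association strength

def rank_candidates(utterance, candidate_moves):
    """Order candidate moves, best guess first, by learned word-move weights."""
    words = utterance.lower().split()
    def score(move):
        return sum(weights[(w, move)] for w in words)
    return sorted(candidate_moves, key=score, reverse=True)

def learn(utterance, chosen_move, candidate_moves):
    """Perceptron-style update: reward the chosen move, penalise the others."""
    words = utterance.lower().split()
    for move in candidate_moves:
        delta = 1.0 if move == chosen_move else -0.1
        for w in words:
            weights[(w, move)] += delta

# Example round: the player says "remove the red blocks", scrolls through the
# candidates, and picks the one matching their intent.
candidates = ["remove_red", "remove_cyan", "add_red"]
print(rank_candidates("remove the red blocks", candidates))  # untrained guess
learn("remove the red blocks", "remove_red", candidates)
print(rank_candidates("remove the red blocks", candidates))  # intended move now first
```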

The figure above shows some example levels from the game. How would you instruct a computer to go from the start to the goal?

Sixteen participants took part in our experiment. Half of them played the game with no restrictions; the other half were only allowed to use the following 11 words: all, cyan, red, brown, orange, except, leftmost, rightmost, add, remove, to.
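
As an illustration of what such a restriction can look like in an interface, here is a small sketch that flags out-of-vocabulary words. (In the study itself the restriction was given as an instruction to participants; this enforcement code is ours.)

```python
# Sketch of enforcing the restricted vocabulary in an interface. In the study
# the restriction was an instruction to participants; this check is illustrative.
ALLOWED = {"all", "cyan", "red", "brown", "orange", "except",
           "leftmost", "rightmost", "add", "remove", "to"}

def out_of_vocabulary(command: str) -> list[str]:
    """Return the words in a command that fall outside the allowed vocabulary."""
    return [w for w in command.lower().split() if w not in ALLOWED]

print(out_of_vocabulary("remove the red blocks"))  # ['the', 'blocks'] -> reject
print(out_of_vocabulary("remove leftmost red"))    # [] -> accept
```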

We measured the quality of the final system (i.e., how successfully the computer learnt to map language to instructions) as well as the cognitive load on participants. We found, unsurprisingly, that in the non-restricted setting people used a much wider variety of words, and much longer sentences. However, the restricted language participants seemed to be able to train their systems more effectively. Participants in the restricted language setting also reported needing less effort, and perceived their performance to be higher.

The figure above illustrates gameplay. A: The game, with start and goal states and two intermediate states. B: The player issues a language command; the “use only…” message appears only for players in the restricted condition. C: The player scrolls through candidate configurations until she finds the one matching the meaning of the command. The correct interpretation (bottom) solves the puzzle.

By imposing restrictions, we achieved the same or better system performance without detriment to the user experience – indeed, participants reported lower effort and higher performance. We think that a guided, consistent language helps users understand the limitations of a system. That’s not to say we’ll never desire a system that understands arbitrary human language. But given the current capabilities of AI systems, we will see diminishing returns in user experience and performance by attempting to accommodate arbitrary natural language input. Rather than considering one of two extremes (a specialised graphical user interface vs a completely natural language interface), designers should consider restricted language interfaces, which trade off full expressiveness for simplicity, learnability and consistency.

Here’s a summary in the form of a poem:

It was not meant to be this way
You cannot understand

This human dance of veiled intent
The spoken word and written hand

But let us meet at halfway point
And share our thoughts with less

To know each other’s will and wish
— not guess

Want to learn more about our study? Read it here (click to download PDF) or see the publication details below:

Mu, Jesse, and Advait Sarkar. “Do We Need Natural Language?: Exploring Restricted Language Interfaces for Complex Domains.” In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, p. LBW2822. ACM, 2019. https://dl.acm.org/citation.cfm?doid=3290607.3312975

Talking to a bot might help with depression, but you won’t enjoy the conversation

Mental illness is a significant contributor to the global health burden. Cognitive Behavioural Therapy (CBT) provided by a trained therapist is effective. But CBT is not an option for many people who cannot travel long distances, or take the time away from work, or simply cannot afford to visit a therapist.

To provide more scalable and accessible treatment, we could use Artificial Intelligence-driven chatbots to provide a therapy session. It might not (currently) be as effective as a human therapist, but it is likely to be better than no treatment at all. At least one study of a chatbot therapist has shown limited but positive clinical outcomes.

My student Samuel Bell and I were interested in finding out whether chatbot-based therapy could be effective not just clinically, but also in terms of how patients felt during the sessions. Clinical efficacy is only one marker of a good therapy session. Others include sharing ease (i.e., does the patient feel able to confide in the therapist), smoothness of conversation, perceived usefulness, and enjoyment.

To find out, we conducted a study. Ten participants with sub-clinical stress symptoms took part in two 30-minute therapy sessions. Five participants had their sessions with a human therapist, conducted via chat through an internet-based CBT interface. The other five had therapy sessions with a simulated chatbot, through the same interface. At the end of the study, all participants completed a questionnaire about their experience.

We found that in terms of sharing ease and perceived usefulness, neither the human nor the simulated chatbot emerged as the clear winner, although participants’ remarks suggested that they found the chatbot less useful. In terms of smoothness of conversation and enjoyment, the chatbot was clearly worse.

Participants felt that the chatbot had a poor ability to “read between the lines”, and they felt that their comments were often ignored. One participant explained their dissatisfaction:

“It was a repetition of what I said, not an expansion of what I said.”

Another participant commented on the lack of shared experience:

“When you tell something to someone, it’s better, because they might have gone through something similar… there’s no sense that the robot cares or understands or empathises.”

Our study has a small sample size, but nonetheless points to clear deficiencies in chatbot-based therapy. We suggest that future research into chatbot CBT acknowledges and explores these areas of conversational recall, empathy, and the challenge of shared experience, in the hope that we may benefit from scalable, accessible therapy where needed.

Want to learn more about our study? Read it here (PDF) or see the publication details below:

Bell, Samuel, Clara Wood, and Advait Sarkar. “Perceptions of Chatbots in Therapy.” In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, p. LBW1712. ACM, 2019. https://dl.acm.org/citation.cfm?id=3313072