Ever graded an essay? Given scores to interview candidates? Given a rating to an item on Amazon? Liked a video on YouTube?
We’re constantly asked to rate or score things on absolute scales. It’s convenient: you only have to look at each thing once to give it a score, and once you’ve got a set of things all reduced to a single number, you can compare them, group them into categories, and find the best one (and the worst).
However, a growing body of evidence points to the fact that humans are simply not very good at giving absolute scores to things. By not very good, we mean there are two problems:
- Different people give different scores to the same thing (low inter-rater reliability)
- The same person can give different scores to the same thing, when asked to score it repeatedly (low intra-rater reliability)
But don’t worry! There’s a better way: ordering things, not scoring them. Let me illustrate with two case studies.
Making complex text easier to read
A cool modern application of artificial intelligence / machine learning is “lexical simplification”, which is an ironically fancy way of saying “making complex text easier to read by substituting complex words with simpler synonyms”. This is a great way to make text accessible to young readers and those not fluent in the language. Finding synonyms for words is easy, but detecting which words in a sentence are “complex” is hard.
To teach the AI system what counts as a complex word and what doesn’t, we need to give it a bunch of labelled training examples. That is, a list of words that have already been labelled by humans as being complex or not. Now traditionally, this dataset was generated by giving human labellers some text, and asking them to select the complex words in that text. This is a simple scoring system: every word is scored either 1 or 0, depending on whether the word is complex or not.
However, we knew from previous research that people are inconsistent in giving these absolute scores. So, my student Sian Gooding set out to see if we could do better. She conducted an experiment where half the participants used the old labelling system, and the other half used a sorting system. In the sorting system, participants were given some text, and asked to order the words in that text from least to most complex.
We found that with the sorting system, participants were far more consistent and created a far better labelled training set!
Helping clinicians assess multiple sclerosis
The Microsoft ASSESS-MS project aimed to use the Kinect camera (which captures depth information as well as regular video) to assess the progression of multiple sclerosis. The idea is that because MS causes degeneration of motor function that manifests in movements such as tremor, it should be possible to use computer vision to track and understands a patient’s movements with the Kinect camera, and assign them a score corresponding to the severity of their illness.
To train the system, we first needed a set of labelled training videos. That is, videos of patients for which neurologists had already provided the severity of illness scores. The problem was that the clinicians were giving scores on a standardised medical scale of 0 to 4, but their scores were suffering from poor consistency! With inconsistent scores, there was little hope that the computer vision system would learn anything.
The video illustrates our deck sorting interface for clinicians
Our solution was to ask clinicians to sort sets of patient videos. We found that giving clinicians “decks” of about 8 videos to sort in order of illness severity worked well – any more than that and the task became too challenging. But we wanted them to rate nearly 400 videos. To go from orderings of 8 videos at a time, to a full set of orderings for the entire dataset, we needed an additional step. For this, we used the TrueSkill algorithm, which is able to merge the results from many orderings (how exactly we did this is detailed in our paper, which you can read here (PDF)).
To our amazement, we found that the resulting scores were significantly more consistent than anything we had previously measured, and handily exceeded clinical gold standards for consistency.
But why does it work?
It’s not yet clear why people are so much better at ordering than scoring. One hypothesis is that it requires people to provide less information. When you score something on a scale of 1-10, you have 10 choices for your answer. But when you compare two items A and B, you only have 3 choices: is A less than B, or is B less than A, or are they equal? However, this hypothesis doesn’t explain what Sian and I saw in the word complexity experiment, since in the scoring condition, users were only assigning scores of 0 or 1. Another hypothesis is that considering how multiple items relate to each other gives people multiple reference points, leading to better decisions. More research is required to test these hypotheses.
People are asked to score things on absolute scales all the time, but they’re not very good at it. We’ve shown that people are significantly better at ordering things in a variety of domains, including identifying complex words, and assessing multiple sclerosis, although we’re not quite sure why.
The next time you find yourself assigning absolute scores to things – try ordering them instead. You might be surprised at the clarity and consistency it brings!
And now, a summary poem:
I wished to know the truth about this choice
And with no guide I found myself adrift
No measure, no register, no voice
But when juxtaposed with others,
brought resolution swift.
Black and white, true and false, desire:
Nature makes a myriad form of each.
Context drives our understanding higher,
To compare things brings them well within our reach.
Want to learn more about our studies? See the publication details below:
Sarkar, Advait, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff et al. “Setwise comparison: Consistent, scalable, continuum labels for computer vision.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 261-271. ACM, 2016. https://doi.org/10.1145/2858036.2858199. Download PDF
Gooding, Sian, Ekaterina Kochmar, Alan Blackwell, and Advait Sarkar. “Comparative judgments are more consistent than binary classification for labelling word complexity.” In Proceedings of the 13th Linguistic Annotation Workshop, pp. 208-214. 2019. https://doi.org/10.18653/v1/W19-4024. Download PDF
Steinheimer, Saskia, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi et al. “Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support–a proof of concept study.” Disability and rehabilitation (2019): 1-7. https://doi.org/10.1080/09638288.2018.1563832