Research through design, and the role of theory in Human-Computer Interaction


Every day, designers create the world around us: every website you’ve visited, every book or magazine you’ve read, every app you’ve used on your phone, every chair you’ve sat on. Almost everything around you has been consciously designed by someone.

Since this activity is so important, an area of academia known as design research is concerned with studying how design works, and how to do it better. The ultimate aim for design research is to create a theory for designing a particular thing (such as websites, or books, or apps, or chairs) that teaches us how to design that thing well. One way to try and produce these theories is to actually design a bunch of things (websites or books or apps or chairs), and then document what worked and what didn’t. This is the basic idea behind research through design. In the rest of this post, I’ll explain a bit more about research through design, as well as what we can realistically expect from the theories we produce using this process.

What’s research through design?
Design shifts the world from its current state into a “preferred” state through the production of a designed artefact. My PhD dissertation described the design of two visual analytics tools I developed, with a focus on documenting and theorising those aspects of the design that (a) facilitated the specific user tasks I identified as being important and (b) reduced expertise requirements for users. Thus, the approach to knowledge production was research through design (Frayling, 1993). The distinction between research through design and design as such is one of intent. In the former, design is practised with the primary intent of producing knowledge for the community of academics and practitioners. Consequently, the design artefact cannot stand in isolation: it must be accompanied by some form of discourse intended to communicate the knowledge it embodies to the community. Moreover, this discourse must make explicit how the artefact is sufficiently novel to contribute to knowledge. In a non-research design activity, neither annotative discourse nor novelty is necessary for success.

Zimmerman et al. (2007) propose four general criteria for evaluating research through design contributions: process, invention, relevance, and extensibility. Process refers to the rigour and rationale of the methods applied to produce the design artefact; invention refers to the degree of academic novelty; relevance refers to the potential of the contribution to have a wider impact; and extensibility refers to the ability of the knowledge, as documented, to be built upon by future research.

Why is there no such thing as a complete theory of design?
When designing systems in some domain, it may seem an attractive proposition to seek a theory of design that not only characterises the specific nature of these systems and how their important properties may be measured, but also prescribes a straightforward, deterministic strategy for designing them. When I started out in my research, I wanted to produce a theory for how to design systems that would let non-experts use visual tools to perform statistics and machine learning. I initially anticipated that such a prescriptive theory would be elusive for multiple reasons, including the nascency of interactive machine learning, the incomplete characterisation of potential applications, and a wariness of the challenges surrounding “implications for design” (Stolterman, 2008).

Towards the end of my PhD, I came to the position (and I still hold it) that a complete design theory is not only elusive, but impossible – not just for visual analytics tools, but for any design domain. This is because theory underspecifies design, and design underspecifies theory (Gaver, 2012). Theory underspecifies design because a successful design activity must culminate as an ultimate particular (Stolterman, 2008): an instantiated, designed artefact, subject to innumerable decisions, situated in a particular context, and limited by time and resource constraints. Design problems are inherently wicked problems (Buchanan, 1992); they can never be formulated to a level of precision that affords ‘solving’ through formal methods, and no theory for design can profess to provide a recommendation for every design decision. Conversely, design underspecifies theory, in the sense that an ultimate particular will fail to exemplify some, or even many, of the nuances captured in an articulated theory.

This is not to say that we should do away with theory altogether and focus solely on artefacts themselves. Gaver’s view, to which I am sympathetic, is that design theory is “provisional, contingent, and aspirational”. The aim of design theory is to capture and communicate knowledge generated during the design process, in the belief that it may sometimes, but not always, lead to successful designs in the future.

This post is heavily based on an excerpt from my 2016 PhD thesis:
Interactive analytical modelling, Advait Sarkar, PhD, University of Cambridge, 2016

References

Frayling, Christopher. Research in art and design. Royal College of Art, London, 1993.

Zimmerman, John; Forlizzi, Jodi, and Evenson, Shelley. Research through design as a method for interaction design research in HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 493–502. ACM, 2007.

Stolterman, Erik. The nature of design practice and implications for interaction design research. International Journal of Design, 2(1), 2008.

Gaver, William. What should we expect from research through design? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 937–946. ACM, 2012.

Buchanan, Richard. Wicked problems in design thinking. Design Issues, 8(2):5–21, 1992.

Setwise Comparison: a faster, more consistent way to make judgements

I originally wrote this post in 2016 for the Sparrho blog.

Have you ever wondered whether doctors are consistent in their judgements? In some cases, they really aren’t. When asked to rate videos of patients with multiple sclerosis (a disease that causes impaired movement) on a numeric scale from 0 (completely healthy) to 4 (severely impaired), clinicians struggled to be consistent, often giving the same patient different scores at different times, and disagreeing amongst themselves. This difficulty is quite common, and not unique to doctors: people often have to assign scores to difficult, abstract concepts, such as “How good was a musical performance?” or “How much do you agree or disagree with this statement?” Research has shown time and time again that people are fundamentally inconsistent at this type of activity, no matter the setting or their level of expertise.

The field of ‘machine learning’, which can help to automate such scoring (e.g. automatically rating patients according to their disability), is based on the idea that we can give the computer a set of examples for which the score is known, in the hope that the computer can use these to ‘learn’ how to assign scores to new, unseen examples. But if the computer is taught from examples where scores have been assigned inconsistently, it learns to assign inconsistent, unusable scores to new, unseen examples in turn.
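As a toy illustration of this point (and only an illustration, using made-up data rather than anything from the study), the sketch below fits the same simple model to two versions of a hypothetical 0–4 severity rating: one scored consistently, one scored inconsistently. The model trained on inconsistent scores typically makes noticeably worse predictions.

```python
# A toy sketch (not from the paper) of "garbage in, garbage out":
# the same simple model is trained on consistently and inconsistently
# scored examples of a hypothetical 0-4 severity rating.
import numpy as np

rng = np.random.default_rng(0)

# One measured feature per example (e.g. a movement metric), and a
# hypothetical "true" severity that depends on it linearly.
x = rng.uniform(0, 1, 200)
true_score = 4 * x

# Consistent raters deviate only slightly from the true score;
# inconsistent raters give widely varying scores for the same example.
consistent_labels = true_score + rng.normal(0, 0.1, x.size)
inconsistent_labels = true_score + rng.normal(0, 1.0, x.size)

# Fit the same model (a least-squares line) to each set of labels.
slope_c, intercept_c = np.polyfit(x, consistent_labels, 1)
slope_i, intercept_i = np.polyfit(x, inconsistent_labels, 1)

# Compare predictions against the true score on unseen examples;
# the model trained on inconsistent labels is usually several times worse.
x_new = rng.uniform(0, 1, 1000)
err_c = np.mean(np.abs(slope_c * x_new + intercept_c - 4 * x_new))
err_i = np.mean(np.abs(slope_i * x_new + intercept_i - 4 * x_new))
print(f"error with consistent labels:   {err_c:.3f}")
print(f"error with inconsistent labels: {err_i:.3f}")
```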

To solve this problem, we brought together an understanding of how humans work with some mathematical tricks. The fundamental insight is that it is easier and more consistent for humans to provide preference judgements (e.g. “is this higher/lower/equal to that?”) than absolute value judgements (e.g. “is this a 4 or a 5?”). The problem is that even with as few as 50 items to score, there are already 50 × 49 = 2,450 ways of pairing them up, and this balloons to nearly 10,000 comparisons for 100 items. Clearly, this doesn’t scale.

We get around this with a mathematical insight: if you’ve compared A to B, and B to C, you can guess with reasonably high accuracy what the relationship is between A and C. This ‘guessing’ is done with a computer algorithm called TrueSkill, which was originally invented to rank players of multiplayer games by their skill, so that they could be better matched to online opponents. Using TrueSkill, we can cut the number of comparisons required substantially, so that adding more items no longer causes an explosion in the number of comparisons. This study has advanced our understanding of how people quantify difficult concepts, and presents a new method that balances the strengths of people and computers to help people efficiently and consistently assign scores to many items.
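To make this concrete, here is a minimal sketch using the open-source trueskill Python package (pip install trueskill); it illustrates the general principle rather than reproducing our implementation. After two explicit judgements, A over B and B over C, the inferred ratings already place A above C, even though A and C were never compared directly.

```python
# Illustrative sketch using the open-source `trueskill` package
# (pip install trueskill); not the code used in the study.
from trueskill import Rating, rate_1vs1

# Every item starts with the same prior skill estimate.
ratings = {"A": Rating(), "B": Rating(), "C": Rating()}

# Judgement 1: the assessor prefers A over B.
ratings["A"], ratings["B"] = rate_1vs1(ratings["A"], ratings["B"])

# Judgement 2: the assessor prefers B over C.
ratings["B"], ratings["C"] = rate_1vs1(ratings["B"], ratings["C"])

# A and C were never compared directly, yet their inferred scores
# already place A above C (with some remaining uncertainty, sigma).
for item, r in sorted(ratings.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{item}: mu = {r.mu:.2f}, sigma = {r.sigma:.2f}")
```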

Why is this important for researchers in fields other than computer vision?

This study shows a new way to quickly and consistently have humans rate items on a continuous scale (e.g. “rate the happiness of the individual in this picture on a scale of 1 to 5”). It works through the use of preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”), combined with an algorithmic ranking system which can reduce the need to compare every item with every other item. This was initially motivated by the need to have higher-quality labels for machine learning systems, but can be applied in any domain where humans have difficulty placing items along a scale. In our study we showed that clinicians can use our method to achieve far higher consistency than was previously thought possible in their assessment of motor illness.

We built a nifty tool to help clinicians perform Setwise Comparison, which you can see in the video below: https://www.youtube.com/watch?v=Q1hW-UXU3YE

Why is this important for researchers in the same field?

This study describes a novel method for efficiently eliciting high-consistency continuous labels for use as machine learning training data when the concept being labelled has unclear boundaries, a common scenario in domains such as affect recognition, automated sports coaching, and automated disease assessment. Label consistency is improved through the use of preference judgements: labellers sort training data on a continuum, rather than providing absolute value judgements. Efficiency is improved by comparing items in sets (as opposed to pairwise) and by using probabilistic inference, via the TrueSkill algorithm, to infer relationships between data that have not been explicitly compared. The system was evaluated in a real-world case study of clinicians assessing motor degeneration in multiple sclerosis (MS) patients, where it achieved an unprecedented level of consistency, exceeding widely accepted clinical ‘gold standards’.
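The setwise element can be sketched in the same way with the open-source trueskill package: each item in a sorted set is treated as a single-member ‘team’, and the assessor’s ordering of the whole set is passed to the ranker as one result. The item names below are made up for illustration.

```python
# Illustrative sketch of a setwise update using the open-source
# `trueskill` package; item names are hypothetical.
from trueskill import Rating, rate

items = ["video_1", "video_2", "video_3", "video_4"]
ratings = {v: Rating() for v in items}

# One setwise judgement: the assessor sorts the whole set on the
# continuum, e.g. from least to most impaired.
sorted_set = ["video_3", "video_1", "video_4", "video_2"]

# Each item is a single-member "team"; its rank is its position in the sort.
rating_groups = [(ratings[v],) for v in sorted_set]
updated = rate(rating_groups, ranks=list(range(len(sorted_set))))

for v, (new_rating,) in zip(sorted_set, updated):
    ratings[v] = new_rating
    print(f"{v}: mu = {new_rating.mu:.2f}")

# A single sorted set of k items implies k*(k-1)/2 pairwise preferences,
# which is why setwise comparison needs far fewer judgements than
# comparing every pair explicitly.
```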

To learn more

If you’re interested in learning more, we reported this research in detail in the following publications:

Setwise Comparison: Consistent, Scalable, Continuum Labels for Machine Learning
Advait Sarkar, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D’Souza, Peter Kontschieder, Samuel Rota Bulò, Lorcan Walsh, Christian P. Kamm, Yordan Zaykov, Abigail Sellen, Siân E. Lindley
Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI 2016) (pp. 261–271)

Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support – a proof of concept study
Saskia Steinheimer, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi, Jessica Burggraaff, Peter Kontschieder, Frank Dahlke, Abigail Sellen, Bernard M. J. Uitdehaag, Ludwig Kappos, Christian P. Kamm
Disability and Rehabilitation, 2019
(This was a writeup of our 2016 CHI paper for a medical audience)