Product design and the myth of faster horses

Product design often encounters a tension between solving observable customer needs (reactive design), and inventing novel experiences without concrete basis in current customer behaviour, but which designers believe will be valuable (proactive design). Both reactive and proactive design can produce successful results. However, the practical question remains: given that most product design teams have finite time and resources, which should be prioritised as the default approach for design practice?

In this article, I explore the differences between the reactive and proactive approaches, note some of their advantages and limitations, and argue that reactive design, grounded in empirical studies of users, is almost always the better choice.

The debate between reactive and proactive design

Reactive design seeks to improve people’s current experiences by listening carefully to what they need and continuously evolving a product. Proactive design seeks to invent new categories of experience, making large, disjoint leaps in the product on the basis of the designer’s skill and intuition. While reactive design results in fairly straightforward and predictable improvements, there is always uncertainty around whether a proactive design will succeed. Reactive and proactive design therefore differ in their aims, the resources they draw upon, and their impact on the product.

A comparison of reactive and proactive design.

At first glance, proactive design sounds more impressive and attractive, and in contrast, reactive design sounds dull, incremental and boring. Who wouldn’t want to be inventing new categories of experience based on their craft skill and intuition? This approach puts the designer in the role of a creative hero-saviour, supported by our cultural myth of the lone creative genius.

Reactive design resembles applied research

The differences between reactive and proactive design bear a resemblance to the much older debate around basic versus applied research, and in particular the question of which to fund. Perhaps the oldest example of organised funding for research (at least in the West) is the British Board of Longitude, founded in 1714, which disbursed several grants over its more than 100-year history to those seeking solutions to the problem of accurately determining longitude at sea. This was a decidedly applied problem that nonetheless resulted in the production of much basic knowledge in astronomy, as well as applied knowledge in horology.

As government-funded research programmes gradually came to be seen as indispensable tools of nation-building, the approach of funding only solutions to specific and pertinent current problems was derided for its naïve short-sightedness by those who saw the value of fostering innovation for its own sake. By the mid 1800s, spearheaded by William Whewell, the British Association for the Advancement of Science had pioneered the disbursal of scientific grants to works-in-progress towards all sorts of scientific aims, a pattern that gradually spread throughout Europe (but which the elitist Royal Society lagged in adopting, in part because such grants enabled ordinary people and not just the leisure class to participate in science).

In 1979, Cosmic Search magazine gleefully mocked reactive research through the tale of Lord Allsen, set in the early 1800s. In this story (fictitious, as far as I can tell), Allsen short-sightedly advocates funding only research with immediate benefits, brushing aside Hans Christian Oersted’s discovery that a current passed through a wire affects the needle of a nearby compass:

It would be a waste of the taxpayers’ money, he pointed out, to spend even one pence to find out more about what a wire would do to a compass. What was needed, he said, was more practical research like developing longer burning, brighter candles that didn’t need their wicks trimmed as often or breeding faster horses so that messages could be carried between cities more swiftly. Lord Allsen promised to do all he could to support research for better candles and faster horses.

[… nonetheless, the Danish Parliament decides to fund Oersted …]

Oersted extended his research and this led to further work by Faraday, Maxwell, Hertz, Edison and many others so that today we have electric motors, generators, and lights, the telegraph and telephone, radio, television, electronic computers and a host of other devices. It is interesting to speculate that if Lord Allsen and other Lord Allsens had had their way we might enjoy none of these today, the world would have been spared the blight of everything electrical but we might have better candles and horses so fast that, working in relays, a message could reach Chicago from New York in no more than 3 days.

The tale convincingly makes the case against the reactive approach. Indeed, I am personally in favour of funding basic research. However, product design is not research: the objective of the former is to make a product for people to use; the objective of the latter is to make knowledge. Because research is an imperfect analogy for design, the arguments for basic research cannot be directly interpreted as arguments for proactive design.

Nonetheless, proactive design and basic research share an underlying logic: that cues from the present environment alone cannot tell you what is useful, interesting or worth pursuing. Conversely, reactive design and applied research both start from the premise that in order to be worth pursuing, an idea must be grounded in cues from the environment. The former looks within, the latter looks without, showcasing the timeless tension between Aristotelian and Empirical approaches to knowledge-making.

Related debates, which we will not enter here, concern whether human development reflects “continuities or mutations”, and the “push and pull” models of innovation.

In a corporate research lab where product teams work closely with research teams, the different objectives of research and design (i.e., making knowledge versus making products) create an inherent tension, both between the teams and within the aims of the work itself. Managed well, this tension can produce good research and good products; managed poorly, it can destroy both the research and the product.

A digression: Henry Ford and his “faster horses”

Henry Ford is said to have summarised another case for proactive design thus: “If I had asked my customers what they wanted, they would have said a faster horse”. The argument is that people’s needs are latent: because people can only articulate their needs in terms of their existing experiences, it is impossible to produce genuinely new ideas simply by listening to what they say.

The 1908 Ford Model T: far from unimaginable by the ordinary customer. Source: Wikimedia Commons

I’m not a fan of the “faster horses” quote, for several reasons.

For one thing, there is no evidence that Henry Ford actually said it. For another, it doesn’t accurately reflect public opinion and technical knowledge at the time. Benz was selling automobiles in the 1890s, and the Oldsmobile was mass-produced from the 1900s, many years before the Model T. And while cars were then still luxuries inaccessible to most, fast and affordable steam passenger travel had been operating for nearly 100 years. There is therefore no reason that in 1908, the year the Model T was released, a customer interviewed by Ford about their travel needs would be so blinkered as to ask only for faster horses.

The third reason I’m not a fan of the “faster horses” parable is that it implicitly positions the automobile as an unambiguously good solution to the problem of public transport. With hindsight, it is clear that the personal automobile has been a disaster for city planning, the environment, and the fabric of society.

The final reason is that Ford was a well-documented antisemite who used his power, influence, and the Ford dealership network to publish material stoking anti-Jewish sentiment throughout America (and later the Weimar Republic). Viewed through this lens, his disregard for customer input in the design process can be seen as another manifestation of the idea that some people’s thoughts and opinions are inherently better than others’.

All these issues notwithstanding, because it is so widely known, the metaphor of faster horses is useful to communicate the spirit of the conundrum between reactive and proactive design.

The perils of being proactive

The parables of the faster horses and Lord Allsen, and more generally the history of successful basic research, would suggest that the proactive approach is preferable and produces better results. However, this is not the case, for at least two important reasons.

Most design work is evolutionary

The first major problem with the proactive approach is that most of the time, the everyday work of design is not revolutionary, it is evolutionary. The science historian Thomas Kuhn refers to the activity of “normal science” as “puzzle-solving”, where a well-defined problem is addressed using a well-defined set of rules. Most of the problems designers face are of this kind: How can we help users understand this aspect of the system? Is it clear to the user what to do here? Are we overwhelming the user with information? Could we make this process faster and less error-prone? These questions can only be answered by grounding design decisions in actual user behaviour, i.e., by taking a reactive stance.

Even the poster child of “revolutions” in the technology sphere, the iPhone, has been developed through the gradual accumulation of evolutionary ideas. As Apple analyst John Gruber notes:

That’s how Apple builds its platforms. It’s a slow and steady process of continuous iterative improvement—so slow, in fact, that the process is easy to overlook if you’re observing it in real time. Only in hindsight is it obvious […] Apple’s iterative development process doesn’t just add, it adapts. […] We may never see an iPhone that utterly blows away the prior year’s, but we’ll soon have one that utterly blows away the original iPhone.

The absurdity of calling anything produced by Silicon Valley a “revolution” notwithstanding (to see why it is absurd, one need only compare tech product “revolutions” and their societal impact to those of the Agricultural, Industrial, and French varieties…), even if we take the original iPhone to be an instance of revolution, it is clear that most of the design work done on the iPhone in its now 15-year history is evolutionary and reactive in nature.

As a design tool, and an antidote to the hubristic tech saviour complex of Silicon Valley design practice, I propose what we might call the humility razor. The humility razor is the principle that the solution to any given design problem is usually an evolution of what has come before. It is named by analogy to Occam’s razor (prefer the simpler explanation) or Hanlon’s razor (never attribute to malice that which is adequately explained by stupidity).

Technology and society shape each other in unpredictable ways

The second major problem with the proactive approach is that technology shapes society (Veblen’s technological determinism) and society shapes technology (via Durkheim’s social determinism). The design of technology shapes and persuades users to behave in a certain way, but conversely users adapt, modify, extend, subvert, and repurpose technology for use in ways completely unanticipated by the designers. The combination of these two forces is sometimes referred to, perhaps unimaginatively, as mutual shaping.

Mutual shaping means that there are some aspects of design that can only be understood and determined by observing the interaction between technology and society. Examples abound as to why this must be the case.

The television remote control was not part of the initial television viewing experience, but could only have been developed in time in response to user behaviour and desires (namely, not sitting within arm’s reach of the television) as well as changes in the media landscape (having many television channels meant having more reasons to interact with the television).

The Internet was primarily intended for government use, with restrictions on commercial use until the mid-1990s; the Web (which is the part of the Internet that delivers websites, and which most people think of as being synonymous with the Internet) did not arrive until 30 years after the first government networks. The Web could only have developed in response to the growing adaptive use of the Internet, mostly by academics, as a place to store, browse, and navigate documents, for which the previous paradigms of internet use (centred around the retrieval of static text files) were grossly inadequate.

The rigid conformity of the modern social network (highly limited personalisation, finite well-defined user interactions) exemplified by Facebook, could only have been invented in response to the decade-long experiment in the “cacophony of pimped profiles” that was MySpace. MySpace laid the groundwork for Facebook in at least two ways. First, it normalised the transformation of Internet users (in the West) into immaterial labourers. Second, it showed how excessive personalisation enabled bad actors and also leeched user time and attention from their primary “job” of creating monetisable, advertising-friendly content.

Cases such as the television remote, the Web, and modern social networking demonstrate that the need for observation runs deeper than simply releasing an imperfect version of a product and refining it in response to user feedback; they show that entire features and categories of user needs, impossible to anticipate, can emerge from the interaction between people and technology. Under a mutual shaping regime, product design needs to react to survive.

Product designers: study your customer

Given the issues posed by mutual shaping and the maxim of the humility razor, it should come as no surprise that for most design work, I would advocate for a reactive approach: listening to customers, observing real data about the interaction between people and technology, and trying to solve their problems.

Don’t just build what the customer asks for

However, reactive design can also be done poorly. Asking customers what they want and then literally building what they ask for is a bad version of reactive design. Customers can only articulate their needs in terms of familiar reference points: technologies they have already interacted with or know of. They do not (necessarily) understand the capabilities or constraints of technology. The design researcher’s job is to interpret these articulations (“faster horses”) and discover the underlying need (“efficient transport”). One way to do this is through repeated questioning during user studies; the ‘five whys’ technique attributed to Sakichi Toyoda (of Toyota Motor Corporation) can be effective here. When the customer says they want feature X, ask why. When they respond, ask why again. And so on, until the root cause of the problem is determined.

Another issue with reactive design is that because customers are not familiar with the capabilities of technology, they may not realise that some aspect of their technology use can be improved; they cannot problematise (i.e., view as a “problem”) aspects of their technology use that a trained designer can spot. Here again, relying on what customers say will betray the design researcher. “Customers didn’t say X was a problem” doesn’t necessarily mean that it isn’t worth improving the experience of X.

“Problems customers don’t know they have” or “unknown unknowns” can usually be detected with a careful analysis of behavioural observations. Design researchers are trained to consider the entire user workflow, and spot opportunities where steps can be simplified, time and effort saved, errors reduced, joy and delight introduced, and more value added. These opportunities may not be apparent to the user, but in observing users interact with the system, they become apparent to the researcher. Observing users is key, and is what makes the approach reactive.

Proactive design can still be a valuable tool

The reactive approach does not dismiss the importance of the designer’s craft skill and intuition. These are valuable tools both in determining the root cause of a problem as well as detecting unarticulated opportunities from user observations.

Nor does the reactive approach dismiss out of hand any idea that does not have its basis in empirical observation. Design ideas, even good ones, can arise ex nihilo (although introspection often reveals that an external observation was the source of inspiration). In such cases, the spirit of proactive design can be captured quite well using contemporary research techniques. To test a potentially revolutionary idea, it is not necessary to invest enormous time and effort into building and launching a product. Rather, these ideas can be evaluated by investing as little as possible in communicating the idea effectively, to see if it resonates with customers.

Paper prototyping is an inexpensive way of testing a design idea without investing a lot of engineering effort. Source: Wikimedia Commons

Techniques such as paper prototyping (mocking up an interface out of paper sketches), Wizard of Oz studies (where you pretend that a system works, but behind the scenes a person is manipulating the interface to make it look like it works), and even design fictions (stories written to envision an alternative future) can help customers experience and react to an idea. If the idea is revealed through such studies to be less brilliant than the inventors think (as is often the case), then the crisis that would have arisen had they plunged headlong into building it has been inexpensively averted. In this way, it is possible to incorporate the best of both reactive and proactive approaches into design practice.


Reactive design emphasises direct responses to observable user needs; proactive design draws on the inspiration and intuition of the design practitioner. We have seen how, despite the lustre of proactive design, it is flawed because most daily design work is evolutionary, and also because technology and society shape each other in unanticipated ways.

Reactive and proactive approaches are both useful implements in a designer’s toolkit. The challenge is combining them in practice to best solve the customer’s problem. A reactive, empirically-grounded method is effective for most design work, and lightweight, low-investment methods such as paper prototyping can help test ideas proactively.

Notes and references

… British Association for the Advancement of Science had pioneered the disbursal of scientific grants … see Snyder, Laura J. The philosophical breakfast club: four remarkable friends who transformed science and changed the world. Random House Digital, Inc., 2011.

… tension between Aristotelian and Empirical approaches … see, e.g., Cushing, J. (1998). Aristotle and Francis Bacon. In Philosophical Concepts in Physics: The Historical Relation between Philosophy and Scientific Theories (pp. 15-28). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139171106.004

… activity of “normal science” as “puzzle-solving” … see Kuhn, Thomas S. “The Structure of Scientific Revolutions. Chicago (University of Chicago Press) 1962.” (1962).

… human development reflects “continuities or mutations” … see Lewis Mumford (1946). Garden Cities and the Metropolis: A Reply. The Journal of Land & Public Utility Economics, 22(1), 66–69. doi:10.2307/3159217. The historian and architecture critic Lewis Mumford published this in reply to an article by Lloyd Rodwin. Mumford disagreed with Rodwin’s statement that “thinking, however imaginative, must reflect continuities, not mutations, if it is to find practical expression”. Hat tip to this Quote Investigator article on the Henry Ford quote.

… the “push and pull” models of innovation … Di Stefano, Giada, Alfonso Gambardella, and Gianmario Verona. “Technology push and demand pull perspectives in innovation studies: Current findings and future research directions.” Research policy 41, no. 8 (2012): 1283-1295.

… society shapes technology (via Durkheim’s social determinism) … researchers have proposed different models for how society determines technology, including social shaping of technology (SST), social construction of technology (SCOT), and actor-network theory (ANT).

… the “cacophony of pimped profiles” that was MySpace … Gehl, Robert W. “Real (software) abstractions: on the rise of Facebook and the fall of MySpace.” Social Text 30, no. 2 (2012): 99-119.

Coding in natural language: let’s start small

The idea of writing a computer program by writing English (or another natural human language) is attractive because it might make coding easier and faster. This article tells the story of my encounter with natural language programming as a graduate student, and the small working system I built. I discuss the idea of context limiting: we can improve the user experience as well as the system’s performance by having clearly delimited boundaries within which the system operates, rather than replacing code with natural language in arbitrary contexts.

Introduction: why program in natural language?

The history of general-purpose programming language design has been a slow march from languages at very low levels of abstraction (i.e., those which expose details of the underlying machine, such as assembly languages) towards so-called ‘higher level’ languages. The purpose of programming language design, it can be argued, is to make the activity of programming as close as possible to the pure expression of intent. That is, to strip away from programming all concerns that are not related to what the programmer is trying to achieve.

Writing in a natural language, such as English, is close to a pure expression of intent. Yes, the mechanisms by which you write, the language you use, and indeed the fundamental properties of the activity of writing themselves offer resistance and shape intent. But in comparison to programming, writing down what you mean in natural language requires little or no conscious consideration of aspects unrelated to what you’re trying to express in that instant.

This raises an obvious question: can we make it possible to write computer programs in natural language? In order to do so, we would need to develop a system capable of reliably translating natural language statements into conventional programming languages. Traditionally, this problem has been seen as an insurmountable challenge, since natural language is so complex and ambiguous, and computer language so precise.

Advances in deep learning architectures and training have brought us closer than ever to realising natural language programming. OpenAI’s Codex and DeepMind’s AlphaCode are capable of generating correct programs, seeded essentially by a natural language prompt. This technology has already been commercialised as a software development tool in the form of GitHub Copilot. The tool acts as a form of advanced autocomplete, generating programs from natural language comments, completing repetitive lists, automatically generating test cases for code already written, and proposing alternative solutions. Though these systems are far from perfect and there is a lot of work to do yet, it’s awesome to see this technology entering the realm of applicability.

But this isn’t an article about the shiny new deep learning technology, interesting as it is. This article is the story of my own little exploration of building a natural language programming system, which took place nearly ten years ago.

A problem with writing statistics code

In 2013, I am in the first year of my Ph.D. on developing better interfaces for data analysis. I’m steeped in data analysis myself, having recently left a full-time job as a data scientist, and also fresh with the experiences of writing statistics code to analyse data for several experiments I ran as a Master’s student.

While writing statistics code in the R programming language, I was frustrated that I had to constantly look up documentation to do very simple things. For example, to generate a sample of random numbers from a standard normal distribution, you need the function rnorm. I would keep forgetting this (is it norm, normal, rnormal, randn?) and have to go look it up. If I wanted to vary the parameters of the distribution (mean and standard deviation), I’d have to look it up.

Despite the fact that I knew there was a function that did exactly what I wanted it to do, and I knew all the data the function needed to do its job (the number of samples, mean, and standard deviation), I was still hindered by not knowing the specific name of the function and the names and order of its parameters (the function ‘signature’, in programming parlance). This problem plagued me ceaselessly, several times during a programming session, and each time the trip to search the web for documentation and examples would draw my attention away from my core activity, and disrupt my state of flow.

Why couldn’t I just write: “a random sample of size 100, with mean 0 and standard deviation 1”, and have the system generate the code rnorm(100, mean=0, sd=1), thus saving me a very straightforward round of documentation searching?

I had stumbled across a very specific but nonetheless common class of problem encountered by programmers. I didn’t give it a name in 2013, but I shall do so now (mostly for convenience of reference, but perhaps a little for vanity). I call it the familiar invocation problem. A programmer is facing the familiar invocation problem when their situation has the following properties:

  1. They know a function exists that will solve their needs
  2. They have all the information (arguments) the function requires
  3. However, they cannot recall the function signature (name and order of arguments)
  4. Nonetheless, they can verify by sight whether a specific bit of code is what they needed. That is, they must already be familiar with the usage of the function; they have used it or looked it up before.

Criteria 1 and 2 are knowledge prerequisites; criterion 3 introduces the problem. It is criterion 4, the familiarity criterion, that really makes this entire approach plausible: being able to recognise correct solutions and identify incorrect solutions from memory is what will save the programmer the trip to the browser to look up documentation.

Rticulate: my natural language programming system from 2013

Having identified the familiar invocation problem, I set about building a proof of concept. I started precisely in the domain that had kindled my frustration: statistical programming in the R language. I named the system Rticulate, pronounced ‘articulate’.

Rticulate is a simple mechanism. I built an annotated dictionary of R functions. For each function, this dictionary contained the function’s name and the number, names, and types of its parameters, but it also contained synonyms and related terms for each. So, for example, the entry for rnorm contained the related words “random”, “normal”, and “distribution”, among others. While I initially built this dictionary by hand, I proposed that the process could be automated by mining documentation, as well as the words people use to describe the function on fora such as Stack Overflow.
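In code, an entry in such a dictionary might look something like the following Python sketch. The structure, field names, synonym lists, and defaults here are my illustration of the idea, not the original Rticulate format:

```python
# Hypothetical sketch of Rticulate-style dictionary entries.
# Field names, synonyms, and defaults are illustrative only.
function_dictionary = {
    "rnorm": {
        "synonyms": ["random", "normal", "distribution", "sample", "samples"],
        "params": [
            {"name": "n", "type": "int",
             "synonyms": ["samples", "size", "observations"], "default": 1},
            {"name": "mean", "type": "float",
             "synonyms": ["mean", "average"], "default": 0.0},
            {"name": "sd", "type": "float",
             "synonyms": ["deviation", "sd", "spread"], "default": 1.0},
        ],
    },
    "log": {
        "synonyms": ["logarithm", "log"],
        "params": [
            {"name": "x", "type": "float",
             "synonyms": ["value", "number"], "default": None},
            {"name": "base", "type": "float",
             "synonyms": ["base"], "default": 2.718281828},
        ],
    },
}
```

Each function carries its own vocabulary, so a query never needs to mention the function by name to be matched to it.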

Example entries from the Rticulate dictionary. Here three entries are shown, for the functions choose, rnorm and log. For each function and its arguments, the dictionary contains a type and synonyms. Arguments additionally have default values.

When the user enters a query, the system first matches the query to a function in the dictionary. This is, again, implemented quite simply: it treats both the query and each dictionary entry as a bag of words, and looks for the function that has the most terms in common with the user query.
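A minimal sketch of this matching step in Python (the toy dictionary and its word lists are illustrative, not the actual Rticulate data):

```python
def match_function(query, dictionary):
    """Return the function whose name and synonyms share the most
    words with the query: a simple bag-of-words overlap."""
    query_words = set(query.lower().split())
    best_name, best_overlap = None, 0
    for name, synonyms in dictionary.items():
        overlap = len(query_words & (set(synonyms) | {name}))
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name

# Toy dictionary: synonym lists are illustrative only.
toy_dictionary = {
    "rnorm": ["random", "normal", "distribution", "sample", "samples"],
    "rbinom": ["random", "binomial", "distribution", "trials"],
    "log": ["logarithm", "log", "base"],
}

print(match_function("a random sample of size 100 from a normal distribution",
                     toy_dictionary))  # → rnorm
```

The overlap count is crude (no stemming, no weighting of rare words), but with a function-specific vocabulary it is often enough to single out the right candidate.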

Once the target function is found, the next challenge is to match values in the query to the intended arguments. Consider the query: “5 normally distributed samples with mean 6 and deviation 4”. This needs to be resolved to rnorm(n=5.0, mean=6.0, sd=4.0). The system scans the query to identify likely argument values (5, 6, 4) and likely references to arguments (“samples”, “mean”, “deviation”). Next, it matches likely values to likely parameters, using an optimisation algorithm to find an assignment that minimises the total distance from each likely value to its likely parameter. In the example above this is straightforward: the words “mean” and “deviation” are right next to the values 6 and 4. The word “samples” is equally distant from 5 and 6, but if we assigned 6 to “samples”, we’d have to assign either 4 or 5 to “mean”, thus greatly increasing the overall distance of the assignment.

The optimisation process relies on a variant of the proximity principle, a linguistic idea which loosely states that words that are related to each other should appear close to each other in text. Here I interpret it as “argument names should appear close to their values”. It seems like a huge oversimplification, but it works quite well in practice, and the proximity principle is actually the basis of a lot of successful natural language and information retrieval systems.
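The value-to-parameter assignment can be sketched as a small brute-force optimisation over token positions. This is my reconstruction of the idea, not the original implementation:

```python
from itertools import permutations

def assign_arguments(query, param_synonyms):
    """Assign numeric values in the query to parameters by minimising the
    total token distance between each value and a mention of its parameter."""
    tokens = query.lower().replace(",", " ").split()
    # token positions that look like numbers
    values = [(i, float(t)) for i, t in enumerate(tokens)
              if t.replace(".", "", 1).lstrip("-").isdigit()]
    # position of the first synonym mention for each parameter
    mentions = {p: next(i for i, t in enumerate(tokens) if t in syns)
                for p, syns in param_synonyms.items()
                if any(t in syns for t in tokens)}
    best, best_cost = None, float("inf")
    for perm in permutations(values, len(mentions)):
        cost = sum(abs(pos - mentions[p]) for (pos, _), p in zip(perm, mentions))
        if cost < best_cost:
            best = {p: val for (_, val), p in zip(perm, mentions)}
            best_cost = cost
    return best

print(assign_arguments("5 normally distributed samples with mean 6 and deviation 4",
                       {"n": ["samples"], "mean": ["mean"], "sd": ["deviation"]}))
# → {'n': 5.0, 'mean': 6.0, 'sd': 4.0}
```

On the example from the text, the minimum-distance assignment recovers rnorm(n=5.0, mean=6.0, sd=4.0): any other pairing of the values 5, 6, and 4 with the mentions of “samples”, “mean”, and “deviation” increases the total distance.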

If the system fails to match a parameter, it uses a default value. Some R functions already have default values for their parameters, and the Rticulate dictionary inherits these values. For others, Rticulate adds new default values. This was my attempt to make the system return useful output in more cases, and mitigate the need for criterion 2 (i.e., that the user needs to know parameter values to use the system).

Examples of natural language statements and their corresponding function invocations as generated by the Rticulate prototype. Note how the binomial distribution query does not contain a value for the size parameter, but due to universal default values, Rticulate can return a complete invocation of the function rbinom.

In a report written early during my Ph.D., I wrote:

Rticulate is motivated by the increasingly common phenomenon of “Google engineering”, i.e., the process of programming by searching for snippets of code online. Formulating a search query that yields the appropriate result is a difficult task for novice programmers. Moreover, many novice R programmers are also learning R in conjunction with learning statistics, as a first programming language. The fact that R is not designed to be a first programming language makes the learning curve much steeper than necessary. Thus, Rticulate aims to provide a natural language interface for the R programming language.

The Rticulate prototype is currently capable of taking free-form natural language input and formulating a function invocation that it believes best represents the input, drawing on a small, manually-annotated function dictionary.


The techniques it uses to formulate this invocation are fairly simplistic, and there is no sophisticated natural language processing being used. Furthermore, it does not integrate at all with the R language; once Rticulate has produced a function invocation it must be manually copied into R. Furthermore Rticulate must be manually made aware of environmental variables and their types so that it can extract references to them from input sequences. Despite these limitations, even at the current stage the prototype provides a compelling demonstration of the potential utility of such a tool as a learning scaffold for the R language.

I was pretty excited about this line of work, but ultimately it was abandoned in pursuit of other projects. In that same report, I proposed four systems, but abandoned one entirely, took one forward only partially, and only seriously developed the other two.

Ph.D.s are often like that. It is rarely the case that you end up doing what you set out to do. You can set your sights on, and direct your attention towards, a certain goal. But then your research agenda is buffeted by practical constraints, unexpected hurdles, serendipitous opportunities, new developments in the field, and indeed, emergent findings from your research itself. This is not just a property of Ph.D.s but of all academic research. As Einstein reputedly put it, “If we knew what it was we were doing, it would not be called research, would it?”

Natural language programming: from all context to small context

During my Ph.D., I had to end my explorations of natural language programming early, partly because of time constraints, but also because I became increasingly enchanted with the power of interactive data visualisations, which I saw as potentially solving a much wider range of problems.

Nonetheless, my interest in natural language as an alternative programming representation endures. I occasionally find myself in a position to study it. A recent investigation led by a student of mine advances the discussion on a question I always had: whether it is better to have “full” natural language in such systems, or only a reduced subset of natural language (e.g., restricted vocabulary and grammar), both to make the interpretation of the statement easier on the system and to make the experience more predictable and consistent for the user. In our experiment, we found several benefits to using a reduced subset of natural language and provide evidence to suggest that “full” natural language may not be ideal in many cases. If you’re interested in learning more, there’s a blog post and a paper.

What both Rticulate and my student’s work have in common is the limiting of context as a technique. Rticulate limited the scope of the system specifically to familiar function invocation. The later work experimented with limiting vocabulary and grammar. Human language is saturated with context that enables us to disambiguate between many possible interpretations of an utterance. In comparison, there’s far less common ground between us and a machine interpreter.

When systems purport to do everything and understand anything, users encounter a host of problems. Research by my colleagues finds that the experience can be “like having a really bad PA”. The new code generation systems I mentioned at the start of this article can, at present, suffer from these same problems. The power of deep learning systems like Codex makes my tiny, laboriously hand-coded attempts with Rticulate look laughably quaint by comparison, but I believe that Rticulate’s simplicity and focus on solving a very specific problem would still be a strong selling point for such a system today.

The core idea behind context-limiting is to start small, both to reduce the uncertainty the user faces when dealing with a natural language system, and to improve the performance of the algorithm. Start either with a limited set of contexts in which the system can operate, or with a limited language of instruction, or both. In due course, perhaps, we can find a way to create sufficient common ground between programmer and algorithm that we can “converse” with the same ease as we do with human interlocutors. However, that is still a few years (at least) away.

Closing reflection: research is a really long game

I don’t mean to give the impression that by building a system all these years before the current renaissance, I somehow invented the idea of natural language programming or of context limiting. The idea of programming in natural language has been around for as long as programmable electronic computers, perhaps longer. There has been a lot of great work on this topic. Debates about the suitability of natural language as a programming language, and ways in which we might solve the apparent technical and interactional challenges (including context limiting), have been floating around since at least 1966. Even Edsger Dijkstra waded in, in his typical curmudgeonly fashion, to state his position on the topic (surprise, he hates it).

Bill Buxton postulates the long nose of innovation: “any technology that is going to have significant impact in the next 10 years is already at least 10 years old”. The history of several important technologies such as the mouse, RISC processors, and capacitive multitouch, involved an incubation period of 20-30 years between the invention of the technology and its first consequential application. The challenge of innovation, Buxton argues, is not (just) in the invention: it is the refinement, persuasion, and financing that follows which will determine its success. He gives the following metaphor:

The Long Nose redirects our focus from the “Edison Myth of original invention”, which is akin to an alchemist making gold. It helps us understand that the heart of the innovation process has far more to do with prospecting, mining, refining, goldsmithing, and of course, financing.

Knowing how and where to look for gold, and recognizing it when you find it is just the start. The path from staking a claim to piling up gold bars requires long-term investment, and many players. And even then, the full value is only realized after the skilled goldsmith has crafted those bars into something worth much more than their weight in gold.

Thanks to advances in the technology of generative models, the long nose of natural language programming is finally beginning to poke into the present. For researchers at the intersection of human-computer interaction, artificial intelligence, and programming languages (like myself), this is a tremendously exciting time. There are still many open questions about the experience of interacting with these generative models, and a lot of goldsmithing ahead of us.


Arawjo, Ian. “To write code: The cultural fabrication of programming notation and practice.” In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-15. 2020.

Ong, Walter J. Orality and literacy. 1982.

Sarkar, Advait. Interactive analytical modelling. No. UCAM-CL-TR-920. University of Cambridge, Computer Laboratory, 2018.

Sarkar, Advait, Neal Lathia, and Cecilia Mascolo. “Comparing cities’ cycling patterns using online shared bicycle maps.” Transportation 42, no. 4 (2015): 541-559.

Sarkar, Advait. “The impact of syntax colouring on program comprehension.” In PPIG, p. 8. 2015.

Csikszentmihalyi, Mihaly. Flow: The psychology of optimal experience. New York: Harper & Row, 1990.

Givón, Talmy. “Iconicity, isomorphism and non-arbitrary coding in syntax.” Iconicity in syntax (1985): 187-219.

Behaghel, Otto. Deutsche Syntax, vol. 4. Heidelberg: Winter, 1932.

DiGiano, Chris, Ken Kahn, Allen Cypher, and David Canfield Smith. “Integrating learning supports into the design of visual programming systems.” Journal of Visual Languages & Computing 12, no. 5 (2001): 501-524.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Hunches and Sketches: rapid interactive exploration of large datasets through approximate visualisations.” In The 8th international conference on the theory and application of diagrams, graduate symposium (diagrams 2014), vol. 1. 2014.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Interaction with Uncertainty in Visualisations.” In EuroVis (Short Papers), pp. 133-137. 2015.

Sarkar, Advait, Mateja Jamnik, Alan F. Blackwell, and Martin Spott. “Interactive visual machine learning in spreadsheets.” In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 159-163. IEEE, 2015.

Sarkar, Advait, Martin Spott, Alan F. Blackwell, and Mateja Jamnik. “Visual discovery and model-driven explanation of time series patterns.” In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 78-86. IEEE, 2016.

Mu, Jesse, and Advait Sarkar. “Do we need natural language? Exploring restricted language interfaces for complex domains.” In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-6. 2019.

Luger, Ewa, and Abigail Sellen. “‘Like Having a Really Bad PA’: The Gulf between User Expectation and Experience of Conversational Agents.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5286-5297. 2016.

Halpern, Mark. “Foundations of the case for natural-language programming.” In Proceedings of the November 7-10, 1966, fall joint computer conference, pp. 639-649. 1966.

Sammet, Jean E. “The use of English as a programming language.” Communications of the ACM 9, no. 3 (1966): 228-230.

Dijkstra, Edsger W. “On the foolishness of ‘natural language programming’.” In Program Construction, pp. 51-53. Springer, Berlin, Heidelberg, 1979.

What if charts could control your data?

We typically think of charts as the end result of data analysis. To create a chart in Excel, you must first select some data. To produce a chart in Python or R using charting libraries, you must provide an array, data table or data frame. When William Playfair was inventing line and bar charts in his Commercial and Political Atlas (1786), he conceived of them as ways of visualising economic data to give his readers greater understanding. Florence Nightingale’s beautiful rose diagrams in the 1850s were invented to visualise Crimean war mortalities (and in particular, how most of them were due to preventable disease, and not the war itself) as a rhetorical device in her quest to improve hygiene. In the same decade, John Snow’s cholera maps visualised the locations of disease clusters, helping authorities see the connection to the Broad Street water pump and the tainted water supplied by the Southwark and Vauxhall Waterworks Company.

Nightingale’s rose diagram. Source: Wikimedia commons

The idea that charts must be produced from data, as a way of depicting pre-existing data, is deeply ingrained in our tools and our historical uses for charts. This one-directional movement from data to charts makes intuitive sense, since how can a chart have any meaning in the absence of data? How could we even construct it?

But what if I told you it could go the other way: that the chart could come first, and be used to produce and control data? There are, in fact, some scenarios in which it is possible and useful to do this, and in this article I will share two examples.

Charts can generate data that improves human communication

At companies like BT, data analysts often have conversations with non-analysts. The non-analyst typically has a request. In the simplest cases, the request is just for data; the analyst is asked only to gather it and send it without further analysis. In other cases the non-analyst may request an insight, where the analyst is asked to investigate a specific question or hypothesis. Finally, the analyst might be asked to build a model for use by the non-analyst.

In these conversations, the analyst tries to achieve clarity about the request and create a shared understanding of the work that will be done, outcomes to be expected, and potential problems and limitations. Often the non-analyst’s first request is missing details. In guiding the non-analyst to refine a question, the analyst shares and exercises their domain and statistical knowledge.

The problem is that the data isn’t always available during these discussions. In large organisations, the data relevant to a particular request might be spread across many different files, tables, databases, and reside with different users and teams. It might require “cleaning”: analyst-speak for dealing with incorrect values, duplicates, missing values, and other data quality problems. Some data might require requisition forms and ethics approval processes before the analyst can even look at it. There are therefore many overheads in finding the right data to start analysis.

Not having data to look at can cause the discussion between analyst and non-analyst to suffer. It is much easier to achieve a shared understanding of the problem and the work that needs to be done if both participants in the conversation can see a common visual aid.

We envisioned a tool with which both participants could sketch out and produce data during the conversation, to facilitate discussion. The analyst can construct a chart by dragging, dropping, and editing individual shapes called “kernels” onto a shared canvas. Think of kernels as building blocks of different shapes, such as lines, periodic waves and peaks, that can be put together to create a dataset with any shape you want. As the analyst builds a chart, we generate data to match. Meanwhile, the non-analyst has access to annotations: they can draw arrows, speech bubbles, lines, and circles on the canvas to make their questions more specific to a certain aspect of the data. As the analyst shapes the dataset over the course of the conversation, and the non-analyst adds annotations, snapshots of the canvas are saved. These snapshots serve as a “graphical history” of the conversation, and enable users to reflect on how their thinking has developed, as well as refer and return to earlier ideas.

The tool in use. (A) The chart with the generated data points (B) The tool panel containing building block shapes
and annotations (C) The shape editor (D) The time axis (E) Snapshots of previous versions. Source: Mărășoiu et al., 2016
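The kernel-composition idea described above can be sketched in a few lines of Python. The three building blocks here (a linear trend, a periodic wave, and a localised peak) and their parameters are hypothetical illustrations, not the tool's actual kernel set:

```python
import math

# Three hypothetical kernels: a linear trend, a periodic wave,
# and a localised (Gaussian-shaped) peak.
def line(slope, intercept):
    return lambda x: slope * x + intercept

def wave(amplitude, period):
    return lambda x: amplitude * math.sin(2 * math.pi * x / period)

def peak(height, centre, width):
    return lambda x: height * math.exp(-((x - centre) / width) ** 2)

def compose(kernels, n_points=100):
    """Sum every kernel at each x position to synthesise a dataset."""
    return [sum(k(x) for k in kernels) for x in range(n_points)]

# Sketch of "sales rise steadily, vary seasonally, and spike near x = 80".
data = compose([line(0.5, 10), wave(5, 12), peak(20, 80, 3)])
```

In this sketch, dragging a kernel onto the canvas corresponds to appending it to the list, and editing its parameters regenerates the matching data immediately.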

In a study, we asked analysts to use this tool to create datasets based on short textual descriptions of the data. We found that participants were able to use the tool quickly and that the kernel composition approach was intuitive. Moreover, participants really saw the value of such a data sketching tool in these conversations, saying, for example: “One of the biggest challenges I have in communicating to a client is to create data to try and describe what I’m expecting to see, so actually being able to manipulate the dataset like that is a really useful concept.”

Controls on charts can control the underlying data

Charts can also help us analyse large datasets, where the sheer size of the data means we cannot analyse all of it precisely and must instead resort to approximation. Analysing very large datasets can be challenging due to the hardware requirements involved. While modern consumer laptop and desktop computers are very fast, they are often still not powerful enough to give interactive performance on large datasets.

Consider the owner of a small business, perhaps a café, trying to look at 5 years of point-of-sales data to figure out which type of coffee is selling the most. If she sold 200 items a day, 300 days a year, that’s 300,000 data points to analyse. It is a large dataset to store, and it will take a long time to compute the answer on a consumer PC, perhaps several minutes. Now imagine if that wasn’t the only question she had about this data, but she was trying to interactively analyse it, by charting different views and drafting quick formulas. Such a workflow would be made impossible by a several-minute-long delay after each keypress or mouse click.

Large companies work around this by buying powerful hardware, storing their data in specialised database software in large data centres, and hiring expert data management professionals. But those solutions are out of reach for most small businesses and private individuals, who have a spreadsheet, a laptop, and a question.

There is a technique that can help: approximation. A family of algorithms known as probabilistic algorithms can provide near-instant results in exchange for a small, quantifiable chance of error, and these can be used to answer certain types of analytical questions. A more general-purpose approximation technique is sampling. Consider the café owner we met earlier. Instead of trying to analyse all 300,000 points, she could instead select a random sample of, say, 3,000 data points and compute the best-selling coffee in that sample. The estimate she would achieve this way does have a small chance of error, but through statistics it is possible to quantify the margin of error. She could repeat the process with multiple samples, or take a larger sample if she was unsatisfied with the level of accuracy in the estimate. This may not be necessary: if the estimate shows that the best-selling coffee vastly outsells the second-best-selling coffee, as is often the case in real-world data, then the difference between the two proportions is likely to be outside the error margin.
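A minimal sketch of the sampling approach, using hypothetical sales figures in which a latte genuinely accounts for 40% of orders:

```python
import math
import random

random.seed(0)  # for reproducibility of the sketch

# Hypothetical point-of-sale log: 300,000 coffee orders.
population = (["latte"] * 120_000 + ["cappuccino"] * 90_000
              + ["espresso"] * 60_000 + ["mocha"] * 30_000)

# Instead of counting all 300,000 rows, estimate from a 3,000-row sample.
sample = random.sample(population, 3_000)
p_hat = sample.count("latte") / len(sample)

# 95% margin of error for a sampled proportion: 1.96 * sqrt(p(1-p)/n).
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))

print(f"latte share is roughly {p_hat:.3f} +/- {margin:.3f}")
```

With a sample of 3,000, the margin works out to under two percentage points; since the true best-seller is well clear of the runner-up here, the sample settles the question without touching the other 297,000 rows.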

In practice, using approximation techniques is difficult, both because consumer analytics tools do not support them very well, and because the techniques have complex interfaces and many parameters. In order to use these techniques, the analyst needs to be able to assess the uncertainty in the estimate and use that assessment to decide what they will do next, whether they will seek to refine the estimate or whether it is good enough for their current purposes. Estimates of uncertainty are often visualised as error bars on scatter plots and bar charts, which are lines that indicate a window within which the value is likely to fall. The larger the window, the greater the uncertainty.

To make these techniques easy to understand and use, we thought of using the error bars themselves as a mechanism for controlling the estimation process. In our interface, the user drags the ends of error bars when they wish to reduce the uncertainty associated with a particular estimate. When dragging, a horizontal indicator appears, showing how long it will take to recompute. This “resource cost estimation” bar allows the user to judge whether they are willing to invest their resources (in this case, time) in exchange for an improvement in accuracy.

The user can request a specific amount of uncertainty by dragging the bars varying amounts. When they reach the desired level of uncertainty, they stop dragging, which starts the recomputation. While recomputation is performed, the cost estimation indicator shrinks to reflect the time elapsed, and the point and its error bars move to their newly accurate position.

A sequence of images showing how the interface can be used to reduce the uncertainty associated with a point. Source: Sarkar et al., 2015
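The trade-off that the resource cost indicator communicates follows from the fact that the margin of error shrinks with the square root of the sample size: halving the width of an error bar roughly quadruples the computation. A minimal sketch, assuming a sampling-based estimate of a mean and a hypothetical fixed scan rate:

```python
import math

def required_sample_size(sample_sd: float, target_margin: float,
                         z: float = 1.96) -> int:
    """Sample size needed for the 95% error bar half-width,
    z * sd / sqrt(n), to shrink to the dragged target."""
    return math.ceil((z * sample_sd / target_margin) ** 2)

def cost_estimate(n: int, rows_per_second: float = 50_000) -> float:
    """Hypothetical resource-cost bar: recomputation time in seconds,
    assuming rows are scanned at a fixed rate."""
    return n / rows_per_second

# Halving the target margin roughly quadruples the required sample,
# and hence the recomputation time shown on the cost bar.
n_coarse = required_sample_size(sample_sd=12.0, target_margin=1.0)
n_fine = required_sample_size(sample_sd=12.0, target_margin=0.5)
```

This inverse-square relationship is what lets the interface translate a drag distance directly into a time cost before any recomputation begins.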

We ran an experiment where 39 participants without formal training in statistics each used the interface to answer a series of simple statistical comparison questions. In each question, two points with error bars were given, and the participant was asked to describe the relationship between them: whether one was higher than the other, whether they were equal, or whether it was not possible to tell. The error bars were draggable, so the participants could refine the estimates before giving an answer.

We found that most participants successfully discovered the drag operation without it being explained beforehand; from this we can conclude that the interface corresponds well with users’ prior assumptions about how one might manipulate uncertainty in a visualisation. We found that most participants used drag operations throughout the experiment to give reasonable and correct answers about the data. Importantly, they gave answers that they could not have given without using the interface, because the initial uncertainty was too high to draw certain kinds of inferences, until the participant deliberately reduced it.

We also varied the estimated recomputation time so that in some tasks, it would take as little as 3 seconds to recompute the point with perfect certainty, but in others it would take up to 30 seconds. We found that this duration was negatively correlated with the amount that participants would choose to reduce uncertainty. Thus, the interface successfully communicated to users the implications of dragging the error bars by differing amounts, and led them to make informed decisions about how much computation they were willing to spend in refining their estimates.


We typically think of charts as coming after data, and as a way to visualise data that already exists, but that approach can be limiting. Charts can in fact be used to produce data, and also as a way to help control data. Our experiments have shown how data-producing charts can help analysts have better conversations with their colleagues, and how data-controlling charts can help non-experts interact with very large datasets on consumer hardware using sampling and approximation techniques.

And now, a summary poem:

Can there be count without account?
Or sense without amount?

But a baseless word is oft misheard
And time exchanged for smaller range.

We can know before we see,
and see before we know.


Mărăşoiu, Mariana, Alan F. Blackwell, Advait Sarkar, and Martin Spott. “Clarifying hypotheses by sketching data.” In Proceedings of the Eurographics/IEEE VGTC Conference on Visualization: Short Papers, pp. 125-129. 2016.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Interaction with Uncertainty in Visualisations.” In EuroVis (Short Papers), pp. 133-137. 2015.

The fundamental value of the metaverse is sensory misdirection, not replication

The “metaverse” is the collective marketing term for a set of virtual reality media experiences. It is accessed using headsets such as the Oculus Quest, Valve Index, and HTC Vive. It is often presented in marketing materials as newly enabling the digital replication of physical space, despite the fact that this has always been possible using normal (i.e., not head-mounted) displays, and the replication is often a worse experience. The metaverse is also being used, maliciously, to promote cryptocurrency scams to uninformed consumers.

I argue that we are missing the true promise of virtual reality. By doing one thing in the physical world while seeing another in the virtual world, we can trick our brains with sensory misdirection. We can exercise more, eat more healthily, and become more imaginative. In this article I share a few examples from the bounty of scientific evidence showing how virtual reality can hack our brains for our benefit.

What is the metaverse?

Think of it as a new web, but for virtual reality and accessed through virtual reality devices such as Facebook’s (i.e., “Meta’s”) Oculus headset. It is a collective term for the set of connected experiences we might have through these devices. Note: the term is not well defined. It is not a scientific term and is not widely used in serious academic research on virtual and augmented reality. It is a colloquial, commercial, marketing term, promoted by a set of corporations.

The term comes from Neal Stephenson’s 1992 science fiction novel “Snow Crash”, in which the metaverse is a virtual space, using the metaphor of the physical world, where people interact with each other as virtual avatars. A single firm holds a monopoly over the virtual ‘real estate’, which is available to be bought and developed.

Stephenson’s vision of the metaverse, while extraordinary and fascinating, appears to have had two problematic influences on how we conceive of the value of VR: the focus on recreating the real world, and the opportunity for capitalistic exploitation.

Problem 1: the way the metaverse is presented places a misleading emphasis on the replication of the physical world

Many large companies have publicly affirmed their serious commitment to building the metaverse, including NVIDIA (which makes graphics processors for computers), Epic (which develops the Unreal game engine and the popular game Fortnite), and of course Facebook (now Meta). But what exactly is it that they envision we will do in the metaverse?

Well, mostly meet and hang out. Facebook has presented a VR meetings app for work, a social space, and a VR chatting app. Video envisionments of the metaverse depict social meetings, glitzy clubs, marketplaces, and large social gatherings such as concerts. NVIDIA imagines that “we will buy and own 3D things like we buy 2D songs and books today. We will buy, own, sell homes, furniture, cars, luxury goods and art in this world. Creators will make more things in virtual worlds than they do in the physical world.”

This sounds great, except it makes it seem like these visions are new and unique to the metaverse. They are neither new nor unique. All these applications are already possible, and indeed thriving, in non-VR settings. Second Life, which has been running since 2003, has been used for art exhibitions, live music and theatre, religious meeting houses, scientific collaboration, virtual workplaces, as an educational platform, and even hosts embassies for several countries including Sweden, Estonia, and the Philippines. It has a thriving virtual economy, estimated to be worth over half a billion US dollars, based on trading digital goods and services in its digital currency (the “Linden Dollar” — at the time of writing, a US dollar is worth 320 Linden dollars). Second Life is far from the only example. Jagex’s 2001 game RuneScape also developed social uses and its own economy. For contemporary examples one need look no further than the spectacular successes of Minecraft, Roblox, and Fortnite at creating versatile digital spaces for people to commune.

Socialising in Second Life. Source: Wikimedia Commons

All these pre-existing platforms look and work exactly like the visions of VR worlds presented to us in metaverse promotional videos. It’s just that we look at them on a screen on our desks instead of on a screen strapped to our heads.

One can object that strapping the screen to your head is not a trivial difference, that VR creates a different level of immersion, and that is indeed true. However, do these applications demonstrate a fundamentally new and unique quality of virtual reality technology being put to good use? I would argue that they do not. First, VR being ‘more immersive’ is a difference in degree, and not in fundamental quality. It is envisioning a use case for VR as a ‘faster horse’, not an automobile.

Second, VR is fundamentally worse than screens in many situations. You can do things with screens that you can’t do in VR. You can use screens for long periods of time without fatigue. You can use multiple screens simultaneously for different purposes. You can retain peripheral awareness of things not in VR, such as keeping an eye on a dog or a child or a bubbling pot. You can engage in intellectual activity without coupling it to physical (in)activity, by using a treadmill desk, or going for a walk outside while taking a phone call. You can use screens while using other things: reading a physical book while taking digital notes, playing the guitar while watching a tutorial, or watching the TV while knitting.

Yes, some limitations may be mitigated with better VR and AR technology, but it may take years and there is no guarantee. Moreover, some experiences have much better expressions in digital space than the mere replication of the physical interface. Email in VR is going to look exactly like email on your screen. Reading documents and webpages will be the same. Watching videos will be the same. There is no point reinventing these experiences with physical metaphors. One can argue that there’s nothing stopping us from recreating screen-like interfaces in VR either, but what is the point of strapping a screen to your head in order to recreate (badly) the experience of a screen on your desk?

Problem 2: the metaverse places a problematic emphasis on crypto-speculation

Perhaps unsurprisingly, the metaverse has turned into a gold rush, providing plenty of ammunition for scams and schemes. Conflation and confusion are the tools of the con man’s trade. And what better confusion than cryptocurrency, NFTs, DAOs, and Web3?

Consider Decentraland. It is a virtual world where people pay real money (like, nearly USD $1,000,000) for digital ‘parcels of land’ on which they can construct buildings and offer services. The Wikipedia article for Decentraland has this delightful description: ‘Users may buy virtual plots of land in the platform as NFTs via the MANA cryptocurrency, which uses the Ethereum blockchain.’

NFTs! Cryptocurrency! Blockchain! The density of these buzzwords alone is enough to make one wonder whether it makes sense to ‘invest’ in ‘land’ in the metaverse. Of course, it does not. You can build a pretty virtual building on it, and it therefore has recreational value, but in that sense you have bought a (very expensive and bad) game, not land. You can’t live on that ‘land’. You can’t grow crops on it, or raise livestock, or mine it. And it can disappear if anything happens to Decentraland or its servers. Besides not-land, you can spend real money on other not-things, including clothes, accessories, and even names. Not nice names either: the community failed to vote against allowing the character name ‘Hitler’, and at one point the name ‘Jew’ was on sale for USD $362,000.

So why do people spend this kind of money on literally not-things? Because they believe they may ultimately turn a profit by duping someone else into spending more. It is transparently a game of greater fools. While Decentraland is currently used primarily by a small number of hobbyists and speculators, it sets a problematic precedent. There is no real scarcity of such digital assets. Any company can start a new Decentraland, offering slices of not-stuff for whatever you’re willing to pay.

The game-like quality of these experiences makes them attractive to casual consumers, and the veneer of digital ‘goods’ makes them seem like legitimate marketplaces. Further legitimacy is conferred when familiar names like Sotheby’s wade in to ‘invest’ too. But make no mistake: this is gambling and speculation in disguise. And when you can enter at any price, gambling becomes a way of parting people of all socioeconomic strata from their money, but especially takes a toll on the poor and working class.

Dear reader, a word of advice: if someone is trying to sell you something, and you hear the word ‘blockchain’, run.

Sensory misdirection is the fundamental value proposition of virtual reality

So far we have seen that VR cannot add fundamentally new value to our lives through the applications typically presented (socialising, trading, and speculation), and may even be a step backwards. What is VR good for then?

It turns out that hijacking the entire visual field allows us to play some pretty impressive tricks with the brain. Academic research has explored and demonstrated that virtual reality can be used to help you exercise more without becoming fatigued, eat less and still be satisfied, rehabilitate patients with degenerative illnesses, and even help children with autism develop their sense of imagination in play. How is this possible?

VR can help us exercise better

Researchers at the University of Sydney conducted an experiment where participants played a variety of different virtual reality exercise games (‘exergames’) while having their heart rate monitored. These included Fruit Ninja, where the objective is to slice through falling fruit using a virtual sword; Hot Squat, where you squat your way repeatedly through an obstacle course; Holopoint, where you are an archer shooting enemies coming at you from all directions and dodging their projectiles; and Portal, where you move from room to room solving puzzles.

If you have never played a VR game before, it might help to know that playing VR exergames is not like playing traditional computer games. You move in the physical world in a fashion similar to the activity you are doing in the virtual world. To slice a falling fruit, you swing your arm through the air. To dodge an enemy arrow, you step aside or lean back.

I have no idea what this game is, but it looks like a good workout. Source: Flickr.

After participants played these games, they completed a questionnaire about their perceived exertion, i.e., how hard they thought they had exercised. The researchers then compared this subjective self-assessment of exertion to their heart rate readings, an objective measure of their true exertion. They found that for Hot Squat, which was relatively monotonous and engaged the large muscles of the legs, participants felt they were exercising harder than they actually were. But for Fruit Ninja, Holopoint and Portal, which engaged the body more holistically and which offered interesting and varied movements, participants underestimated how much they had exercised. Moreover, they reported enjoying the exercise more. Other studies have also found similar results: reduced perceived effort, increased enjoyment.

Virtual reality exercise requires physical and cognitive work simultaneously, and therefore promotes neuroplasticity. Several studies show that VR exergames are better at promoting cognitive gains than isolated physical activity. This result has been found across different age groups. It has been found both in clinical settings, where the exergame is being used as part of a physiotherapy or rehabilitation program, as well as non-clinical settings, i.e. everyday exercise.

VR can help us eat better

Our experience of eating is deeply bound up in what we see. The feeling of satiety, or fullness after a meal, is ambiguous. VR can take advantage of this.

Researchers at the University of Tokyo ran an experiment where participants ate cookies in the real world, and were shown themselves eating cookies in VR. However, the cookies in VR were either shrunk, so that they appeared two-thirds the true size, or enlarged, so that they appeared one and a half times the size. Participants were asked to eat until they felt satisfied. When participants were seeing themselves eat enlarged cookies, they ate significantly less than when they were seeing themselves eat shrunk cookies, and reported feeling less hungry afterwards.

The sensory misdirection potential of VR in food can go further than changing how full we feel; it can even make us believe we are eating something we are not. Food marketing has long known the importance of colour in consumer preferences. Did you know, for example, that the flesh of commercially farmed salmon is naturally grey, and the pink colour is added artificially to replicate the appearance of wild-caught salmon? Or that 19th century butter manufacturers lobbied for margarine to be dyed pink so that it would be less attractive to consumers?

These tricks can now be played in VR, altering the appearance much more dramatically and without needing to alter the food itself. In another experiment, researchers at ETH Zurich showed participants chocolate cake in VR, while in reality they were eating a lemon cake. Nearly a third of them identified the taste of chocolate in what they had eaten. Another experiment by researchers at Aarhus and York Universities found that participants who saw light coloured coffee in VR while drinking black coffee rated it as creamier in comparison to when they drank the same coffee, but saw black coffee in VR. Researchers at the University of Tokyo found that the illusion can be made even stronger with accompanying olfactory (smell) misdirection: participants eating a plain cookie, but seeing and smelling a different kind of cookie, experienced a change in the cookie’s taste in 80% of the trials.

Materials from the ETH Zurich study. Source: Flickr.

Virtual reality can train the mind

Through diet and exercise, VR can improve our bodies. But, perhaps more profoundly, VR experiences can also improve our minds.

When young children play, they often pretend that objects are something else. A banana is a telephone, a sofa is a fortress, a stick is a sword. Besides object substitution, pretend play can also attribute pretend properties to an object (e.g., the cracks in the pavement are dangerous), or invent imaginary objects (e.g., feeding imaginary food to a doll). Rather than being frivolous or fanciful, pretend play lays important foundations for symbolic and abstract thinking.

Unfortunately, children on the autistic spectrum often find it hard to engage in imaginative play, and this can cause developmental delays or permanent inhibition. My colleague Zhen Bai, with whom I shared an office during my PhD, conducted a fascinating series of experiments showing how an augmented reality system could facilitate pretend play in children on the autistic spectrum.

Zhen built a system that operates like a “magic mirror”. Children sit in front of a screen that shows a video feed from a camera pointed at them, essentially like a mirror. The magic happens when you bring one of a set of special wooden blocks into the field of view. The blocks are plain apart from stickers bearing a small marker code, like a barcode or QR code. The system recognises these markers and overlays the image of something else in the video feed. So the child can hold up a block but see themselves holding a car. A block can become a train, an airplane, a railway station, a school, a bridge, a traffic light, a ball of fire.

Pretend play with the magic mirror. Source: Zhen Bai.

This is not a true VR system; it is more accurately described as AR (‘augmented reality’), and no headsets are worn.

In multiple experiments with different versions of this system, Zhen demonstrated that playing with such an AR system encourages children on the autistic spectrum aged 4-7 to produce symbolic play more frequently, and for longer, compared to equivalent play without computer assistance. The most positive effects were experienced by children with the most developmental delay in symbolic play. With typically developing children too, there were benefits: a variation of the system focused on augmenting virtual characters with emotional states (i.e., giving them faces that looked happy, sad, angry, etc.) was found to promote social symbolic play. In using this system, typically developing children were encouraged to express the emotional states of pretend roles in a social context, explain the cause and effect relationship of emotional states, and communicate with playmates to construct joint pretence.

The applications are not limited to children. Surveys and meta-analyses of dozens of experiments have shown that virtual reality is effective for neurorehabilitation. Virtual reality therapy, when combined with conventional therapy, is more effective in improving the balance of post-stroke patients than conventional therapy alone. It is better for improving the upper-limb function of stroke patients than conventional therapy. It is more effective and faster than conventional therapy for dementia patients. When used in conjunction with conventional therapy, VR therapy can improve the effectiveness of treatment for traumatic brain injury, Parkinson’s, and multiple sclerosis.

In each of these latter applications, the immersive nature of virtual reality is important. Strapping on a headset and replicating a flat virtual screen inside VR is not sufficient to gain these benefits. The fundamental value of immersive VR is that it can directly exploit and affect the wiring of our brains.

Of course, this same power can also be misused. This article aims to showcase some of the more beneficial applications. It is not difficult to imagine opportunists wanting to harness the neurological impact of VR to manipulate its users in the same way that they use the psychology of persuasion and gambling addiction to retain users today.


Virtual meeting rooms, social spaces, nightclubs, land, cities, and economies are fun to imagine, and they may yet become interesting and useful. But ultimately these are games and recreation at best, and scams at worst. Recreation is wonderful, of course, but it is not something fundamentally new that we couldn’t do before the metaverse; it is just a different way of having fun. To me, focusing on these applications, and making them the core of how we present and talk about the metaverse, is completely missing the point of the amazing potential of this technology.

VR exergames can create a gap between actual and perceived exertion that can be harnessed to improve our fitness. We can exercise more while believing we have exercised less, boosting our cognitive function and enjoying ourselves more in the process. I find that amazing.

VR misdirection can create a gap between actual and perceived consumption that can be harnessed to improve our diets. We can eat less while feeling fuller. We can consume foods with fewer calories (e.g., a black coffee) while tasting foods with more calories (e.g., a coffee with cream). I find that remarkable.

VR therapies can create a gap between actual and perceived environments that can be harnessed to improve our minds. We can use VR as a scaffolding to build our imagination, or to create rich sensorimotor stimulation to prevent and reverse the decline of the mind in old age. I find that miraculous.

These are the applications of the metaverse we should be getting excited about. This is what is truly new and fundamentally valuable about VR. This is a new frontier in our relationships with our brains and bodies.


Costa, Marcos Túlio Silva, et al. “Virtual reality-based exercise with exergames as medicine in different contexts: A short review.” Clinical practice and epidemiology in mental health: CP & EMH 15 (2019): 15.

Yoo, Soojeong, et al. “Evaluating the actual and perceived exertion provided by virtual reality games.” Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. 2017.

Mestre, Daniel R., Marine Ewald, and Christophe Maiano. “Virtual reality and exercise: behavioral and psychological effects of visual feedback.” Annual Review of Cybertherapy and Telemedicine 2011 (2011): 122-127.

Zeng, Nan, Zachary Pope, and Zan Gao. “Acute effect of virtual reality exercise bike games on college students’ physiological and psychological outcomes.” Cyberpsychology, Behavior, and Social Networking 20.7 (2017): 453-457.

Ammann, Jeanine, Michelle Stucki, and Michael Siegrist. “True colours: Advantages and challenges of virtual reality in a sensory science experiment on the influence of colour on flavour identification.” Food Quality and Preference 86 (2020): 103998.

Narumi, Takuji, et al. “Augmented perception of satiety: controlling food consumption by changing apparent size of food with augmented reality.” Proceedings of the SIGCHI conference on human factors in computing systems. 2012.

Wang, Qian Janice, et al. “A dash of virtual milk: altering product color in virtual reality influences flavor perception of cold-brew coffee.” Frontiers in Psychology 11 (2020): 3491.

Narumi, Takuji, et al. “Meta cookie+: an illusion-based gustatory display.” International Conference on Virtual and Mixed Reality. Springer, Berlin, Heidelberg, 2011.

Bai, Zhen. Augmented Reality interfaces for symbolic play in early childhood. No. UCAM-CL-TR-874. University of Cambridge, Computer Laboratory, 2015.

Mohammadi, Roghayeh, et al. “Effects of virtual reality compared to conventional therapy on balance poststroke: a systematic review and meta-analysis.” Journal of Stroke and Cerebrovascular Diseases 28.7 (2019): 1787-1798.

Henderson, Amy, Nicol Korner-Bitensky, and Mindy Levin. “Virtual reality in stroke rehabilitation: a systematic review of its effectiveness for upper limb motor recovery.” Topics in stroke rehabilitation 14.2 (2007): 52-61.

Maggio, Maria Grazia, et al. “The growing use of virtual reality in cognitive rehabilitation: fact, fake or vision? A scoping review.” Journal of the National Medical Association 111.4 (2019): 457-463.


I received valuable feedback on drafts of this post from friends and family.


I am affiliated with the University of Cambridge, and with Microsoft, which has interests in AR, VR, and the metaverse. All views in this post and website are mine alone and are written in an independent capacity. This post and website do not reflect the views of any individuals at Microsoft or at the University of Cambridge. This post and website do not reflect the views of Microsoft Corporation, Microsoft Research Ltd., or the University of Cambridge.

How my online gaming addiction saved my Ph.D.

Or, how I cookie-clicked my way to a doctorate in interaction design.

It’s been 5 years since I finished my Ph.D. on user interfaces for machine learning. To celebrate/commiserate, I’m sharing an unusual (if I may say so myself) grad school war story, the story of how sinking hundreds of hours into pointless online games unexpectedly sped up my ability to do research by a factor of 10, perhaps more.

The idle game renaissance of 2013

The year is 2013. The slow-burning Disney super hit Frozen is picking up steam, and soon parents the world over will be terrorised by toddlers in Olaf and Elsa costumes tunelessly belting ‘Let It Go’. Simultaneously, a French programmer known only as Orteil (meaning ‘toe’) releases a game that takes the online gaming world by storm with its absurdity, simplicity and addictiveness.

The game? Cookie Clicker. The objective of the game is simple: get as many cookies as you can. You can click a giant cookie to generate a cookie, but this method doesn’t scale particularly well. Instead, you can spend some of your cookies to hire grandmas to bake cookies for you. Over time, you gain access to cookie farms, factories, and a seemingly endless series of increasingly absurd and productive cookie generation devices.

A game of Cookie Clicker in progress. Don’t look too closely; you may not be able to look away.

A lot of ink was spilled at the time on Cookie Clicker’s addictive nature. Despite its apparent simplicity, there was something incredibly fun and moreish about its core gameplay loop. While it was not the first game in this genre, it is perhaps the most influential. Within weeks, the Internet was awash with hundreds of ‘clicker’ games, ranging from simple re-skins of Cookie Clicker to inventive new interpretations that pushed the genre’s boundaries and articulated the core elements that made it fun. The genre eventually came to be known as ‘incremental’ or ‘idle’ games; it developed an active Reddit community and even inspired dozens of research papers.

Down the rabbit hole

One of the peculiar and interesting aspects of this game is that it rewards you for not playing it. Often, the next cookie purchase requires several orders of magnitude more cookies than you are currently producing, so you have no option but to leave the game alone, sometimes for days, before your cookie factories churn out enough cookies for you to unlock the next level of progress. This is why it is called an ‘idle’ game.

One of the interesting aspects of this game is that it rewards you for not playing it.

However, the game is clever; it provides multiple potential pathways towards your next goal, and small decisions you make can compound, making the difference between an overnight wait or a wait of several days. Calculating the best way to allocate your limited resources before leaving the game to run overnight becomes a complex mathematical puzzle with many moving parts. Catnip for nerds.
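The arithmetic behind those decisions follows a simple pattern. Here is a minimal sketch: the 1.15 growth factor matches Cookie Clicker's building prices, but the function names and the payback heuristic are illustrative, not taken from the game's code.

```javascript
// Each successive copy of a building costs more than the last:
// price = baseCost × growth^owned, rounded up. With growth = 1.15,
// the tenth grandma costs roughly four times the first.
function nextCost(baseCost, owned, growth = 1.15) {
  return Math.ceil(baseCost * Math.pow(growth, owned));
}

// A common player heuristic: buy whichever building pays for itself
// fastest, i.e. has the lowest cost per cookie-per-second gained.
function paybackSeconds(cost, cpsGained) {
  return cost / cpsGained;
}
```

Because every purchase inflates the price of the next one exponentially, small differences in buy order compound into exactly the kind of overnight-versus-several-days waits described above.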

I fell pretty hard for Cookie Clicker. I had it running on my computer 24×7 for months, tending to it every few hours. I visited the Cookie Clicker subreddit every day, where people would discuss strategy, post tips, and share the results of experiments. A cottage industry of browser add-ons emerged that allowed you to automate certain aspects of gameplay, track cookie production statistics, and compute in-game decision formulas. The game was written in the web programming language JavaScript, and so were these add-ons.

By playing with and tweaking these add-ons I inadvertently learned a lot about how programming for the browser works. I was even able to track down the root cause of an issue where I wasn’t generating cookies as quickly as my calculations anticipated. It turns out that Google Chrome throttles JavaScript timers in background tabs, meaning that in order for my add-on to auto-click as fast as it was supposed to, the tab needed to be visible in the foreground. I plotted a graph of the phenomenon and presented my research to the reddit community, adding to the growing body of knowledge about this game (and my own growing body of knowledge about programming for the browser).
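The throttling itself is easy to demonstrate. In a sketch along the lines of that experiment (the function name and numbers here are mine, not from any actual add-on), you record the timestamps at which a setInterval callback actually fires and compare the gaps against the requested interval. In a foreground tab the drift is near zero; in a background tab, where Chrome clamps timers to roughly once per second, it balloons.

```javascript
// Given the requested interval and the timestamps (in ms) at which the
// callback actually fired, return the average extra delay per tick.
function averageDrift(requestedMs, firedAt) {
  let totalExtra = 0;
  for (let i = 1; i < firedAt.length; i++) {
    totalExtra += (firedAt[i] - firedAt[i - 1]) - requestedMs;
  }
  return totalExtra / (firedAt.length - 1);
}

// In a browser, the timestamps would be collected something like this:
// const firedAt = [];
// setInterval(() => firedAt.push(performance.now()), 100);
```

A foreground tab firing on schedule gives a drift near zero; a backgrounded tab whose 100 ms timer is clamped to 1000 ms shows an average drift of around 900 ms per tick, which is the plateau my graph showed.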

By playing, I inadvertently learned a lot about how programming for the browser works.

My failed attempt at making a game

At some point, several months into my Cookie Clicker journey (and having also become a connoisseur of several other incremental games along the way), I finally started to feel the creative itch. Why couldn’t I make my own incremental game?

Being steeped in academia, and unburdened by the cynicism that comes with experience, I imagined an incremental game where you start off as a lowly graduate student, trying to gain research points by writing papers, and slowly go up the ranks of professorship until you become the head of a department churning out hundreds of high-quality research papers and attracting billions of investment dollars. I started writing a very basic game. I had a rough vision in mind, but I didn’t have all the web programming skills I needed yet. So I learned one step at a time, one concept at a time, until I knew enough to put together a rough prototype.

My game, descriptively titled ‘Research Simulator’. Aren’t you just itching to know what lies beyond the venerable Senior Lecturer?

It looked awful, but it had a few minutes of gameplay and was a good proof-of-concept. I continued working on the game for a few days but, my itch having been scratched, I ran out of steam. I also became aware that some clever scientists at CERN had developed a much better version of the game that I was planning to build. I abandoned my game and didn’t give the episode another thought.

A skill I didn’t know I had

A few months later I found myself in a discussion with my Ph.D. advisor about uncertainty in charts and graphs. We had come up with the idea of using error bars as a control mechanism — if you want to compute a data point with more certainty, could you drag the error bars around to indicate the level of uncertainty you were comfortable with? We designed an experiment to test this idea, now all we needed was an interactive prototype for people to use.

It’s worth mentioning that until this point my primary programming experience was in Java, Python, and R. I used Python and R for data manipulation and analysis, and Java for prototyping user interfaces (UIs). The problem was that I wasn’t very good at it. Although my own skills were partly to blame, from my perspective it was also the case that Java’s antiquated UI programming libraries were terrible for rapid UI prototyping. As a consequence, in the entire first year of my Ph.D. I produced only one, rather simple prototype.

I was dreading having to build this error bars prototype in Java when it suddenly hit me: why not build this as a web app? It seems completely obvious in hindsight, but it was a revelation to me at the time. For various reasons, the ecosystem of web development is much better geared towards UI prototyping. The months spent futzing about in Cookie Clicker add-ons, reading JavaScript documentation, and building my own terrible game had, unbeknownst to me, laid the foundations for all the skills I needed to build, tweak, debug, and most importantly, learn what I needed to learn to complete this prototype.

Previously, this might have taken me weeks. With JavaScript, I built the prototype in hours. Using web technology had another advantage: it was easy to deploy the study as a website and therefore get many more participants than I would have normally gotten in a lab-based experiment. The study was completed within a month and was published at a good conference.

My success with this study encouraged and empowered me. It became painfully clear how much of a creativity and productivity bottleneck my UI programming skills had been. The transition to web development resulted in a 10x increase in my prototyping speed. Over the next two years of my PhD I built several prototypes of varying complexity. And while other challenges inevitably arose, the building of UI prototypes remained a smooth and enjoyable experience throughout. I cannot conceive of how I would have produced nearly as much if I had stuck with Java.

Cookie Clicker saved my Ph.D.

The transition to web development resulted in a 10x increase in my prototyping speed. Cookie Clicker saved my Ph.D.

The moral of the story

There are several ways we could moralise this story.

We could view my investigations of Cookie Clicker’s mechanisms as ‘basic research’, i.e., research without an immediate application in mind. There is evidence that basic research in the sciences leads to productivity increases in manufacturing. Eloquent arguments have been made that forgoing basic research is costly, that “when academic research starts demonstrating industry relevance is when funding should be cut off, not augmented”.

Or, taking it down a peg, we could simply label what I was doing as gathering ‘useless knowledge’, which, like basic research, does not have an immediate application in mind, and which additionally does not even seek to answer questions considered useful in some way. Here again, mine would not be the first story lending credence to the idea that useless knowledge is, ultimately, useful!

But perhaps the simplest way to view this episode was as a playful, recreational activity which through sheer dumb luck gave me the skills needed to solve an important problem in my work life. My colleague Titus Barik analysed how programmers talk about programming as play, involving ‘spontaneous and creative expression’, ‘experimentation’, and ‘purposeless, ludic activity’. He found that many programmers reflect on episodes of playful programming as joyful experiences that catalysed learning.

I’m extremely grateful to Cookie Clicker for the journey it put me on, but even if I hadn’t ended up learning JavaScript because of it, I still wouldn’t regret, and would cherish, the many, many hours I spent tinkering and clicking away in my college dorm bedroom.


Mansfield, Edwin. “Basic research and productivity increase in manufacturing.” The American Economic Review 70, no. 5 (1980): 863-873.

Flexner, Abraham. “The Usefulness of Useless Knowledge”. Harpers, issue 179, June/November 1939. Available at

Barik, Titus. “Expressions on the nature and significance of programming and play.” In 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 145-153. IEEE, 2017.

Sarkar, Advait. Interactive analytical modelling. No. UCAM-CL-TR-920. University of Cambridge, Computer Laboratory, 2018.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Interaction with Uncertainty in Visualisations.” In EuroVis (Short Papers), pp. 133-137. 2015.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Teach and try: A simple interaction technique for exploratory data modelling by end users.” In 2014 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 53-56. IEEE, 2014.

Sarkar, Advait, Martin Spott, Alan F. Blackwell, and Mateja Jamnik. “Visual discovery and model-driven explanation of time series patterns.” In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 78-86. IEEE, 2016.

Sarkar, Advait, Mateja Jamnik, Alan F. Blackwell, and Martin Spott. “Interactive visual machine learning in spreadsheets.” In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 159-163. IEEE, 2015.

Sarkar, Advait, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff et al. “Setwise comparison: Consistent, scalable, continuum labels for computer vision.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 261-271. 2016.

Revealing the hidden guesswork of spreadsheet comprehension

This article is based on the following publication [PDF]:

Sruti Srinivasa Ragavan, Advait Sarkar, and Andrew D Gordon. 2021. Spreadsheet Comprehension: Guesswork, Giving Up and Going Back to the Author. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, Article 181, 1–21. DOI:

What’s the problem?

Not a year goes by without a spreadsheet horror story making headlines – British readers might most recently recall the error in Public Health England’s test-and-trace pipeline that led to the delayed contact tracing of 16,000 positive COVID-19 cases. Others might recall the infamous Reinhart-Rogoff paper “Growth in a Time of Debt”, which led several governments to favour severe austerity measures in an attempt to reduce their debt-to-GDP ratios, but which was later discovered to be based on a flawed analysis owing in part to a spreadsheet error. While these sensational stories make for great headlines, it is important to remember that for every high-profile spreadsheet error, there are millions of people who are quietly empowered by spreadsheets every day to understand their data, make better decisions, and improve their lives and businesses.

Nonetheless, helping to prevent, detect, and fix errors in spreadsheets is a central and enduring research problem for spreadsheet researchers.

To date, almost all design research towards mitigating spreadsheet errors has been aimed at the author – the person who writes the spreadsheet, and in particular the person who writes spreadsheet formulas. Spreadsheet formulas are, of course, what make spreadsheets programs, and since many researchers in this area have a computer science background, viewing a spreadsheet as a bunch of code can be reassuringly familiar.

However, this neglects two important facts. First, spreadsheets are not just code! In fact, the majority of spreadsheets contain few or no formulae at all! Spreadsheets also contain data, labels, colouring, data validations, charts, notes and comments. Second, people spend most of their time with spreadsheets as readers, not as authors. Most spreadsheets are used and read by multiple people, including the author, and many spreadsheets are collaboratively authored. Spreadsheet authors may themselves spend more time reading and reviewing their work than writing it.

When we came across a study which found that errors in comprehension are among the top 5 most common sources of spreadsheet errors, it became clear that we needed to study the process of spreadsheet comprehension in much greater detail, to identify the moments where errors could occur, and, potentially, design to prevent them.

Spreadsheets are not just code. In fact, the majority of spreadsheets contain few or no formulae at all.

How did we study it?

To understand spreadsheet comprehension, we designed a study in which we observed people who had been sent an unfamiliar spreadsheet by their colleagues, while they tried to understand it. These were real spreadsheets that participants actually needed to understand as part of their day-to-day work. We asked participants to think aloud as they read the spreadsheet: what were they looking at, and why? When were they confused? What strategies were they trying to apply?

We made detailed screen and audio recordings as they read their spreadsheets, speaking out aloud. Then, we chopped each recording into 20-second segments. For each segment, my colleague and I noted down the following. In that moment,

  1. What types of information was the participant seeking?
  2. What strategies was the participant employing to get that information?
  3. What barriers and pain points were they encountering?

This, as you might imagine, was not an easy task. We didn’t have predefined lists of information types, strategies, or barriers, so these had to be developed iteratively. Moreover, it was not always clear how to classify a given segment. Often we would find ourselves re-watching and discussing a 20-second segment at half speed for several minutes before reaching a consensus on what the participant was doing!

What did we find?

The analysis took weeks, but it was well worth it. The fine-grained classification of user activities allowed us to paint the most detailed picture of spreadsheet comprehension to date. Here is what we found.

Our first finding was that spreadsheet comprehension is a lot more than just formula comprehension. Prior work had largely taken the view of ‘spreadsheets as code’ and therefore focused on difficulties in understanding formulas. While our participants also faced difficulties understanding formulas, we also saw spreadsheets replete with comprehension challenges that contained no formulas at all! People have difficulties understanding data, interpreting and comparing charts, data validation rules, conditional formatting rules, and even just the formatting applied to a cell (e.g., why is the text in this cell red?). So when we think about spreadsheet comprehension, we need to think about all these activities and not just about formulas.

Our second finding was that participants spent a whopping 40% of their time on what we call ‘information seeking detours’. Often, participants needed to navigate away from the specific part of the spreadsheet they were trying to understand to a different part, or even away from the spreadsheet application entirely to the web, documents, or emails, where they could gather the information they needed to continue. For example, while trying to understand a formula, it was very common for the user to visit each of the cell references mentioned in the formula and scan that area of the spreadsheet for labels and other documentation. Such context switches are not just a productivity loss; we found that the switches could themselves introduce errors, as when one participant went to another sheet to look up a cell reference, but then misremembered it when typing it in.

Participants spent a whopping 40% of their time on ‘information seeking detours’, navigating away from the part of the spreadsheet they were trying to understand.

Our third finding was an astonishing reliance on guesswork. Often, information seeking detours led to a dead end, and the participant was unable to find out what they needed. So, they guessed. Sometimes they were able to frame their guesses explicitly and test them in some form. For example, they might guess that a number was the sum of certain other numbers – they could verify this by adding up the numbers themselves. But often these guesses couldn’t be easily tested, and so participants simply continued with unverified assumptions. Such guesses are clearly a problem, as an incorrect assumption could cascade into a host of incorrect understandings all over the spreadsheet. In some cases it was impossible even to make a plausible guess, and in that situation the participants needed to consult the spreadsheet’s author. The author was not always readily available, which made this another source of productivity loss. So we need to address the problem of missing information to help users with spreadsheet comprehension.

Finally, while it’s tempting to say that these comprehension difficulties arose because the spreadsheets were written poorly, with no explanations, this was not the case at all. All the spreadsheets we saw in our study used good layout, colours, and documentation – all the best practices recommended for building good spreadsheets. The comprehension difficulties were not due to a lack of authorial effort, but persisted despite everything the authors did to make the spreadsheets more comprehensible. In the paper, we make some arguments for why this might be the case, and what we could do about it.

What can we do about it?

There are two sides of the comprehension equation: the reader, and the author. From the reader’s perspective, we can improve comprehension in spreadsheets by making hidden information more visible (such as conditional formatting, number formatting, and data validation). We can also reduce the number and the cost of information seeking detours, for example, by allowing users to view other portions of the spreadsheet without navigating away from the portion they were interested in understanding.

From the author’s perspective, we need to design for more fluid and relevant annotation, enabling authors to write explanations and documentation inline with the data, and detecting areas of the spreadsheet, potentially using machine learning, that are likely to benefit from additional explanation.


Our study is the first fine-grained analysis of spreadsheet comprehension. It revealed a number of previously unknown issues and new opportunities for designing tools and interfaces that will make visible what is hidden in spreadsheets, in ways that benefit the millions of users who depend on spreadsheets every day.

Want to learn more? Read our paper here [PDF], and see the publication details below:

Sruti Srinivasa Ragavan, Advait Sarkar, and Andrew D Gordon. 2021. Spreadsheet Comprehension: Guesswork, Giving Up and Going Back to the Author. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, Article 181, 1–21. DOI:


This article reports joint work by Sruti Srinivasa Ragavan, Advait Sarkar, and Andy Gordon.

The rise of parallel chat in online meetings: how can we make the most of it?

By Advait Sarkar and Sean Rintel. Originally published on the Microsoft Research Blog.

“I’ll put a link to that doc in the chat”

“Sorry, my internet is terrible, I’ll put my question in the chat”

“That GIF from Amy in the chat is hilarious”

If these phrases sound familiar to you, you’re not alone – people the world over use the chat function in Microsoft Teams and other video-calling services every day. But we could ask: why post messages in parallel to the main conversation? Does it improve or worsen the meeting experience? And crucially, what can we do to make it more effective?

Why do people chat – and is it a good thing for meetings?

To answer these questions, we conducted a study drawing on two sources of data: (1) the diaries of 849 Microsoft employees who journaled their experiences during the summer of 2020, when COVID-19 triggered widespread remote work and online meetings, and (2) a survey of 149 Microsoft employees that specifically asked how they used chat. The paper was accepted at the 2021 ACM CHI Virtual Conference on Human Factors in Computing Systems.

In the survey, the overwhelming majority (85.7%) of participants agreed that parallel chat is a net positive. Only 4.5% responded negatively, and 9.8% were neutral. 

Chat messages themselves can contain a lot of different things: questions; links and documents; agreement and praise that add to what is being conveyed during the call; discussion of related and unrelated topics; and humour and casual conversation. 

Many people did report being distracted by chat – it is difficult to focus on the audio/video (AV) of a meeting while also participating in the chat. People clash over different expectations around how to chat and how formal chat needs to be. Participants who fail to notice important chat entries can be left confused. Moreover, chat poses challenges for people with reading difficulties or people who find it hard to understand sentiment in text. Images posted in chat may not come with alt text to help blind and low vision people understand them.

As one participant says, “[…] sometimes it’s very distracting as multiple threads are happening that get tangential from the main presenter/speaker. […] it’s really hard to keep track of multiple conversations AND pay attention to the speaker.”

And yet our research showed compelling benefits of chat within meetings, which argues for encouraging it, even as we continue to look for ways to minimize the negative consequences.

The many benefits of chat

Chat has become essential in virtual meetings – many online meetings would be much less efficient, and some would be impossible, without chat. Chat enables people to organize their collaboration and action around documents and follow-up meetings. It enables people to work around problems such as poor connectivity and technical issues, language barriers, and inscrutable jargon. It helps manage turn taking and questions/answers, especially in large meetings.

A survey participant explains, “there have been meetings where important links were able to be provided in the text chat, important and relevant topics were brought up and then incorporated into the meeting, etc – these are times when I feel like I really could not live without [it]”.

Beyond these functional roles, chat also enables humor and casual conversation, which give meetings a much-needed sense of social support and connection.

As one participant puts it, “we use text chat to send ‘cheers’ and fun gifs to celebrate moments […] this tends to generate a lot of enthusiasm and makes these types of meetings more fun. like people’s personalities coming out.”

Perhaps most importantly, chat can be a means of inclusion. Chat enables people to participate without interrupting the speaker, preserving the flow of the meeting. It enables contributions from those who are shy or unable to speak. And by keeping a record of reactions to posts, it can help participants support good ideas that arise from the sidelines. One participant observed “people contributing through chat that might not have a voice otherwise – either limited by technology (no microphone), environment (loud, distracting) or personal preference (shy, new, still finding the way in the team’s culture.)”

Our poll data provided a specific example of chat’s power for inclusivity: women aged 25-34 were much more likely than any other group to report an increase in chat use after shifting to remote work. And women of every age reported more chat usage than men.

Figure 1: Increased chat use was most reported by women aged 25-34. The figure shows the proportion of respondents, by age and gender, who ‘strongly agree’ that their own chat use has increased: ages 25-34, 20% of men and 59% of women; ages 35-44, 8% of men and 29% of women; ages 45-64, 24% of men and 26% of women.

This is an important finding, because gender is associated with different meeting experiences and participation rates. Young women sometimes find it difficult to be heard during meetings, and our study suggests that chat gives them another way to participate. However, we should not equate participation in chat with speaking in the ‘main’ audio/video of the call, which is generally more significant.

More broadly, members of minority groups and people with disabilities – including neurodivergent professionals and those who are blind or low vision – may have positive or negative experiences with chat. On the positive side, chat may enable greater participation for people who are unable to speak. Further study is needed to understand whether chat improves the inclusion of those who suffer from systemic disadvantage, or whether it entrenches that disadvantage, perhaps even exacerbating it, by relegating their participation to a side channel.

How to make the most of chat in online meetings

Chat has both benefits and disadvantages. Used well, it can be a powerful and effective tool in online meetings. Used poorly, it can cause distraction and confusion. Based on our study, here are some guidelines for using chat effectively:

  1. Establish expectations about chat usage before the meeting starts. Clear guidelines will support chat that is inclusive, as well as productive.
  2. Consider accessibility challenges (e.g., some attendees may have difficulty reading text or understanding sentiment in text).
  3. Encourage chat that engages with the meeting topic or makes the meeting more inclusive. Discourage overly off-topic, exclusionary, inaccessible chat.
  4. Monitor chat for questions and comments and address them in the main conversation.
  5. Include a chat summary in meeting archives to preserve and share ideas and feedback.

We’ve also written a meeting chat guide with more explanation of these guidelines.

Chat to the future

Improving chat can lead to more effective conversations as well as enhanced meeting tools. Based on our study, we identified several opportunities for enhancing chat through design. For example, we could use machine learning, or a tagging feature, to identify and differentiate between types of chat messages, so participants could visually recognize questions, clarifications, comments, kudos, and on- or off-topic talk. We could better integrate chat with the main audio/video conversation by showing indicators of whether chat is quiet or busy, highlighting messages containing terms that match what is being discussed, and integrating images and websites from the chat into the main video stream. For a longer list of opportunities, see section 4.2 of our paper.
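
To make the tagging idea concrete, here is a toy sketch of a keyword heuristic that sorts chat messages into rough types, which a client could then render differently. The categories and keywords are illustrative assumptions on our part, not a real Teams feature or our study’s taxonomy.

```python
# Toy message tagger: classify a chat message into a rough type.
# Categories and keyword lists are illustrative assumptions.

def tag_message(text):
    t = text.lower()
    if t.rstrip().endswith("?"):
        return "question"
    if "http://" in t or "https://" in t:
        return "link"
    if any(w in t for w in ("thanks", "great point", "kudos", "+1")):
        return "praise"
    return "comment"

print(tag_message("Could you share the slides?"))  # → "question"
print(tag_message("https://example.com/doc"))      # → "link"
print(tag_message("Great point, Amy!"))            # → "praise"
```

In a real system, a learned classifier would replace the keyword rules, but the interface idea is the same: once each message carries a type, the client can colour-code questions, surface unanswered ones, or fold off-topic banter away.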

The age of online meetings has just begun. We’re still learning how to build tools for effective online collaboration. Even in things as unassuming and prosaic as chat, there are challenges but also tremendous opportunities.

In the spirit of using poetry to help communicate science, we leave readers with a poem written by the first author (Advait Sarkar) that takes inspiration from our study, which was conducted during a pandemic that kept us all apart far longer than expected:

Summer came, but winter’s game
Would see no end
The micro-reign, devil stain
Cleft foe from foe, and friend from friend

Learnt we swift, our this day’s gift
To meet apart
Let sound, sight, and what we write
Join head to head, and heart to heart

Want to learn more? Read our paper and see the publication details below:

Advait Sarkar, Sean Rintel, Damian Borowiec, Rachel Bergmann, Sharon Gillett, Danielle Bragg, Nancy Baym, and Abigail Sellen. 2021. The promise and peril of parallel chat in video meetings for work. In CHI Conference on Human Factors in Computing Systems Extended Abstracts (CHI ’21 Extended Abstracts), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA 8 Pages.


This research was authored by Advait Sarkar, Sean Rintel, Damian Borowiec, Rachel Bergmann, Sharon Gillett, Danielle Bragg, Nancy Baym, and Abigail Sellen and part of a larger project about meetings during the COVID-19 pandemic.

Tell, don’t show: how to teach AI

Should we teach good behaviour to Artificial Intelligence (AI) through our feedback, or should we try and tell it a set of rules explaining what good behaviour is? Both approaches have advantages and limitations, but when we tested them in a complex scenario, one of them emerged the winner.

If AI is the future, how will we tell it what we want it to do?

Artificial intelligence is capable of crunching through enormous datasets and providing us assistance in many facets of our lives. Indeed, it seems this is our future. An AI assistant may help you decide what gifts to buy for a friend, or what books to read, or who to meet, or what to do on the weekend. In the worst case, of course, this could be dystopian – AI controls us, and not the other way around, we’ve all heard that story – but in the best case, it could be incredibly stimulating, deeply satisfying, and profoundly liberating.

But an important and unsolved problem is that of specifying our intent, our goals, and our desires, for the AI system. Assuming we know what we want from the AI system (this is not always the case, as we’ll see later), how do we teach the system? How do we help the system learn what gifts might be good for a friend, what books we might like to read, the people we might like to meet, and the weekend activities we care about?

There are many parts to this problem, and many solutions. The solution ultimately depends on the context in which we’re teaching the AI, and the task we’re recruiting it to do for us. So in order to study this, we need a concrete problem. Luckily for me, Ruixue Liu decided to join us at Microsoft for an internship in which she explored a unique and interesting problem indeed. The problem we studied was how to teach an AI system to give us information about a meeting, where for some reason, we can’t see the meeting room.

Our problem: eyes-free meeting participation

When people enter a meeting room, they can typically pick up several cues: who is in the meeting? Where in the room are they? Are they seated or standing? Who is speaking? What are they doing? Research shows that not having this information can be very detrimental to meeting participation.

Unfortunately, in many modern meeting scenarios, this is exactly the situation we find ourselves in. People often join online meetings remotely without access to video, due to device limitations, poor Internet connections, or because they are engaged in parallel “eyes-busy” tasks such as driving, cooking, or going to the gym. People who are blind or low vision also describe this lack of information as a major hurdle in meetings, whether in-person or online.

We think an AI system could use cameras in meeting rooms to present this information to people who, for whatever reason, cannot see the meeting room. This information could be relayed via computer-generated speech, or special sound signals, or even through haptics. Given that the participant only has a few moments to understand this information as they join a meeting, it’s important that only the most useful information is given to the user. Does the user want to know about people’s locations? Their pose? Their clothes? What information would be useful and helpful for meeting participation?

However, what counts as ‘most useful’ varies from user to user, and context to context. One goal of the AI system is to learn this, but it can’t do so without help from the user. Here is the problem: should the user tell the system what information is most useful, by specifying a set of rules about what information they want in each scenario, or should the user give feedback to the system, saying whether or not it did a good job over the course of many meetings, with the aim of teaching it correct behaviour in the long term?

Our study, in which we made people attend over 100 meetings

Don’t worry – luckily for the sanity of our participants, these weren’t real meetings. We created a meeting simulator which could randomly generate meeting scenarios. Each simulated meeting had a set of people – we generated names, locations (within the room), poses, whether they were speaking or not, and several other pieces of information. Because we were testing eyes-free meeting participation, we didn’t visualise this information – the objective was for the user to train the system to present a useful summary of this information in audio form.

We conducted a study in which 15 participants used two approaches to ‘train’ the system to relay the information they wanted. One approach was a rule-based programming system, where the participant could specify “if this, then that”-style rules. For example, “if the number of people in the meeting is less than 5, then tell me the names of the people in the meeting”.
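
A minimal sketch of such a rule system might look like the following, assuming each simulated meeting is a dictionary of attendee records. The field names and rule format here are illustrative, not those of the actual study system.

```python
# "If this, then that"-style rules over a simulated meeting.
# A rule inspects the meeting and either returns a summary fragment
# or None if it does not apply.

def rule_few_people(meeting):
    """If fewer than 5 people are present, tell me their names."""
    if len(meeting["people"]) < 5:
        return "Present: " + ", ".join(p["name"] for p in meeting["people"])
    return None

def rule_speaker(meeting):
    """Always tell me who is currently speaking."""
    speakers = [p["name"] for p in meeting["people"] if p["speaking"]]
    return "Speaking: " + ", ".join(speakers) if speakers else None

def summarise(meeting, rules):
    """Apply each rule in order and join the outputs of the rules that fire."""
    parts = [rule(meeting) for rule in rules]
    return "; ".join(p for p in parts if p)

meeting = {"people": [
    {"name": "Ada", "speaking": True},
    {"name": "Grace", "speaking": False},
]}
print(summarise(meeting, [rule_few_people, rule_speaker]))
# → "Present: Ada, Grace; Speaking: Ada"
```

Applying rules in a fixed order, as here, is one way to resolve conflicts; as we note below, our participants struggled precisely because they could not tell whether such an order of precedence existed.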

The other approach was a feedback-based training system (our technical approach used a kind of machine learning called deep reinforcement learning). In the feedback-based system, the user couldn’t say what they wanted directly. Instead, as they went to various (simulated) meetings, the system would do its best to summarise the information, and after each summary the user provided simple positive/negative feedback, answering “yes” or “no” to the question of whether they were satisfied with it.
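
As a deliberately simplified stand-in for this idea (the real system used deep reinforcement learning, not a bandit), here is a sketch of a learner that discovers which summary type the user prefers from yes/no feedback alone. The option names and the simulated user are made up for the illustration.

```python
import random

random.seed(0)  # for reproducibility of this sketch

class FeedbackLearner:
    """Learns which summary option the user prefers from yes/no feedback."""

    def __init__(self, options):
        self.counts = {o: 0 for o in options}  # times each option was shown
        self.wins = {o: 0 for o in options}    # times the user said "yes"

    def choose(self, explore=0.1):
        # Mostly exploit the option with the best approval rate,
        # occasionally explore another one at random.
        if random.random() < explore:
            return random.choice(list(self.counts))
        return max(self.counts, key=lambda o: self.wins[o] / (self.counts[o] + 1))

    def feedback(self, option, satisfied):
        self.counts[option] += 1
        if satisfied:
            self.wins[option] += 1

learner = FeedbackLearner(["names", "locations", "poses"])
# Simulate a user who is only ever satisfied with name summaries.
for _ in range(200):
    choice = learner.choose()
    learner.feedback(choice, satisfied=(choice == "names"))
print(max(learner.wins, key=learner.wins.get))  # converges towards "names"
```

Even in this toy version, the limitation our participants complained about is visible: the user can only reward or punish whole summaries, never explain *which part* was wrong.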

Each participant tried both systems, one after the other in randomised order. We let participants play around, test and tweak and teach the AI as much as they liked, and try out the system’s behaviour on as many simulated meetings as they liked. Many participants “attended” well over 100 meetings, with two participants choosing to attend nearly 160 meetings over the course of the experiment! Who knew meetings could be such fun!

We asked participants to fill out a few questionnaires about their experience of interacting with both systems, and we conducted follow-up interviews to talk about their experience, too.


Participants reported a significantly lower cognitive load and higher satisfaction when giving the system rules than when giving feedback. In other words, it was easier and more satisfying to tell the AI how to behave than to show it how to behave through feedback.

Rule-based programming gave participants a greater feeling of control and flexibility, but some participants found it hard at the beginning of the experiment to formulate rules from scratch. Participants also found it hard to understand how different rules worked together, and whether conflicting rules had an order of precedence (they did not).

Feedback-based teaching was seen by participants as easier, but much more imprecise. There were instances where the system did something almost correct, but because the user could only say whether the behaviour was good or bad, they did not have the tools to give more nuanced feedback to the system. Moreover, people don’t just know their preferences, they figure them out over time. With feedback-based teaching, participants worried that they were ‘misleading’ the system with poor feedback at the early stages of training, while they were still figuring out what their preferences were.


Based on our results, we would recommend a rule-based programming interface. But as explained, we found several advantages and limitations to both approaches. In both cases, we found that the first step was for the human to figure out what they wanted from the system! This is hard if the user doesn’t have a clear idea of what the system can and can’t do; our first recommendation is for system designers to make this clear.

Our participants also had a hard time in both cases expressing their preferences exactly: with rules, it was because the rule-based programming language was complex, and with feedback-based teaching, it was because yes/no feedback isn’t precise enough. Our second recommendation is to make clear to users what actions they need to take to specify certain preferences.

Finally, it was difficult for participants to understand the system they finally trained; it was difficult to know what rules would apply in certain scenarios, and they also found the feedback-trained system to be unpredictable. Our third recommendation is to provide more information as to why the system does what it does in certain scenarios.

In the future, we should consider blending the two approaches, to get the best of both worlds. For example, the feedback-based system could be used to generate candidate rules, to help users form a better idea of their preferences, or detect hard-to-specify contexts. Rule-based systems could help define context, explain behaviour learnt by the system, and provide a way for specifying and editing information not captured by the feedback-trained system. We aren’t sure what this might look like, but we’re working on it. Until then, let’s aim to tell, and not show, what we want our AI to do.

Here’s a summary poem:

Yes, no, a little more
What do you want?
I can do this, this, and this
But that I can’t

Tell me and I’ll show you
What you can’t see
I’ll do my best to learn from
What you tell me

Want to learn more? Read our study here (click to download PDF), and see the publication details below:

Liu, Ruixue, Advait Sarkar, Erin Solovey, and Sebastian Tschiatschek. “Evaluating Rule-based Programming and Reinforcement Learning for Personalising an Intelligent System.” In IUI Workshops. 2019.

People reluctant to use self-driving cars, survey shows

Autonomous vehicles are going to save us from traffic, emissions, and inefficient models of car ownership. But while songs of praise for self-driving cars are regularly sung in Silicon Valley, does the public really want them?

That’s what my student Charlie Hewitt, and collaborators Ioannis Politis and Theocharis Amanatidis set out to study. We decided to conduct a public opinion survey to find out.

However, we first had to solve two problems.

  1. When Charlie started his work, there were no existing surveys designed specifically around autonomous vehicles. There were surveys for technology acceptance in general, and some for cars, which were a good start – so we combined those and added some new material, resulting in a new survey designed specifically for autonomous vehicles. We called it the Autonomous Vehicle Acceptance Model, or AVAM for short.
  2. When people think of self-driving cars, they generally picture a futuristic pod with no steering wheel or controls, that they just step into and get magically transported to their destination. However, the auto industry differentiates between six levels of autonomy. Previous studies had attempted to get people’s attitudes to each of these levels, but it turns out people can’t picture these different levels of autonomy very well, and don’t understand how they differ. So, Charlie created short descriptions to explain the differences between them. These vignettes are a key part of the AVAM, because they help the general public understand the implications of different levels of autonomy.

Here are the six levels of autonomous vehicles as described in our survey:

  • Level 0: No Driving Automation. Your car requires you to fully control steering, acceleration/deceleration and gear changes at all times while driving. No autonomous functionality is present.
  • Level 1: Driver Assistance. Your car requires you to control steering and acceleration/deceleration on most roads. On large, multi-lane highways the vehicle is equipped with cruise-control which can maintain your desired speed, or match the speed of the vehicle to that of the vehicle in front, autonomously. You are required to maintain control of the steering at all times.
  • Level 2: Partial Driving Automation. Your car requires you to control steering and acceleration/deceleration on most roads. On large, multi-lane highways the vehicle is equipped with cruise-control which can maintain your desired speed, or match the speed of the vehicle to that of the vehicle in front, autonomously. The car can also follow the highway’s lane markings and change between lanes autonomously, but may require you to retake control with little or no warning in emergency situations.
  • Level 3: Conditional Driving Automation. Your car can drive partially autonomously on large, multi-lane highways. You must manually steer and accelerate/decelerate when on minor roads, but upon entering a highway the car can take control and steer, accelerate/decelerate and switch lanes as appropriate. The car is aware of potential emergency situations, but if it encounters a confusing situation which it cannot handle autonomously then you will be alerted and must retake control within a few seconds. Upon reaching the exit of the highway the car indicates that you must retake control of the steering and speed control.
  • Level 4: High Driving Automation. Your car can drive fully autonomously only on large, multi-lane highways. You must manually steer and accelerate/decelerate when on minor roads, but upon entering a highway the car can take full control and can steer, accelerate/decelerate and switch lanes as appropriate. The car does not rely on your input at all while on the highway. Upon reaching the exit of the highway the car indicates that you must retake control of the steering and speed control.
  • Level 5: Full Driving Automation. Your car is fully autonomous. You are able to get into the car and instruct it where you would like to travel to, the car then carries out your desired route with no further interaction required from you. There are no steering or speed controls as driving occurs without any interaction from you.

Before you read on, think about each of those levels. What do you think are the advantages and disadvantages of each? Which would you be comfortable with and why?

We sent our survey to 187 drivers recruited from across the USA, and here’s what we found:

Result 1: our respondents were not ready to accept autonomous vehicles.

We found that on many measures, people report lower acceptance of higher automation levels: they perceive higher autonomy levels as less safe, report lower intent to use them, and report greater anxiety about them.

We compared some of the results with those from an earlier study, conducted in 2014. We had to make some simplifying assumptions, as the 2014 study wasn’t conducted with the AVAM. However, we still found that our results were mostly similar: both studies found that people (unsurprisingly) expected to have to do less as the level of autonomy increased. Both studies also found that people showed lower intent to use higher autonomy vehicles, and poorer general attitude towards higher autonomy. Self-driving cars seem to be suffering in public opinion!

Result 2: the biggest leap in user perception comes with full autonomy.

We asked people how much they would expect to have to use their hands, feet and eyes while using a vehicle at each level of autonomy. Even though vehicles at the intermediate levels of autonomy (3 and 4) can do significantly more than levels 1 and 2, people did not perceive the higher levels as requiring significantly less engagement. However, at level 5 (full autonomy), there was a dramatic drop in expected engagement. This was an interesting and new finding (albeit not entirely surprising). One explanation for this is that people only really perceive two levels of autonomy: partial and full, and don’t really care about the minor differences in experience with different levels of partial autonomy.

All in all, we were fascinated to learn about people’s attitudes to self-driving cars. Despite the enthusiasm displayed by the tech media, there seems to be a consistent concern around their safety and reluctance to adopt amongst the general public. Even if self-driving cars really do end up being safer and better in many other ways than regular cars, automakers will still face this challenge of public perception.

And now, a summary poem:

The iron beast has come alive,
We do not want it, do not want it
Its promises we do not prize
It does not do as we see fit

Only when we can rely
On iron beast with its own eye
Only then will we concede
And disaffection yield to need

If you’re interested in using our questionnaire or our data, please reach out! I’d love to help you build on our research.

Want to learn more about our study? Read it here (click to download PDF) or see the publication details below:

Charlie Hewitt, Ioannis Politis, Theocharis Amanatidis, and Advait Sarkar. 2019. Assessing public perception of self-driving cars: the autonomous vehicle acceptance model. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI ’19). ACM, New York, NY, USA, 518-527. DOI:

Ask people to order things, not score them

Ever graded an essay? Given scores to interview candidates? Given a rating to an item on Amazon? Liked a video on YouTube?

We’re constantly asked to rate or score things on absolute scales. It’s convenient: you only have to look at each thing once to give it a score, and once you’ve got a set of things all reduced to a single number, you can compare them, group them into categories, and find the best one (and the worst).

However, a growing body of evidence points to the fact that humans are simply not very good at giving absolute scores to things. By not very good, we mean there are two problems:

  • Different people give different scores to the same thing (low inter-rater reliability)
  • The same person can give different scores to the same thing, when asked to score it repeatedly (low intra-rater reliability)

But don’t worry! There’s a better way: ordering things, not scoring them. Let me illustrate with two case studies.

Making complex text easier to read

A cool modern application of artificial intelligence / machine learning is “lexical simplification”, which is an ironically fancy way of saying “making complex text easier to read by substituting complex words with simpler synonyms”. This is a great way to make text accessible to young readers and those not fluent in the language. Finding synonyms for words is easy, but detecting which words in a sentence are “complex” is hard.

To teach the AI system what counts as a complex word and what doesn’t, we need to give it a set of labelled training examples: a list of words that have already been labelled by humans as complex or not. Traditionally, this dataset was generated by giving human labellers some text and asking them to select the complex words in it. This is a simple scoring system: every word is scored either 1 or 0, depending on whether or not it is complex.

However, we knew from previous research that people are inconsistent in giving these absolute scores. So, my student Sian Gooding set out to see if we could do better. She conducted an experiment where half the participants used the old labelling system, and the other half used a sorting system. In the sorting system, participants were given some text, and asked to order the words in that text from least to most complex.

We found that with the sorting system, participants were far more consistent and created a far better labelled training set!

Helping clinicians assess multiple sclerosis

The Microsoft ASSESS-MS project aimed to use the Kinect camera (which captures depth information as well as regular video) to assess the progression of multiple sclerosis. The idea is that because MS causes degeneration of motor function that manifests in movements such as tremor, it should be possible to use computer vision to track and understand a patient’s movements with the Kinect camera, and assign a score corresponding to the severity of the patient’s illness.

To train the system, we first needed a set of labelled training videos. That is, videos of patients for which neurologists had already provided the severity of illness scores. The problem was that the clinicians were giving scores on a standardised medical scale of 0 to 4, but their scores were suffering from poor consistency! With inconsistent scores, there was little hope that the computer vision system would learn anything.

The video illustrates our deck-sorting interface for clinicians.

Our solution was to ask clinicians to sort sets of patient videos. We found that giving clinicians “decks” of about 8 videos to sort in order of illness severity worked well – any more than that and the task became too challenging. But we wanted them to rate nearly 400 videos. To go from orderings of 8 videos at a time, to a full set of orderings for the entire dataset, we needed an additional step. For this, we used the TrueSkill algorithm, which is able to merge the results from many orderings (how exactly we did this is detailed in our paper, which you can read here (PDF)).
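
TrueSkill itself is a Bayesian skill-rating system; as a much-simplified, illustrative stand-in, here is an Elo-style update that merges many small orderings (“decks”) into one global score per item. The video IDs and deck data are invented for the sketch, and the real pipeline (described in the paper) used TrueSkill, not Elo.

```python
# Merge many partial orderings into global scores, Elo-style.
# Each deck lists items from least to most severe; every adjacent pair
# is treated as a 'match' won by the more severe item.

def merge_orderings(decks, k=32, rounds=20):
    scores = {item: 1000.0 for deck in decks for item in deck}
    for _ in range(rounds):
        for deck in decks:
            for lo, hi in zip(deck, deck[1:]):
                # Expected probability that 'hi' beats 'lo' given current scores.
                expected = 1 / (1 + 10 ** ((scores[lo] - scores[hi]) / 400))
                scores[hi] += k * (1 - expected)
                scores[lo] -= k * (1 - expected)
    return scores

# Three overlapping decks, consistent with the order v1 < v2 < v4 < v3.
decks = [["v1", "v2", "v3"], ["v2", "v4"], ["v1", "v4", "v3"]]
scores = merge_orderings(decks)
ranking = sorted(scores, key=scores.get)
print(ranking)  # v1 should come out least severe, v3 most severe
```

The key property, shared with the TrueSkill approach, is that the decks only need to overlap: no clinician ever ranks all 400 videos, yet every video ends up on one common scale.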

To our amazement, we found that the resulting scores were significantly more consistent than anything we had previously measured, and handily exceeded clinical gold standards for consistency.

But why does it work?

It’s not yet clear why people are so much better at ordering than scoring. One hypothesis is that it requires people to provide less information. When you score something on a scale of 1-10, you have 10 choices for your answer. But when you compare two items A and B, you only have 3 choices: is A less than B, or is B less than A, or are they equal? However, this hypothesis doesn’t explain what Sian and I saw in the word complexity experiment, since in the scoring condition, users were only assigning scores of 0 or 1. Another hypothesis is that considering how multiple items relate to each other gives people multiple reference points, leading to better decisions. More research is required to test these hypotheses.
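
The less-information hypothesis can be made concrete with a quick back-of-the-envelope calculation: the most information a single judgment can carry is the base-2 logarithm of the number of distinct answers available.

```python
import math

# Upper bound on information per judgment: log2 of the number of
# distinct answers the rater can give.
bits_score_10 = math.log2(10)  # a 1-10 score: ten possible answers
bits_compare = math.log2(3)    # a comparison: A < B, B < A, or A = B
bits_binary = math.log2(2)     # the 0/1 labels in the word-complexity study

print(round(bits_score_10, 2), round(bits_compare, 2), round(bits_binary, 2))
# → 3.32 1.58 1.0
```

So a 1-10 score demands roughly twice the information per judgment that a comparison does, which is consistent with the hypothesis; but as noted above, the binary labelling condition demanded even less than a comparison and was still less consistent, so information content alone cannot be the whole story.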

In conclusion

People are asked to score things on absolute scales all the time, but they’re not very good at it. We’ve shown that people are significantly better at ordering things in a variety of domains, including identifying complex words, and assessing multiple sclerosis, although we’re not quite sure why.

The next time you find yourself assigning absolute scores to things – try ordering them instead. You might be surprised at the clarity and consistency it brings!

And now, a summary poem:

I wished to know the truth about this choice
And with no guide I found myself adrift
No measure, no register, no voice
But when juxtaposed with others,
brought resolution swift.

Black and white, true and false, desire:
Nature makes a myriad form of each.
Context drives our understanding higher,
To compare things brings them well within our reach.

Want to learn more about our studies? See the publication details below:

Sarkar, Advait, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff et al. “Setwise comparison: Consistent, scalable, continuum labels for computer vision.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 261-271. ACM, 2016. Download PDF

Gooding, Sian, Ekaterina Kochmar, Alan Blackwell, and Advait Sarkar. “Comparative judgments are more consistent than binary classification for labelling word complexity.” In Proceedings of the 13th Linguistic Annotation Workshop, pp. 208-214. 2019. Download PDF

Steinheimer, Saskia, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi et al. “Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support–a proof of concept study.” Disability and rehabilitation (2019): 1-7.