That’s what my student Charlie Hewitt and our collaborators Ioannis Politis and Theocharis Amanatidis set out to study, and we decided to conduct a public opinion survey to find out.

However, we first had to solve two problems.

- When Charlie started his work, there were no existing surveys designed specifically around autonomous vehicles. We had some surveys for technology acceptance in general, and some for cars, which were a good start. So we combined those and added some additional information. The result was a new survey designed specifically for autonomous vehicles. We called it the Autonomous Vehicle Acceptance Model, or AVAM for short.
- When people think of self-driving cars, they generally picture a futuristic pod with no steering wheel or controls that they just step into and get magically transported to their destination. However, the auto industry differentiates between six levels of autonomy. Previous studies had attempted to get people’s attitudes to each of these levels, but it turns out people can’t picture these different levels of autonomy very well, and don’t understand how they differ. So, Charlie created short descriptions to explain the differences between them. These vignettes are a key part of the AVAM, because they help the general public understand the implications of different levels of autonomy.

Here are the six levels of autonomous vehicles as described in our survey:

**Level 0**: No Driving Automation. Your car requires you to fully control steering, acceleration/deceleration and gear changes at all times while driving. No autonomous functionality is present.

**Level 1**: Driver Assistance. Your car requires you to control steering and acceleration/deceleration on most roads. On large, multi-lane highways the vehicle is equipped with cruise-control which can maintain your desired speed, or match the speed of the vehicle to that of the vehicle in front, autonomously. You are required to maintain control of the steering at all times.

**Level 2**: Partial Driving Automation. Your car requires you to control steering and acceleration/deceleration on most roads. On large, multi-lane highways the vehicle is equipped with cruise-control which can maintain your desired speed, or match the speed of the vehicle to that of the vehicle in front, autonomously. The car can also follow the highway’s lane markings and change between lanes autonomously, but may require you to retake control with little or no warning in emergency situations.

**Level 3**: Conditional Driving Automation. Your car can drive partially autonomously on large, multi-lane highways. You must manually steer and accelerate/decelerate when on minor roads, but upon entering a highway the car can take control and steer, accelerate/decelerate and switch lanes as appropriate. The car is aware of potential emergency situations, but if it encounters a confusing situation which it cannot handle autonomously then you will be alerted and must retake control within a few seconds. Upon reaching the exit of the highway the car indicates that you must retake control of the steering and speed control.

**Level 4**: High Driving Automation. Your car can drive fully autonomously only on large, multi-lane highways. You must manually steer and accelerate/decelerate when on minor roads, but upon entering a highway the car can take full control and can steer, accelerate/decelerate and switch lanes as appropriate. The car does not rely on your input at all while on the highway. Upon reaching the exit of the highway the car indicates that you must retake control of the steering and speed control.

**Level 5**: Full Driving Automation. Your car is fully autonomous. You are able to get into the car and instruct it where you would like to travel to, the car then carries out your desired route with no further interaction required from you. There are no steering or speed controls as driving occurs without any interaction from you.

Before you read on, think about each of those levels. What do you think are the advantages and disadvantages of each? Which would you be comfortable with and why?

We sent our survey to 187 drivers recruited from across the USA, and here’s what we found:

On many measures, people report lower acceptance of higher automation levels. They perceive higher autonomy levels as being less safe, and they report lower intent to use them and higher anxiety about them.

We compared some of the results with those from an earlier study, conducted in 2014. We had to make some simplifying assumptions, as the 2014 study wasn’t conducted with the AVAM. However, we still found that our results were mostly similar: both studies found that people (unsurprisingly) expected to have to do less as the level of autonomy increased. Both studies also found that people showed lower intent to use higher autonomy vehicles, and poorer general attitude towards higher autonomy. Self-driving cars seem to be suffering in public opinion!

We asked people how much they would expect to have to use their hands, feet and eyes while using a vehicle at each level of autonomy. Even though vehicles at the intermediate levels of autonomy (3 and 4) can do significantly more than levels 1 and 2, people did not perceive the higher levels as requiring significantly less engagement. However, at level 5 (full autonomy), there was a dramatic drop in expected engagement. This was an interesting and new finding (albeit not entirely surprising). One explanation for this is that people only really perceive two levels of autonomy: partial and full, and don’t really care about the minor differences in experience with different levels of partial autonomy.

All in all, we were fascinated to learn about people’s attitudes to self-driving cars. Despite the enthusiasm displayed by the tech media, there seems to be a consistent concern around their safety and reluctance to adopt amongst the general public. Even if self-driving cars really do end up being safer and better in many other ways than regular cars, automakers will still face this challenge of public perception.

And now, a summary poem:

*The iron beast has come alive,
We do not want it, do not want it
Its promises we do not prize
It does not do as we see fit*

*Only when we can rely
On iron beast with its own eye
Only then will we concede
And disaffection yield to need*

If you’re interested in using our questionnaire or our data, please reach out! I’d love to help you build on our research.

Want to learn more about our study? Read it here (click to download PDF) or see the publication details below:

Charlie Hewitt, Ioannis Politis, Theocharis Amanatidis, and Advait Sarkar. 2019. Assessing public perception of self-driving cars: the autonomous vehicle acceptance model. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI ’19). ACM, New York, NY, USA, 518-527. DOI: https://doi.org/10.1145/3301275.3302268

Ever graded an essay? Given scores to interview candidates? Given a rating to an item on Amazon? Liked a video on YouTube?

We’re constantly asked to rate or score things on absolute scales. It’s convenient: you only have to look at each thing once to give it a score, and once you’ve got a set of things all reduced to a single number, you can compare them, group them into categories, and find the best one (and the worst).

However, a growing body of evidence points to the fact that humans are simply not very good at giving absolute scores to things. By not very good, we mean there are two problems:

- Different people give different scores to the same thing (low inter-rater reliability)
- The same person can give different scores to the same thing, when asked to score it repeatedly (low intra-rater reliability)
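
To make these two notions concrete, here is a minimal sketch of one common agreement statistic, Cohen's kappa, which measures how often two sets of scores agree beyond what chance alone would produce. It is an illustration only and is not necessarily the measure used in the studies described here; the function name and the example scores are assumptions made for this sketch.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two equal-length lists of labels.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two lists give the same score.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if scores were assigned independently at random,
    # following each list's own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores: two raters (inter-rater), or one rater on two occasions (intra-rater).
first_pass = [1, 1, 0, 0]
second_pass = [1, 0, 1, 0]
print(cohens_kappa(first_pass, second_pass))
```

Passing scores from two different people gives an inter-rater figure; passing two scoring passes by the same person gives an intra-rater figure.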

But don’t worry! There’s a better way: ordering things, not scoring them. Let me illustrate with two case studies.

A cool modern application of artificial intelligence / machine learning is “lexical simplification”, which is an ironically fancy way of saying “making complex text easier to read by substituting complex words with simpler synonyms”. This is a great way to make text accessible to young readers and those not fluent in the language. Finding synonyms for words is easy, but detecting which words in a sentence are “complex” is hard.

To teach the AI system what counts as a complex word and what doesn’t, we need to give it a bunch of labelled training examples. That is, a list of words that have already been labelled by humans as being complex or not. Now traditionally, this dataset was generated by giving human labellers some text, and asking them to select the complex words in that text. This is a simple scoring system: every word is scored either 1 or 0, depending on whether the word is complex or not.

However, we knew from previous research that people are inconsistent in giving these absolute scores. So, my student Sian Gooding set out to see if we could do better. She conducted an experiment where half the participants used the old labelling system, and the other half used a sorting system. In the sorting system, participants were given some text, and asked to order the words in that text from least to most complex.

We found that with the sorting system, participants were far more consistent and created a far better labelled training set!

The Microsoft ASSESS-MS project aimed to use the Kinect camera (which captures depth information as well as regular video) to assess the progression of multiple sclerosis. The idea is that because MS causes degeneration of motor function that manifests in movements such as tremor, it should be possible to use computer vision to track and understand a patient’s movements with the Kinect camera, and assign them a score corresponding to the severity of their illness.

To train the system, we first needed a set of labelled training videos. That is, videos of patients for which neurologists had already provided the severity of illness scores. The problem was that the clinicians were giving scores on a standardised medical scale of 0 to 4, but their scores were suffering from poor consistency! With inconsistent scores, there was little hope that the computer vision system would learn anything.

*The video illustrates our deck sorting interface for clinicians*

Our solution was to ask clinicians to sort sets of patient videos. We found that giving clinicians “decks” of about 8 videos to sort in order of illness severity worked well – any more than that and the task became too challenging. But we wanted them to rate nearly 400 videos. To go from orderings of 8 videos at a time, to a full set of orderings for the entire dataset, we needed an additional step. For this, we used the TrueSkill algorithm, which is able to merge the results from many orderings (how exactly we did this is detailed in our paper, which you can read here (PDF)).
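
To give a feel for how many small orderings can be merged into one overall ranking, here is a rough sketch. It deliberately swaps in simple Elo-style rating updates in place of TrueSkill, so it is not the method from the paper; the function name, the `k` and `passes` parameters, and the example decks are all assumptions made for illustration.

```python
from itertools import combinations


def merge_orderings(decks, k=16.0, passes=20):
    """Combine several small orderings into one global ranking.

    Each deck is a list of item ids sorted from least to most severe.
    Uses Elo-style updates as a simple stand-in; this is NOT TrueSkill.
    """
    ratings = {item: 0.0 for deck in decks for item in deck}
    for _ in range(passes):
        for deck in decks:
            # A sorted deck implies every later item outranks every earlier one.
            for lower, higher in combinations(deck, 2):
                # Probability, under current ratings, that `higher` outranks `lower`.
                expected = 1.0 / (1.0 + 10 ** ((ratings[lower] - ratings[higher]) / 400.0))
                # Move both ratings in proportion to how surprising the result was.
                ratings[higher] += k * (1.0 - expected)
                ratings[lower] -= k * (1.0 - expected)
    # Least severe first, matching the order used within each deck.
    return sorted(ratings, key=ratings.get)


# Three hypothetical decks that overlap on a few items.
decks = [["a", "b", "c"], ["c", "d", "e"], ["b", "d"]]
print(merge_orderings(decks))
```

Items that never shared a deck (such as `a` and `e` here) still end up ordered relative to each other, because they were each compared with items they have in common.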

To our amazement, we found that the resulting scores were significantly more consistent than anything we had previously measured, and handily exceeded clinical gold standards for consistency.

It’s not yet clear *why* people are so much better at ordering than scoring. One hypothesis is that it requires people to provide less information. When you score something on a scale of 1-10, you have 10 choices for your answer. But when you compare two items A and B, you only have 3 choices: is A less than B, or is B less than A, or are they equal? However, this hypothesis doesn’t explain what Sian and I saw in the word complexity experiment, since in the scoring condition, users were only assigning scores of 0 or 1. Another hypothesis is that considering how multiple items relate to each other gives people multiple reference points, leading to better decisions. More research is required to test these hypotheses.

People are asked to score things on absolute scales all the time, but they’re not very good at it. We’ve shown that people are significantly better at ordering things in a variety of domains, including identifying complex words, and assessing multiple sclerosis, although we’re not quite sure why.

The next time you find yourself assigning absolute scores to things – try ordering them instead. You might be surprised at the clarity and consistency it brings!

And now, a summary poem:

*I wished to know the truth about this choice
And with no guide I found myself adrift
No measure, no register, no voice
But when juxtaposed with others,
brought resolution swift.*

*Black and white, true and false, desire:
Nature makes a myriad form of each.
Context drives our understanding higher,
To compare things brings them well within our reach.*

Want to learn more about our studies? See the publication details below:

Sarkar, Advait, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff et al. “Setwise comparison: Consistent, scalable, continuum labels for computer vision.” In *Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems*, pp. 261-271. ACM, 2016. https://doi.org/10.1145/2858036.2858199. Download PDF

Gooding, Sian, Ekaterina Kochmar, Alan Blackwell, and Advait Sarkar. “Comparative judgments are more consistent than binary classification for labelling word complexity.” In *Proceedings of the 13th Linguistic Annotation Workshop*, pp. 208-214. 2019. https://doi.org/10.18653/v1/W19-4024. Download PDF

Steinheimer, Saskia, Jonas F. Dorn, Cecily Morrison, Advait Sarkar, Marcus D’Souza, Jacques Boisvert, Rishi Bedi et al. “Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support–a proof of concept study.” *Disability and rehabilitation* (2019): 1-7. https://doi.org/10.1080/09638288.2018.1563832

The year is 2019. Voice-controlled digital assistants are great at simple commands such as “set a timer…” and “what’s the weather?”, but frustratingly little else.

Human language seems to be an ideal interface for computer systems; it is infinitely flexible and the user already knows how to use it! But there are drawbacks. Computer systems that aim to understand arbitrary language are really hard to build, and they also create unrealistic expectations of what the system can do, resulting in user confusion and disappointment.

The next frontier for voice assistants is complex dialogue in challenging domains such as managing schedules, analysing data, and controlling robots. The next generation of systems must learn to map ambiguous human language to precise computer instructions. The mismatch between user expectations and system capabilities is only worsened in these scenarios.

What if we could preserve the familiarity of natural language, while better managing user expectations and simplifying the system to boot? That’s exactly what my student Jesse Mu set out to study. The idea was to use what we called a *restricted language interface*, one that is a well-chosen subset of full natural language.

Jesse designed an experiment where participants played an interactive computer game called *SHRDLURN*. In this game, the player is given a set of blocks of different colours, and a “goal”, which is the winning arrangement of blocks. The player types instructions to the computer such as “remove the red blocks” and the computer tries to execute the instruction. The interesting bit is that the computer doesn’t understand language to begin with. In response to a player instruction, it presents the player with a list of block arrangements, and the player picks the arrangement that fits their instructions. Over time, the computer learns to associate instructions with the correct moves, and the correct configuration starts appearing higher up in the list. The system is perfectly trained when the first guess on its list is always the one the player intended.
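
The learning loop described above can be sketched in a few lines. This is a toy stand-in and not the model used in the game: it simply counts how often each word has appeared alongside each chosen move and ranks candidates by those counts. The class name, method names and move identifiers are hypothetical.

```python
from collections import defaultdict


class InstructionLearner:
    """Toy learner: ranks candidate moves by word/move co-occurrence counts."""

    def __init__(self):
        # (word, move) -> number of times the player chose `move` for an
        # instruction containing `word`.
        self.counts = defaultdict(int)

    def rank(self, instruction, moves):
        """Return the candidate moves, best guess first."""
        words = instruction.lower().split()

        def score(move):
            return sum(self.counts.get((word, move), 0) for word in words)

        # sorted() is stable, so untrained candidates keep their given order.
        return sorted(moves, key=score, reverse=True)

    def learn(self, instruction, chosen_move):
        """Record which move the player picked for this instruction."""
        for word in instruction.lower().split():
            self.counts[(word, chosen_move)] += 1


moves = ["remove_red", "remove_cyan", "add_red"]  # hypothetical move ids
learner = InstructionLearner()
learner.learn("remove cyan", "remove_cyan")
learner.learn("remove red", "remove_red")
learner.learn("add red", "add_red")
print(learner.rank("remove red", moves)[0])
```

After a few rounds of feedback the intended move rises to the top of the list, which is the behaviour the game rewards.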

Sixteen participants took part in our experiment. Half of them played the game with no restriction, but the other half were given specific instructions: they were only allowed to use the following 11 words: *all, cyan, red, brown, orange, except, leftmost, rightmost, add, remove, to*.
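
A restriction like this is straightforward to enforce in software. The sketch below checks an instruction against the 11 permitted words listed above; the function name and the whitespace-based tokenisation are assumptions for illustration and do not describe how the study itself enforced the restriction.

```python
# The 11 words permitted in the restricted condition.
ALLOWED_WORDS = frozenset({
    "all", "cyan", "red", "brown", "orange", "except",
    "leftmost", "rightmost", "add", "remove", "to",
})


def disallowed_words(instruction):
    """Return the words in `instruction` that fall outside the restricted vocabulary."""
    return [word for word in instruction.lower().split() if word not in ALLOWED_WORDS]


print(disallowed_words("remove all except leftmost red"))
print(disallowed_words("Please remove the red blocks"))
```

An empty list means the instruction stays within the restricted language; otherwise the returned words could be shown to the user as a prompt to rephrase.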

We measured the quality of the final system (i.e., how successfully the computer learnt to map language to instructions) as well as the cognitive load on participants. We found, unsurprisingly, that in the **non**-restricted setting people used a much wider variety of words, and much longer sentences. However, the restricted language participants seemed to be able to train their systems more effectively. Participants in the restricted language setting also reported needing less effort, and perceived their performance to be higher.

By imposing restrictions, we achieved the same or better system performance, without detriment to the user experience – indeed, participants reported lower effort and higher performance. We think that a guided, consistent language helps users understand the limitations of a system. That’s not to say we’ll never desire a system that understands arbitrary human language. But given the current capabilities of AI systems, we will see diminishing returns in user experience and performance by attempting to accommodate arbitrary natural language input. Rather than considering only the two extremes (a specialised graphical user interface or a completely natural language interface), designers should consider restricted language interfaces, which trade off full expressiveness for simplicity, learnability and consistency.

Here’s a summary in the form of a poem:

*It was not meant to be this way*

*This human dance of veiled intent
The spoken word and written hand*

*But let us meet at halfway point
And share our thoughts with less*

*To know each other’s will and wish
— not guess*

Want to learn more about our study? Read it here (click to download PDF) or see the publication details below:

Mu, Jesse, and Advait Sarkar. “Do We Need Natural Language?: Exploring Restricted Language Interfaces for Complex Domains.” In *Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems*, p. LBW2822. ACM, 2019. https://dl.acm.org/citation.cfm?doid=3290607.3312975

To provide more scalable and accessible treatment, we could use Artificial Intelligence-driven chatbots to provide a therapy session. It might not (currently) be as effective as a human therapist, but it is likely to be better than no treatment at all. At least one study of a chatbot therapist has shown limited but positive clinical outcomes.

My student Samuel Bell and I were interested in finding out whether chatbot-based therapy could be effective not just clinically, but also in terms of how patients *felt* during the sessions. Clinical efficacy is only one marker of a good therapy session. Others include **sharing ease** (i.e., does the patient feel able to confide in the therapist), **smoothness of conversation**, **perceived usefulness**, and **enjoyment**.

To find out, we conducted a study. Ten participants with sub-clinical stress symptoms took part in two 30-minute therapy sessions. Five participants had their sessions with a human therapist, conducted via chat through an internet-based CBT interface. The other five had therapy sessions with a simulated chatbot, through the same interface. At the end of the study, all participants completed a questionnaire about their experience.

We found that in terms of sharing ease and perceived usefulness, neither the human nor the simulated chatbot emerged as the clear winner, although participants’ remarks suggested that they found the chatbot less useful. In terms of smoothness of conversation and enjoyment, the chatbot was clearly worse.

Participants felt that the chatbot had a poor ability to “read between the lines”, and they felt that their comments were often ignored. One participant explained their dissatisfaction:

“It was a repetition of what I said, not an expansion of what I said.”

Another participant commented on the lack of shared experience:

“When you tell something to someone, it’s better, because they might have gone through something similar… there’s no sense that the robot cares or understands or empathises.”

Our study has a small sample size, but nonetheless points to clear deficiencies in chatbot-based therapy. We suggest that future research into chatbot CBT acknowledges and explores these areas of conversational recall, empathy, and the challenge of shared experience, in the hope that we may benefit from scalable, accessible therapy where needed.

Want to learn more about our study? Read it here (PDF) or see the publication details below:

Bell, Samuel, Clara Wood, and Advait Sarkar. “Perceptions of Chatbots in Therapy.” In *Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems*, p. LBW1712. ACM, 2019. https://dl.acm.org/citation.cfm?id=3313072

Every day, designers create the world around us: every website you’ve visited, book or magazine you’ve read, every app you’ve used on your phone, every chair you’ve sat on, almost everything around you has been consciously designed by someone.

Since this activity is so important, an area of academia known as *design research* is concerned with studying how design works, and how to do it better. The ultimate aim for design research is to create a *theory* for designing a particular thing (such as websites, or books, or apps, or chairs) that teaches us how to design that thing well. One way to try and produce these theories is to actually design a bunch of things (websites or books or apps or chairs), and then document what worked and what didn’t. This is the basic idea behind *research through design*. In the rest of this post, I’ll explain a bit more about research through design, as well as what we can realistically expect from the theories we produce using this process.

**What’s research through design?**

Design shifts the world from its current state into a “preferred” state through the production of a designed artefact. My PhD dissertation described the design of two visual analytics tools I developed, with a focus on documenting and theorising those aspects of the design that (a) facilitate the specific user tasks I identified as being important and (b) reduce expertise requirements for users. Thus, the approach to knowledge production was *research through design* (Frayling, 1993). The distinction between research through design, and merely design, is one of intent. In the former, design is practised with the primary intent of producing knowledge for the community of academics and practitioners. Consequently, the design artefact cannot stand in isolation – it must be accompanied by some form of discourse intended to communicate the embodied knowledge result to the community. Moreover, this discourse must make explicit how the artefact is sufficiently novel to contribute to knowledge. In a non-research design activity, neither annotative discourse nor novelty is necessary for success.

Zimmerman et al. (2007) propose four general criteria for evaluating research through design contributions: process, invention, relevance, and extensibility. Process refers to the rigour and rationale of the methods applied to produce the design artefact. Invention refers to the degree of academic novelty. Relevance refers to the ability of the contribution to have a wider impact. Extensibility is the ability of the knowledge as documented to be built upon by future research.

**Why is there no such thing as a complete theory of design?**

When designing systems in some domain, it may seem an attractive proposition to seek a theory of design that not only characterises specifically the nature of these systems and how their important properties may be measured, but also prescribes a straightforward, deterministic strategy for the design of such systems. When I started out in my research, I wanted to produce a theory for how to design systems that would let non-experts use visual tools to perform statistics and machine learning. I initially anticipated that such a prescriptive theory would be elusive for multiple reasons, including the nascency of interactive machine learning, the incomplete characterisation of potential applications, and a wariness of the challenges surrounding “implications for design” (Stolterman, 2008).

Towards the end of my PhD, I came to the position (and I still hold it) that a *complete* design theory is not only elusive, but impossible – not just for visual analytics tools, but *any* design domain. This is because theory underspecifies design, and design underspecifies theory (Gaver, 2012). Theory underspecifies design because a successful design activity must culminate as an *ultimate particular* (Stolterman, 2008): an instantiated, designed artefact, subject to innumerable decisions, situated in a particular context, and limited by time and resource constraints. Design problems are inherently *wicked problems* (Buchanan, 1992); they can never be formulated to a level of precision which affords ‘solving’ through formal methods, and no theory for design can profess to provide a recommendation for every design decision. Conversely, design underspecifies theory, in the sense that an ultimate particular will fail to exemplify some, or even many, of the nuances captured in an articulated theory.

This is not to say that we should do away with theory altogether and focus solely on artefacts themselves. Gaver’s view, to which I am sympathetic, is that design theory is “provisional, contingent, and aspirational”. The aim of design theory is to capture and communicate knowledge generated during the design process, in the belief that it may sometimes, but not always, lead to successful designs in the future.

*This post is heavily based on an excerpt from my 2016 PhD thesis:
*Interactive analytical modelling,

Frayling, Christopher. Research in art and design. Royal College of Art, London, 1993.

Zimmerman, John; Forlizzi, Jodi, and Evenson, Shelley. Research through design as a method for interaction design research in HCI. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 493–502. ACM, 2007.

Stolterman, Erik. The nature of design practice and implications for interaction design research. International Journal of Design, 2(1), 2008.

Gaver, William. What should we expect from research through design? In Proceedings of the SIGCHI conference on human factors in computing systems, pages 937–946. ACM, 2012.

Buchanan, Richard. Wicked problems in design thinking. Design issues, 8(2):5–21, 1992.

Have you ever wondered whether doctors are consistent in their judgements? In some cases, they really aren’t. When asked to rate videos of patients with multiple sclerosis (a disease that causes impaired movement) on a numeric scale from 0 being completely healthy to 4 being severely impaired, clinicians struggled to be consistent, often giving the same patient different scores at different times, and disagreeing amongst themselves. This difficulty is quite common, and not unique to doctors: people often have to assign scores to difficult, abstract concepts, such as “How good was a musical performance?” or “How much do you agree or disagree with this statement?” Time and time again, it has been shown through research that people are fundamentally inconsistent at this type of activity, no matter the setting or level of expertise.

The field of ‘machine learning’, which can help to automate such scoring (e.g. automatically rating patients according to their disability), is based on the idea that we can give the computer a set of examples for which the score is known, in the hope that the computer can use these to ‘learn’ how to assign scores to new, unseen examples. But if the computer is taught from examples where the score is inconsistently assigned, the result is that the computer learns to assign inconsistent, unusable scores to new, unseen examples.

To solve this problem, we brought together an understanding of how humans work with some mathematical tricks. The fundamental insight is that it is easier and more consistent for humans to provide preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”). The problem is, even if you have as few as 50 items to assign scores, you already have 50 × 49 = 2450 ordered ways of pairing them together (or 1225 if the order within a pair doesn’t matter). This balloons to nearly 10,000 (100 × 99 = 9900) when you have 100 items. Clearly, this doesn’t scale. So we scale this using a mathematical insight: namely, that if you’ve compared A to B, and B to C, you can guess with reasonably high accuracy what the relationship is between A and C. This ‘guessing’ is done with a computer algorithm called TrueSkill, which was originally invented to help rank people playing multiplayer games by their skill, so that they could be better matched to online opponents. Using TrueSkill, we can reduce the number of comparisons required by a significant amount, so that increasing the number of items no longer results in a huge increase in comparisons. This study has advanced our understanding of how people quantify difficult concepts, and has presented a new method which balances the strengths of people and computers to help people efficiently and consistently provide scores to many items.
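
The arithmetic behind this scaling problem is easy to check. The sketch below counts exhaustive pairings and, for contrast, the number of small sets needed if every item only has to appear in a fixed number of sets. The function names, the set size of 8, the figure of 400 items and the choice of two appearances per item are illustrative assumptions, not parameters reported for the study.

```python
import math


def ordered_pairs(n_items):
    """Number of (A, B) pairings where order matters: n * (n - 1)."""
    return n_items * (n_items - 1)


def unordered_pairs(n_items):
    """Number of distinct pairs when A-vs-B and B-vs-A count once."""
    return n_items * (n_items - 1) // 2


def sets_needed(n_items, set_size, appearances_per_item):
    """Sets to sort if each item must appear in `appearances_per_item` sets."""
    return math.ceil(n_items * appearances_per_item / set_size)


for n in (50, 100, 400):
    print(n, ordered_pairs(n), unordered_pairs(n), sets_needed(n, 8, 2))
```

Exhaustive pairing grows with the square of the number of items, whereas the number of sets grows only in proportion to it, which is why inferring the uncompared relationships matters so much.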

This study shows a new way to quickly and consistently have humans rate items on a continuous scale (e.g. “rate the happiness of the individual in this picture on a scale of 1 to 5”). It works through the use of preference judgements (e.g. “is this higher/lower/equal to that?”) as opposed to absolute value judgements (e.g. “is this a 4 or a 5?”), combined with an algorithmic ranking system which can reduce the need to compare every item with every other item. This was initially motivated by the need to have higher-quality labels for machine learning systems, but can be applied in any domain where humans have difficulty placing items along a scale. In our study we showed that clinicians can use our method to achieve far higher consistency than was previously thought possible in their assessment of motor illness.

We built a nifty tool to help clinicians perform Setwise Comparison, which you can see in the video below: https://www.youtube.com/watch?v=Q1hW-UXU3YE

This study describes a novel method for efficiently eliciting high-consistency continuous labels, which can be used as training data for machine learning systems, when the concept being labelled has unclear boundaries — a common scenario in several machine learning domains, such as affect recognition, automated sports coaching, and automated disease assessment. Label consistency is improved through the use of preference judgements, that is, labellers sort training data on a continuum, rather than providing absolute value judgements. Efficiency is improved through the use of comparison in sets (as opposed to pairwise comparison), and leveraging probabilistic inference through the TrueSkill algorithm to infer the relationship between data which have not explicitly been compared. The system was evaluated on the real-world case study of clinicians assessing motor degeneration in multiple sclerosis (MS) patients, and was shown to have an unprecedented level of consistency, exceeding widely-accepted clinical ‘gold standards’.

If you’re interested in learning more, we reported this research in detail in the following publications:

Setwise Comparison: Consistent, Scalable, Continuum Labels for Machine Learning

**Advait Sarkar**, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D’Souza, Peter Kontschieder, Samuel Rota Bulò, Lorcan Walsh, Christian P. Kamm, Yordan Zaykov, Abigail Sellen, Siân E. Lindley

*Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI 2016) (pp. 261–271)*

Setwise comparison: efficient fine-grained rating of movement videos using algorithmic support – a proof of concept study

Saskia Steinheimer, Jonas F. Dorn, Cecily Morrison, **Advait Sarkar**, Marcus D’Souza, Jacques Boisvert, Rishi Bedi, Jessica Burggraaff, Peter Kontschieder, Frank Dahlke, Abigail Sellen, Bernard M. J. Uitdehaag, Ludwig Kappos, Christian P. Kamm

*Disability and Rehabilitation, 2019*

(This was a writeup of our 2016 CHI paper for a medical audience)

- **Preparation**: Acquire 4 external hard drives, each as large as you wish, all of roughly the same capacity. I will refer to them as **A1**, **A2**, **I1** and **I2**.
- **Archival drives**: Drives A1 and A2 are *archival* drives. They contain data that you no longer keep on your primary computer, and data that you no longer expect to change. This might include photos, music, and old work. You must ensure that A1 and A2 always have the same content as each other.
- **Incremental backup drives**: Drives I1 and I2 are *incremental backup* drives. They will contain a versioned history of all the files on your primary computer. For instance, you can set them both to be Time Machine drives. Time Machine is the incremental/differential backup software that comes standard with Mac OS X (alternative solutions are available for other operating systems).
- **Location**: Drives A1 and I1 are stored at the same primary location, such as your home. Drives A2 and I2 are stored at a *different*, secondary location, such as your workplace.
- **What you need to do**: Update the content on A1 and A2 at your convenience, making sure they are always in sync. Make incremental backups with I1 and I2 as frequently as possible (at least once daily). With Time Machine this amounts to merely plugging in the drive (or connecting to the same network as the drive, if you use Time Capsule, or you can use something like a Transporter).

And that’s it.
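If you want to confirm that A1 and A2 really do hold the same content, a small script can compare them file by file. The following is only a sketch: the function names `snapshot` and `out_of_sync` are made up for illustration, and it assumes both drives are mounted and readable as ordinary folders.

```python
import hashlib
import os

def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            sha = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, root)] = sha.hexdigest()
    return digests

def out_of_sync(root_a, root_b):
    """Return relative paths that are missing from one tree or differ between the two."""
    a, b = snapshot(root_a), snapshot(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Calling `out_of_sync("/Volumes/A1", "/Volumes/A2")` (hypothetical mount points) would return an empty list when the two archival drives match, and otherwise the files that need attention.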

- In the event of data loss due to a *hardware or software* failure, that is, if one of the drives fails or the data on one of the drives gets corrupted, there is always another drive with a copy of the same data. This drive may be used until the failed/corrupt drive is replaced.
- In the event of data loss due to *human error*, such as accidentally deleting or overwriting a file, there are two incremental backups from which any historic version of the file can be restored.
- In the event of data loss due to *natural disasters* (such as a fire, power surge, or flood) or *theft*, which causes the drives in one location to be destroyed or stolen, there is always a duplicate of the drives in another location which may be used until the destroyed/stolen drives are replaced. This is what is known as an offsite backup.

- *Both archival drives or both incremental backup drives failing simultaneously*: this is extremely unlikely, but if you’re worried about it you can add a third drive of each type.
- *Failure to make incremental/archival backups often enough*: this is your problem, not a problem with the scheme.

This scheme can be directly implemented if:

- You primarily use one computer, which is a Mac
- Your day-to-day work does not create huge (i.e. comparable to the size of your hard drive), constantly changing files
- You do not care for third party services or cloud services (which often require recurring monthly fees)
- You are somewhat conscious of but not too restricted by price
- You are okay with waiting a few hours to get going again from your backups in case the hard drive in your computer fails and you can no longer boot

If the above do not apply to you, it is easy to adapt this solution for other use cases. For instance, you can easily modify the solution if:

- **You use Windows/Linux**: I believe Windows has an equivalent to Time Machine called “Windows Backup”. Linux users can probably fend for themselves and find something that works for them.
- **You primarily use multiple computers**: You will need an additional pair of incremental backup drives for each additional computer you use.
- **You need to be able to immediately continue from where you left off in case your computer stops working**: You will need to start creating *bootable clones*, which can be achieved using software such as Disk Utility (comes standard with Mac OS X), SuperDuper or Carbon Copy Cloner. For Windows users, Windows Backup can also create bootable clones. These can be stored on additional drives or on your archival drives.
- **You don’t mind third party or cloud services**: I recommend looking into a solution such as Crashplan or BackBlaze. You can use these services to augment the 4 drive solution or to replace it entirely, depending on your level of trust and the quality of your Internet connection.
- **You are extremely price conscious**: It is possible to implement this scheme with only two drives. In this scenario you will have to create two partitions on each drive, one for archival and the other for the incremental backup. The drives must of course still be stored at separate locations. I personally prefer the 4 drive version because (1) hard drives are not yet capacious enough that cheap commodity drives can be partitioned into useful sizes for those with lots of data, (2) partitioning necessitates erasing the drive, (3) I am leery of increased opportunities for filesystem corruption with multiple partitions, and (4) it is much less effort to replace drives if they only serve a single purpose.

Since you will be acquiring multiple drives, you have the opportunity to spread your risk even further. By buying drives from different brands, you reduce your vulnerability if any single manufacturer or hard drive model has a faulty run. It is also good to have a mix of hard drive ages, since very young as well as very old drives appear to have a higher failure rate than those between the ages of 1 and 3 years.

I hope this is of some use. I was tired of thinking about backups and tired of researching third party backup solutions, so I settled on this compact, no-frills setup that can cope with all major threats to your data. If you have a suggestion or notice a deficiency, please leave a comment!

- A non-significant result does **not** allow you to “accept” the null hypothesis.
- A high statistical power does **not** allow you to “accept” the null hypothesis.
- If you find yourself wanting to “prove” the null hypothesis when you are testing whether one variable affects another in a meaningful way, the proper way to do it is through **equivalence testing**.

This is a ridiculously common mistake. Suppose you’re comparing the heights of a group of men against the heights of a group of women using a t-test. The t-test spits out a p-value of 0.3, which is higher than your chosen significance level 0.05. Surely this means that the null hypothesis, which is that the group means are equal, is true, right?

**Wrong!**

Okay, what if the t-test spits out a p-value of 0.99? Surely this means that there is a 99% chance that the group means are equal, right?

**Wrong!**

If your p-value is greater than your significance level, you **cannot** conclude that the group means are equal. You can **only** conclude that your data does **not** refute the hypothesis that your group means are equal. The p-value is the probability of the t-test statistic being at least as extreme as the one you observed, assuming the group means are equal.

The logic we engage in with null hypothesis testing is this:

- If the null were true, we would not observe this data. (N → ¬D)
- We have observed this data. (D)
- Therefore the null is not true. (∴¬N)

This logic is sound because if N were true, D would be false. D is true, therefore N must not be true.

The faulty logic we engage in when we try “accepting” the null is this:

- If the null were true, we would not observe this data. (N → ¬D)
- We have not observed this data. (¬D)
- Therefore the null is true. (∴N)

This logic is unsound because D being false does not allow us to draw any inference about N, which may be true or false. If D is true, we know that N must be false, but there may be many reasons for D being false other than N. This fallacy has a name: affirming the consequent. Consider the following analogy, which has exactly the same form as our faulty logic:

- If the petrol tank is empty, the car will not move. (N → ¬D)
- The car will not move. (¬D)
- Therefore the petrol tank is empty. (∴N)

Do you see why this is wrong? There might be many other reasons that the car will not move: for example, the ignition may be broken, the wheels may be missing, or the car may have hit a wall. However, it is perfectly sound to say that if the car moves (D), then we reject the null hypothesis that the petrol tank is empty (∴¬N).

**Bottom line**: a non-significant p-value is not evidence of the null.
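A quick simulation makes the point concrete. The sketch below is not from the original analysis: it assumes two groups whose true means genuinely differ by 0.3 standard deviations, 20 observations per group, and a z-test with known standard deviation in place of a t-test (so that it runs with the standard library alone). The function names and parameter values are illustrative.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(sample_a, sample_b, sigma=1.0):
    """p-value of a two-sample z-test with known standard deviation sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def fraction_non_significant(true_difference, n, trials=2000, alpha=0.05, seed=1):
    """Share of simulated experiments with p > alpha even though the null is false."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        a = [rng.gauss(true_difference, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(a, b) > alpha:
            misses += 1
    return misses / trials

share = fraction_non_significant(true_difference=0.3, n=20)
print(f"{share:.0%} of experiments were non-significant despite a real difference")
```

Under these assumptions most of the simulated experiments come out non-significant even though the null hypothesis is false in every single one of them, which is why a non-significant result on its own says so little.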

“Okay fine, I get that you can’t use a non-significant p-value to support the null hypothesis.” I hear you say. “But I recall from my stats course that the *power* of a statistical test is the probability of correctly rejecting the null hypothesis. Surely this means that if my statistical power is pretty high, say 0.95, and my t-test fails to reject the null hypothesis, then there is a 95% chance that there is really no difference between the groups?”

**Wrong!**

Let us look a little more carefully at the actual definition of the power of a statistical test, and what it is useful for. The power of a statistical test is defined as the probability of rejecting the null hypothesis, *given that* the null hypothesis is indeed false. Power nearly always depends on (a) the level of statistical significance **α** at which you wish to reject the null hypothesis, (b) the magnitude **M** of the effect size of interest, and (c) the size **S** of the sample. So the power of a t-test is the probability that you observe a difference large enough in your sample **S** to be significant at the level **α**, given that there exists a true difference **M**. This is useful if you want to calculate how large your sample should be to be reasonably confident of detecting a true difference of a certain magnitude (or alternatively, what magnitude of difference you will be reasonably confident of detecting for a certain sample size).
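To show how **α**, **M** and **S** combine, here is a minimal sketch of a power calculation. It assumes a two-sided, two-sample z-test with a known, common standard deviation (a simplification of the t-test discussed above), with the effect size expressed in standard deviations. The function names are made up for this example.

```python
from statistics import NormalDist

def power_two_sample_z(alpha, effect_size, n_per_group):
    """Power of a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the (known,
    common) standard deviation; n_per_group is the size of each group.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)

def sample_size_for_power(alpha, effect_size, target_power):
    """Smallest group size whose power reaches target_power."""
    n = 2
    while power_two_sample_z(alpha, effect_size, n) < target_power:
        n += 1
    return n
```

For example, `sample_size_for_power(0.05, 0.5, 0.8)` searches for the group size needed to detect a half-standard-deviation difference 80% of the time. Note that when `effect_size` is zero the function simply returns **α**: with no true effect, the only rejections are false positives.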

“But that’s what I said!” I hear you say. “So if you’re 99% confident of detecting a true difference, and you fail to detect a true difference, this must mean that there is a high probability that there is no difference, right?”

**Wrong!**

A high statistical power is just as useless in conclusively telling you anything about whether the null hypothesis is true. Concretely, in the case of a t-test, a high statistical power combined with a non-significant p-value does **not** allow you to claim that the null hypothesis is true.

To see why this is the case, consider the reasoning that we *think* we follow when we do this:

- If the null hypothesis were false, then we will reject it. (¬N → R)
- We have not rejected it. (¬R)
- Therefore the null hypothesis is true. (∴N)

This is actually logically sound, and is exactly the same logic we engage in when we do standard null hypothesis significance testing. The problem is that this is an incorrect translation of the problem to logic. In this case, it is not okay to go from the probability ℙ(reject null | null is false) to ℙ(null is false → reject null). The latter is the same as ℙ( ¬ null is false OR reject null), i.e. ℙ(null is true OR reject null). The former is the proportion of tests that detect true effects, whereas the latter is the proportion of tests that detect effects plus the proportion of tests where there genuinely was no effect.

Consider this analogy, courtesy of David Poole and Alan Mackworth:

“Suppose you have a domain where birds are relatively rare, and non-flying birds are a small proportion of the birds. Here *P(¬flies | bird)* would be the proportion of birds that do not fly, which would be low. *P(bird →¬flies)* is the same as *P(¬bird ∨ ¬flies)*, which would be dominated by non-birds and so would be high. Similarly, *P(bird →flies)* would also be high, the probability also being dominated by the non-birds. It is difficult to imagine a situation where the probability of an implication is the kind of knowledge that is appropriate or useful.”
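The difference between the two quantities is easy to check with numbers. The counts below are invented purely for illustration (1000 animals, of which 10 are birds and 1 is a flightless bird); only the comparison between the conditional probability and the probability of the implication matters.

```python
# Illustrative counts (made up): 1000 animals, only 10 of them birds.
population = (
    [{"bird": True, "flies": True}] * 9
    + [{"bird": True, "flies": False}] * 1
    + [{"bird": False, "flies": True}] * 90
    + [{"bird": False, "flies": False}] * 900
)

def prob(event, given=lambda animal: True):
    """Proportion of animals satisfying `event` among those satisfying `given`."""
    relevant = [animal for animal in population if given(animal)]
    return sum(1 for animal in relevant if event(animal)) / len(relevant)

is_bird = lambda animal: animal["bird"]
flightless = lambda animal: not animal["flies"]

# P(¬flies | bird): restrict attention to birds only.
conditional = prob(flightless, given=is_bird)
# P(bird → ¬flies) = P(¬bird ∨ ¬flies): every non-bird counts in its favour.
implication = prob(lambda animal: (not is_bird(animal)) or flightless(animal))

print(conditional, implication)
```

With these counts the conditional probability is low while the probability of the implication is close to 1, because the 990 non-birds all satisfy the implication trivially.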

Moreover, when power is calculated after the fact from the observed effect size (the “observed power”) and the significance level **α**, it is a function of the p-value; that is to say, the two have a 1:1 relationship. This relationship is inverse: the lower the p-value, the higher the observed power. Why? Because, all else being equal, a larger observed effect yields both a lower p-value and a higher estimate of the power to detect an effect of that size.

Bearing this in mind, it is *contradictory* to use high power as evidence for the null, since a high power corresponds to a low p-value, and we (correctly) use a low p-value as evidence *against* the null. If you run two tests and one of them has a higher power than the other, it does not provide more evidence *for* the null because it must also simultaneously have a lower p-value, which is evidence *against* the null. This fact is explained in much greater detail by Hoenig and Heisey in “The Abuse of Power” (pdf).
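The 1:1 relationship is easiest to see for power computed after the fact from the observed effect (what Hoenig and Heisey call “observed power”). The sketch below assumes a two-sided z-test rather than a t-test, and the function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Post hoc power of a two-sided z-test, treating the observed effect as the true one."""
    observed_z = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| that produced this p-value
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(observed_z - critical) + STD_NORMAL.cdf(-observed_z - critical)

for p in (0.01, 0.05, 0.3, 0.9):
    print(f"p = {p:<4}  observed power = {observed_power(p):.2f}")
```

Nothing but the p-value (and **α**) goes into the function, so the observed power carries no information beyond the p-value itself. It falls steadily as the p-value rises, and a result sitting exactly on the significance threshold has an observed power of about one half.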

**Bottom line**: a high statistical power is not evidence of the null.

It is often the case that the very thing you want to prove is the absence of an effect. In this situation, you *cannot* use any test that assumes the absence of an effect as its null hypothesis. As the core of null hypothesis significance testing is proof by contradiction, you need to use a test that assumes the *presence* of an effect, and then show that the observed data is very unlikely under that assumption.

I will not attempt to describe these techniques in detail in this post, except to say that they are generally referred to as “equivalence testing” (a one-sided version of this, which assumes the presence of an effect in a particular direction, is often referred to as a “noninferiority test”). A very common way of doing this is through what is known as the “two one-sided tests” procedure, or TOST, explained very thoroughly by David Streiner in “Unicorns Do Exist” (pdf). It basically boils down to picking an “equivalence interval” such that our null hypothesis is “the difference between means lies outside this equivalence interval”. The alternative hypothesis then becomes “the difference between means lies within the equivalence interval”, i.e. the difference between means is sufficiently small that we consider them equivalent.
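As a rough illustration of the two one-sided tests idea, here is a minimal sketch. It assumes a symmetric equivalence interval of ±`margin` around zero and uses a normal approximation to the test statistic to stay dependency-free; a real analysis would normally use the t distribution. The function name and the example numbers are hypothetical.

```python
from statistics import NormalDist, mean, variance

def tost_p_value(sample_a, sample_b, margin):
    """Two one-sided tests for equivalence of means within ±margin.

    Uses a normal approximation to the test statistic (an assumption made to
    keep the sketch dependency-free; a t distribution is more usual).
    """
    diff = mean(sample_a) - mean(sample_b)
    se = (variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)) ** 0.5
    std_normal = NormalDist()
    # Null 1: diff <= -margin. A small p means the difference is above -margin.
    p_lower = 1 - std_normal.cdf((diff + margin) / se)
    # Null 2: diff >= +margin. A small p means the difference is below +margin.
    p_upper = std_normal.cdf((diff - margin) / se)
    # Equivalence is claimed only if both one-sided nulls are rejected.
    return max(p_lower, p_upper)

group_a = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
group_b = [10.1, 10.0, 9.9, 10.2, 9.8, 10.1, 10.2, 9.9]
print(tost_p_value(group_a, group_b, margin=0.5))
print(tost_p_value(group_a, group_b, margin=0.05))
```

The same data can support equivalence for a generous margin and fail to support it for a strict one, which is why the equivalence interval has to be chosen on substantive grounds before looking at the data.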

**Bottom line**: to show the absence of an effect, use a test where the null hypothesis is the presence of the effect.

Thanks for sticking around, and hopefully you have taken away 3 important things about “accepting” the null hypothesis. I urge you to look into equivalence testing in more detail and become comfortable and familiar with its techniques, and encourage your colleagues to be aware of these common fallacies. Additionally, I have to give credit to this wonderful series of blog posts that inspired this one. To conclude, remember:

- A high p-value does not mean the null hypothesis is true
- Neither does high power
- To show the absence of an effect, use equivalence testing

Best of luck!

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

- The integral of the PDF has no closed-form solution, and/or
- The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a *probabilistic* algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point **x** in the space. Remember P(**x**) as given by the PDF.
2. Jump away from **x** by a random amount in a random direction, to arrive at point **x’**. If P(**x’**) is greater than P(**x**), add **x’** to the output sequence. Otherwise, if it is less, decide to add it to the output sequence with probability P(**x’**)/P(**x**).
3. If you have decided to add **x’** to the output sequence, move to the new point and repeat the process from step 2 onwards (i.e. jump away from **x’** to some **x”**, and if you add **x”**, then jump away from it to **x”’** etc). If you did not add **x’** to the sequence, then return to **x** and try to generate another **x’** by jumping away again by a random amount in a random direction.
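The steps above can be sketched in a few lines of Python. This is a minimal one-dimensional version under stated assumptions: the jump is Gaussian noise with a hypothetical standard deviation `step_sd`, the example target is a standard normal given only up to a constant factor, and the sampler follows the textbook convention of recording the current point again whenever a jump is rejected (slightly different from the wording of the steps above). It also assumes the density is never zero at a visited point.

```python
import math
import random

def metropolis(density, start, n_samples, step_sd=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional density.

    density only needs to be proportional to the target PDF, because the
    algorithm uses nothing but the ratio density(x') / density(x).
    """
    rng = random.Random(seed)
    x = start
    samples = []
    while len(samples) < n_samples:
        proposal = x + rng.gauss(0.0, step_sd)      # jump away from x
        ratio = density(proposal) / density(x)
        if ratio >= 1 or rng.random() < ratio:       # accept with probability min(1, ratio)
            x = proposal
        # Textbook convention: record the current point on every iteration,
        # so a rejected jump records x again.
        samples.append(x)
    return samples

def unnormalised_normal(x):
    """Standard normal density without its normalising constant."""
    return math.exp(-0.5 * x * x)

# Discard the first 1000 draws, which still remember the starting point.
draws = metropolis(unnormalised_normal, start=3.0, n_samples=20000)[1000:]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
print(sample_mean, sample_var)
```

If the sampler is working, the printed mean and variance should be close to 0 and 1, the mean and variance of the target distribution.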

The PDF of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from *x* is achieved by adding some random noise to it; this is usually chosen to be a random number from a normal distribution centred at

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and depending on the result, either choose to stay at the new point or return to the old point (it turns out that in practice, this means choosing the lower point with probability P(*x’*)/P(*x*)).

A nifty trick is to not use the desired PDF to calculate P(*x*) directly, but instead to use a function

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte Carlo algorithm. There are two parts to this. The first is “Markov chain”: this simply refers to the fact that at each step of the algorithm we only consider the point we visited immediately previously; we do not remember more than the last step we took in order to compute the next step. The second is “Monte Carlo”: this simply means that we are using randomness in the algorithm, and that the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence *converges* to the desired distribution as we draw more and more samples; a very short sequence may not look like it follows the desired probability distribution at all.

There is one snag with Metropolis-Hastings: it might be too slow for some applications, because it can need quite a lot of samples before the generated distribution starts to match the desired distribution. One improvement is called Hamiltonian Monte Carlo. Instead of jumping in a random direction according to a normal distribution, think of being a ball rolling around the hilly area — as it goes down slopes, it rolls faster and gathers momentum, which it loses when it climbs up slopes. In practice, Hamiltonian Monte Carlo achieves a better approximation of the desired distribution in many fewer samples than Metropolis-Hastings.
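For the curious, here is a rough one-dimensional sketch of the rolling-ball idea. It is an illustration under assumptions rather than a reference implementation: the target is a standard normal, the “height” is expressed as a negative log density, the trajectory is simulated with the usual leapfrog scheme, and the step size and number of leapfrog steps are arbitrary choices that happen to suit this target.

```python
import math
import random

def hmc(neg_log_density, gradient, start, n_samples, step=0.15, n_leapfrog=10, seed=0):
    """One-dimensional Hamiltonian Monte Carlo with a leapfrog integrator."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        momentum = rng.gauss(0.0, 1.0)               # give the ball a random push
        new_x, new_p = x, momentum
        new_p -= 0.5 * step * gradient(new_x)        # half step for momentum
        for i in range(n_leapfrog):
            new_x += step * new_p                    # full step for position
            if i < n_leapfrog - 1:
                new_p -= step * gradient(new_x)      # full step for momentum
        new_p -= 0.5 * step * gradient(new_x)        # final half step for momentum
        # Total energy = potential (negative log density) + kinetic.
        old_energy = neg_log_density(x) + 0.5 * momentum ** 2
        new_energy = neg_log_density(new_x) + 0.5 * new_p ** 2
        # Accept with probability min(1, exp(old_energy - new_energy)).
        if rng.random() < math.exp(min(0.0, old_energy - new_energy)):
            x = new_x
        samples.append(x)
    return samples

# Standard normal target: negative log density x^2 / 2, whose gradient is x.
hmc_draws = hmc(lambda x: 0.5 * x * x, lambda x: x, start=3.0, n_samples=5000)[100:]
hmc_mean = sum(hmc_draws) / len(hmc_draws)
hmc_var = sum((d - hmc_mean) ** 2 for d in hmc_draws) / len(hmc_draws)
print(hmc_mean, hmc_var)
```

Because each proposal is the end point of a whole simulated trajectory rather than a single small random jump, consecutive samples tend to be much less correlated than in the random-walk sampler, at the cost of needing the gradient of the log density.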

It works in really quite a simple manner. Let the cumulative distribution functions of the two distributions be CDF_{A} and CDF_{B} respectively. We simply measure the maximum difference between these two functions for any given argument. This maximum difference is known as the Kolmogorov-Smirnov *statistic*, D, and is given by:

$$D = \max_{x} \left| \mathrm{CDF}_{A}(x) - \mathrm{CDF}_{B}(x) \right|$$

You can think about it this way: if you plotted CDF_{A} and CDF_{B} together on the same set of axes, D is the length of the largest vertical line you could draw between the two plots.

To perform the Kolmogorov-Smirnov *test*, one simply compares D to a table of thresholds for statistical significance. The thresholds are calculated under the null hypothesis that the distributions are equal. If D is too big, the null hypothesis is rejected. The threshold for significance depends on the size of your sample (as your sample gets smaller, your D needs to get larger to show that the two distributions are different) and, of course, on the desired significance level.
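Here is a small sketch of the two-sample version, assuming the CDFs are estimated from data as empirical step functions. The threshold function uses the standard large-sample approximation instead of a lookup table, so it should not be trusted for very small samples. The function names are illustrative.

```python
import math
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gaps = (
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b   # the gap can only change at an observed value
    )
    return max(gaps)

def ks_threshold(n_a, n_b, alpha=0.05):
    """Large-sample approximation to the critical value of D."""
    c_alpha = (-0.5 * math.log(alpha / 2)) ** 0.5
    return c_alpha * ((n_a + n_b) / (n_a * n_b)) ** 0.5

d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(d, ks_threshold(4, 4))
```

The null hypothesis of equal distributions would be rejected when `ks_statistic` exceeds `ks_threshold`. The threshold shrinks as the samples grow, matching the point above that smaller samples need a larger D.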

The test is non-parametric or distribution-free, which means it makes no assumptions about the underlying distributions of the data. It is useful for one-dimensional distributions, but does not generalise easily to multivariate distributions.
