We typically think of charts as the end result of data analysis. To create a chart in Excel, you must first select some data. To produce a chart in Python or R using charting libraries, you must provide an array, data table or data frame. When William Playfair invented the line and bar charts in his Commercial and Political Atlas (1786), he conceived of them as ways of visualising economic data to give his readers greater understanding. Florence Nightingale’s beautiful rose diagrams of the 1850s were invented to visualise Crimean War mortality (and in particular, how most deaths were due to preventable disease, not the war itself) as a rhetorical device in her quest to improve hygiene. In the same decade, John Snow’s cholera maps visualised the locations of disease clusters, helping authorities see the connection to the Broad Street water pump and the tainted water of the Lambeth Waterworks Company.
The idea that charts must be produced from data, as a way of depicting pre-existing data, is deeply ingrained in our tools and our historical uses for charts. This one-directional movement from data to charts makes intuitive sense, since how can a chart have any meaning in the absence of data? How could we even construct it?
But what if I told you it could go the other way: that the chart could come first, and be used to produce and control data? There are, in fact, some scenarios in which it is possible and useful to do this, and in this article I will share two examples.
Charts can generate data that improves human communication
At companies like BT, data analysts often have conversations with non-analysts. The non-analyst typically has a request. In the simplest cases, the request is just for data; the analyst is asked only to gather it and send it without further analysis. In other cases the non-analyst may request an insight, where the analyst is asked to investigate a specific question or hypothesis. Finally, the analyst might be asked to build a model for use by the non-analyst.
In these conversations, the analyst tries to achieve clarity about the request and create a shared understanding of the work that will be done, outcomes to be expected, and potential problems and limitations. Often the non-analyst’s first request is missing details. In guiding the non-analyst to refine a question, the analyst shares and exercises their domain and statistical knowledge.
The problem is that the data isn’t always available during these discussions. In large organisations, the data relevant to a particular request might be spread across many different files, tables, databases, and reside with different users and teams. It might require “cleaning”: analyst-speak for dealing with incorrect values, duplicates, missing values, and other data quality problems. Some data might require requisition forms and ethics approval processes before the analyst can even look at it. There are therefore many overheads in finding the right data to start analysis.
Not having data to look at can cause the discussion between analyst and non-analyst to suffer. It is much easier to achieve a shared understanding of the problem and the work that needs to be done if both participants in the conversation can see a common visual aid.
We envisioned a tool that lets both participants sketch out and produce data during the conversation itself. The analyst constructs a chart by dragging, dropping, and editing individual shapes called “kernels” on a shared canvas. Think of kernels as building blocks of different shapes, such as lines, periodic waves and peaks, that can be composed to create a dataset with any shape you want. As the analyst builds a chart, we generate data to match. Meanwhile, the non-analyst has access to annotations: they can draw arrows, speech bubbles, lines, and circles on the canvas to tie their questions to a specific aspect of the data. As the analyst shapes the dataset over the course of the conversation, and the non-analyst adds annotations, snapshots of the canvas are saved. These snapshots serve as a “graphical history” of the conversation, enabling users to reflect on how their thinking has developed and to return to earlier ideas.
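To make the kernel idea concrete, here is a minimal sketch in Python of how composable kernels could generate a dataset. The kernel names, parameters, and the pointwise-sum composition are illustrative assumptions, not the actual tool’s implementation.

```python
import math

# Hypothetical "kernels": each returns a function of x contributing one shape.
# These names and signatures are illustrative, not the real tool's API.
def line(slope, intercept):
    return lambda x: slope * x + intercept

def wave(amplitude, period):
    return lambda x: amplitude * math.sin(2 * math.pi * x / period)

def peak(centre, height, width):
    return lambda x: height * math.exp(-((x - centre) ** 2) / (2 * width ** 2))

def compose(*kernels):
    """Sum the kernels pointwise to form a single data-generating function."""
    return lambda x: sum(k(x) for k in kernels)

# An upward trend, weekly seasonality, and a one-off spike around x = 30.
sketch = compose(line(0.5, 10), wave(3, 7), peak(30, 20, 2))
data = [sketch(x) for x in range(60)]
```

Because each kernel is just a function, adding, removing, or re-parameterising a shape regenerates the whole dataset immediately, which is what makes live sketching during a conversation feasible.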
In a study, we asked analysts to use this tool to create datasets based on short textual descriptions of the data. We found that participants were able to use the tool quickly and that the kernel composition approach was intuitive. Moreover, participants really saw the value of such a data sketching tool in these conversations, saying, for example: “One of the biggest challenges I have in communicating to a client is to create data to try and describe what I’m expecting to see, so actually being able to manipulate the dataset like that is a really useful concept.”
Controls on charts can control the underlying data
Charts can also help us analyse large datasets, where the size of the data means we cannot analyse all of it precisely and must therefore resort to approximation. While modern consumer laptops and desktop computers are very fast, they are often not powerful enough to give interactive performance on large datasets.
Consider the owner of a small business, perhaps a café, trying to look at 5 years of point-of-sale data to figure out which type of coffee is selling the most. If she sold 200 items a day, 300 days a year, that’s 300,000 data points to analyse. It is a large dataset to store, and computing the answer on a consumer PC could take a long time, perhaps several minutes. Now imagine that this wasn’t her only question about the data: she wants to analyse it interactively, charting different views and drafting quick formulas. Such a workflow would be made impractical by a several-minute delay after each keypress or mouse click.
Large companies work around this by buying powerful hardware, storing their data in specialised database software in large data centres, and hiring expert data management professionals. But those solutions are out of reach for most small businesses and private individuals, who have a spreadsheet, a laptop, and a question.
There is a technique that can help: approximation. A family of algorithms known as probabilistic algorithms can provide near-instant results in exchange for a small, quantifiable chance of error, and these can be used to answer certain types of analytical questions. A more general-purpose approximation technique is sampling. Consider the café owner we met earlier. Instead of trying to analyse all 300,000 points, she could instead select a random sample of, say, 3,000 data points and compute the best-selling coffee in that sample. The estimate she would achieve this way does have a small chance of error, but through statistics it is possible to quantify the margin of error. She could repeat the process with multiple samples, or take a larger sample if she was unsatisfied with the level of accuracy in the estimate. This may not be necessary: if the estimate shows that the best-selling coffee vastly outsells the second-best-selling coffee, as is often the case in real-world data, then the difference between the two proportions is likely to be outside the error margin.
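As a rough sketch of the sampling arithmetic, the following Python snippet estimates a best-seller’s share from a 3,000-point sample and quantifies the margin of error. The coffee names and proportions are invented for illustration.

```python
import random

random.seed(42)  # reproducible illustration

# Invented stand-in for ~5 years of point-of-sale records.
population = ["latte"] * 150_000 + ["espresso"] * 90_000 + ["filter"] * 60_000

# Estimate the best-seller's share from a random sample instead of a full scan.
sample = random.sample(population, 3_000)
p = sample.count("latte") / len(sample)

# Approximate 95% margin of error for a proportion: 1.96 * sqrt(p(1-p)/n).
margin = 1.96 * (p * (1 - p) / len(sample)) ** 0.5

print(f"estimated share of lattes: {p:.3f} ± {margin:.3f}")
# With n = 3,000 the margin is under ±0.02, far smaller than the
# 0.20 gap between the best and second-best seller in this data.
```

This is the heart of the café owner’s shortcut: a 3,000-point sample answers the question to within a couple of percentage points, at a hundredth of the cost of scanning all 300,000 records.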
In practice, using approximation techniques is difficult, both because consumer analytics tools do not support them well and because the techniques have complex interfaces and many parameters. To use them, the analyst needs to assess the uncertainty in an estimate and use that assessment to decide what to do next: whether to refine the estimate further or accept it as good enough for their current purposes. Estimates of uncertainty are often visualised as error bars on scatter plots and bar charts: lines that indicate a window within which the true value is likely to fall. The larger the window, the greater the uncertainty.
To make these techniques easy to understand and use, we thought of using the error bars themselves as a mechanism for controlling the estimation process. In our interface, the user drags the ends of error bars when they wish to reduce the uncertainty associated with a particular estimate. When dragging, a horizontal indicator appears, showing how long it will take to recompute. This “resource cost estimation” bar allows the user to judge whether they are willing to invest their resources (in this case, time) in exchange for an improvement in accuracy.
The user can request a specific amount of uncertainty by dragging the bars varying amounts. When they reach the desired level of uncertainty, they stop dragging, which starts the recomputation. While recomputation is performed, the cost estimation indicator shrinks to reflect the time elapsed, and the point and its error bars move to their newly accurate position.
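The drag interaction can be grounded in the same proportion arithmetic: halving the width of an error bar roughly quadruples the required sample size, and that is the trade-off the cost indicator must convey. A minimal sketch, with a hypothetical processing throughput (not a figure from our system):

```python
import math

def required_sample_size(target_margin, p=0.5, z=1.96):
    """Smallest n such that the 95% CI half-width z*sqrt(p(1-p)/n)
    is at most target_margin; p = 0.5 is the worst case."""
    return math.ceil(z ** 2 * p * (1 - p) / target_margin ** 2)

def recompute_seconds(n, rows_per_second=100_000):
    # Assumed throughput; a real indicator would be calibrated per machine.
    return n / rows_per_second

# Dragging the bar from ±5% to ±1% increases the work roughly 25-fold.
for target in (0.05, 0.02, 0.01):
    n = required_sample_size(target)
    print(f"±{target:.0%} needs n = {n:,} (~{recompute_seconds(n):.2f}s)")
```

The quadratic relationship is why the resource cost indicator matters: small-looking drags near the end of an error bar can demand disproportionately large amounts of computation.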
We ran an experiment where 39 participants without formal training in statistics each used the interface to answer a series of simple statistical comparison questions. In each question, two points with error bars were given, and the participant was asked to describe the relationship between them: whether one was higher than the other, whether they were equal, or whether it was not possible to tell. The error bars were draggable, so the participants could refine the estimates before giving an answer.
We found that most participants successfully discovered the drag operation without it being explained beforehand; from this we conclude that the interface corresponds well with users’ prior assumptions about how one might manipulate uncertainty in a visualisation. Most participants used drag operations throughout the experiment to give reasonable and correct answers about the data. Importantly, they gave answers that they could not have given without the interface, because the initial uncertainty was too high to support certain inferences until it was deliberately reduced.
We also varied the estimated recomputation time so that in some tasks, it would take as little as 3 seconds to recompute the point with perfect certainty, but in others it would take up to 30 seconds. We found that this duration was negatively correlated with the amount that participants would choose to reduce uncertainty. Thus, the interface successfully communicated to users the implications of dragging the error bars by differing amounts, and led them to make informed decisions about how much computation they were willing to spend in refining their estimates.
We typically think of charts as coming after data, and as a way to visualise data that already exists, but that approach can be limiting. Charts can in fact be used to produce data, and also as a way to help control data. Our experiments have shown how data-producing charts can help analysts have better conversations with their colleagues, and how data-controlling charts can help non-experts interact with very large datasets on consumer hardware using sampling and approximation techniques.
And now, a summary poem:
Can there be count without account?
Or sense without amount?
But a baseless word is oft misheard
And time exchanged for smaller range.
We can know before we see,
and see before we know.
Mărăşoiu, Mariana, Alan F. Blackwell, Advait Sarkar, and Martin Spott. “Clarifying hypotheses by sketching data.” In Proceedings of the Eurographics/IEEE VGTC Conference on Visualization: Short Papers, pp. 125-129. 2016. Available at: https://advait.org/files/marasoiu_2016_sketching.pdf
Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Interaction with Uncertainty in Visualisations.” In EuroVis (Short Papers), pp. 133-137. 2015. Available at: https://advait.org/files/sarkar_2015_uncertainty_vis.pdf