In Defense of the Untidy Data Table
This blog post accompanies a paper to be presented at IEEE VIS 2021, “Untidy Data: The Unreasonable Effectiveness of Tables,” written by Lyn Bartram, Michael Correll, and Melanie Tory. For more details, read the paper: https://arxiv.org/abs/2106.15005!
The constant refrain from data scientists and other people who work with data is that they spend much (or even most) of their time cleaning, preparing, and otherwise wrangling data, activities that are nonetheless regarded as some of the least interesting or rewarding parts of their jobs. One reason this kind of work is necessary is that many of the systems and structures we use to make sense of our data are a lot like a fancy sitting room with expensive furniture and white carpets: very formal and tidy places that do not brook intrusion or disorder or play. Just as learning and play happen in messy environments, our paper looks at how untidiness in data, and particularly in data tables, is a very human way of making sense of data.
This work is the result of a months-long investigation into how people, especially people who don’t self-identify as data scientists (we call them “data workers” here to denote that they still do lots of work with their data; they just lack the credentials or the in-group status of the vaunted data scientist), make sense of their data. Our initial focus was broad and open-ended, with perhaps the idea that we’d look into the data prep pipeline, or find places where we could automate certain kinds of data cleaning. But our study of how data workers manipulate their data became, interview by interview, an exploration of the data table and the spreadsheet, the modalities that seemed inescapable when our participants described their experiences. Data tables enmeshed in spreadsheets were everywhere: in the work our participants did, in the ways they reported out to their stakeholders, and even, conspicuously, in their absence. Even participants who used purpose-built analytics tools felt frequent pressure to “drop out” back to the old reliable spreadsheet, and guilt whenever they did things the “easy” way in a table.
Why did our participants keep to their spreadsheets, when so much of data science marketing, pedagogy, and research seems focused on getting them to use more “sophisticated” tools (or on automating the work away from humans entirely)? We think it is because of the very human (and human-scale!) affordances that spreadsheets provide and other data analytics tools do not, and the very human (but untidy!) artifacts you can make with them.
While the paper covers our findings in more detail, we’ll focus here on the sorts of untidy (but useful) structures we saw, as well as the actions these structures enabled.
Once you are willing to accept a little untidiness in your data tables, your simple table can become everything from a presentation tool to a lab notebook, a diary, or a decision-making dashboard: a home base for all of your data work. Sparse (or missing) data, observations at multiple levels of detail, or even multiple unrelated data sets? All problems if you want to keep things tidy or import your data into standard analysis tools, but a cinch to handle in a spreadsheet.
A common pattern we saw was that our data workers would keep what we call a “master table”: their data from some vetted source, at an appropriate level of detail, which they would keep pristine and untouched. Once the master table had been constructed (occasionally an arduous process!), workers would set aside a place, either in the cells around the periphery of this central data table or in a duplicate table, where they could document or explore without worrying about messing up the original data.
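As a minimal sketch of this “master plus duplicate” pattern, the snippet below copies a vetted sheet into a scratch sheet so the original stays untouched. The file and sheet names here are assumptions for illustration, not details from the paper.

```python
# A hypothetical sketch of the "master table + workout space" pattern:
# the vetted master sheet stays untouched, and exploration happens in a copy.
from openpyxl import load_workbook

wb = load_workbook("master.xlsx")     # assumed workbook name
master = wb["Master"]                 # the pristine, vetted table

workout = wb.copy_worksheet(master)   # a duplicate sheet to scribble in
workout.title = "Workout"

wb.save("master_with_workout.xlsx")   # the original "Master" sheet is unchanged
```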
In these “workout spaces” (as aptly named by one of our participants) we saw all kinds of interactions designed to produce a view of the data that was human readable, actionable, and trustworthy. Annotations, where participants would mark up individual cells with color or (occasionally arcane) symbols to indicate data values of interest, importance, or potential low quality. Marginalia, where participants would add new information (and running commentary) in the periphery of the tables; occasionally this commentary was deemed interesting enough that it would be turned into a new column of the master table. Marginalia and annotations supported all sorts of data activities, from orienting oneself to an unfamiliar data source and presenting data to stakeholders, to triaging and filtering information down to just the most crucial areas, to generally enriching a dataset with all the context and expertise that a data worker can provide. These resulting tables also afford eyeballing, a nearly ubiquitous task for our workers that can include everything from looking for outliers or missing data, to confirming that a complex calculation worked as expected (two of our participants even reported using handheld calculators to make sure things added up!), to looking for new information that required immediate follow-up.
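To make these practices concrete, here is a small, hypothetical sketch of annotation, marginalia, and eyeballing done programmatically. It assumes a workbook named workout.xlsx with a “Workout” sheet whose column B holds numeric values, a header row, and a reported grand total in cell D1; none of these names or layouts come from the paper, and our participants did this kind of flagging by hand rather than in code.

```python
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.comments import Comment

wb = load_workbook("workout.xlsx")
ws = wb["Workout"]

flag = PatternFill(start_color="FFFFC7CE", end_color="FFFFC7CE", fill_type="solid")

notes_col = ws.max_column + 1                  # marginalia go in the first empty column
ws.cell(row=1, column=notes_col, value="notes")

total = 0
for row in range(2, ws.max_row + 1):
    cell = ws.cell(row=row, column=2)
    value = cell.value
    if value is None:                          # annotate missing data
        cell.fill = flag
        ws.cell(row=row, column=notes_col, value="missing -- follow up")
        continue
    total += value
    if value < 0:                              # annotate suspicious values
        cell.fill = flag
        cell.comment = Comment("negative value -- check source", "data worker")

# "Eyeballing" a calculation: compare the reported grand total in D1
# against the recomputed sum and leave a marginal note if they disagree.
reported = ws["D1"].value
if reported is not None and abs(reported - total) > 1e-9:
    ws.cell(row=1, column=notes_col + 1, value=f"total mismatch: {reported} vs {total}")

wb.save("workout_annotated.xlsx")
```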
One of the reasons why participants kept to their spreadsheets and tables was the ease with which they could add information, in their own language and corresponding to their own mental models, to arbitrary forms of data. An example was the task of joining two datasets together, which in many formal tools requires, at best, a bit of database expertise (or at least the lingo to tell your left, right, inner, and outer joins apart) and, at worst, complex chains of tooling to perform data densification, sparsification, or recoding. Rather than deal with that complexity, a few of our participants would just pivot the data into the form they wanted, sort by the columns the two datasets had in common, and copy and paste one table next to the other. These “copy and paste” joins were much easier for our participants to understand, perform, and verify.
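As a rough sketch of the idea, the snippet below contrasts a formal relational join with the sort-and-paste approach our participants described. The table contents and column names are made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame({"region": ["East", "West", "North"],
                       "orders": [120, 95, 40]})
returns = pd.DataFrame({"region": ["West", "East", "North"],
                        "returns": [7, 12, 3]})

# The "formal" route: a relational join, which requires knowing which kind
# of join you want (inner, left, right, outer).
joined = orders.merge(returns, on="region", how="left")

# The spreadsheet route: sort both tables by the shared column, then paste
# one next to the other and check by eye that the key columns line up.
left = orders.sort_values("region").reset_index(drop=True)
right = returns.sort_values("region").reset_index(drop=True)
pasted = pd.concat([left, right], axis=1)

print(joined)
print(pasted)   # the duplicated "region" column makes the alignment easy to eyeball
```

The pasted version keeps both key columns visible, which is exactly what makes it easy to verify by eyeballing; it does, of course, rely on the two tables covering the same rows.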
Our interviews had many more anecdotes of this form. We could fit only so many into the paper, and even fewer into this blog post. But we argue that these rich, human-readable tables warrant both further study and increased support from our analytics tools. We all too often view work in spreadsheets as a mere stepping stone to more complex or automated data analytics workflows, but people have good reasons for using them. Until our more “advanced” tools can offer the same kind of flexibility, ease of use, and human legibility as a person directly reading and editing a data table, there’s a limit to the kind of data work we can support.