Data Guards: Challenges and Solutions for Fostering Trust in Data
How I learned to stop worrying and trust a misleading visualization built on unvalidated data sources with conflicting calculations and legacy semantics
This blog discusses some of the data trust work conducted by Nicole Sultanum and Dennis Bromley of Tableau Research and Michael Correll of Northeastern University. The original paper can be read in its entirety on the Tableau Research site.
You are an investment manager in charge of allocating resources across different sectors—real estate, tech, cheese, etc. Your junior analysts have been diligently creating charts and graphs for you showing that tech is down and cheese is up. So, do you believe it? Is cheese really the future? The chart says it’s a no-brainer, but do you trust the chart? How about the chart makers? Or the underlying data? How about yourself and your ability to interpret the chart?
When we talk about data trust, at least in this instance, we aren’t talking about malicious actors or security breaches; we are talking about confidence in your knowledge and understanding, and belief that you know what’s going on and that your decisions will be built on a solid foundation of factual truth.
This is also not just about correcting bugs in an ETL pipeline. The data could be perfect (is that even possible?) but the viz that shows you the data could be misleading. Or the viz and the presentation are correct but they use statistics that you just aren’t familiar with and thus cannot accurately interpret. At the end of the day, multiple factors can impact your interpretation of data and all of them must be trusted—i.e. confidently understandable—if you are to feel confident in your final decision.
The unreasonable effectiveness of being forced to make a decision
Our job at Tableau is to make the most useful and compelling interactive visual analytics tools in the world. But our customers’ job is infinitely harder: to make the best decisions possible given the potentially incomplete and confusing information available to them. For both its failures and its successes, this is how business has always been done. But what data can do—what we can do with data—is make that decision-making process more accurate, more efficient, and less stress-inducing.
I totally trust what my data pipeline is telling me.
We became interested in this notion of trust and how it impacts people’s use of data. To this end, we undertook the Data Guards project, a user-centered investigation into the importance of trust to data users. We talked to a lot of people during this project—data scientists, dashboard creators, business decision makers—to understand how trust factored into their data work. Notably, not a single person said they unilaterally trusted their data pipeline, and all of them cited barriers that would have to be overcome before that could happen.
Barriers to trust
We asked people what stood in the way of fully trusting the data they used. Several threads emerged that underscored common barriers to trust. We share these barriers in the form of (somewhat hypothetical) scenarios.
- Data is super context dependent: You’re building a fifty-story office building and you would love for it not to fall over. But the boutique dashboard consultant you hired tucked all the “boring” engineering data you need to actually make the building safe into a hidden tab. It's not clear that they know how important these values are, and now you are questioning whether there is anything else they failed to capture in the vizzes they created.
- Detecting issues requires a discerning eye: You’re an experienced manager taking a look at your business expenses and something seems off—your nose tells you that this doesn’t feel like all the data you’ve seen before. Meanwhile, your rookie partner charges forward.
- Data trust builds on human relationships: You’ve been handed two charts. One of them was generated from an abandoned viz pipeline that was created by previous employees that you never met. The second chart was made by your friend who you know to be a knowledgeable domain expert with a track record of making accurate charts with clear takeaways. You naturally trust the second chart more than the first.
- Trust is hard to build, easy to lose: You are handed a new quarterly earnings report. Last quarter you presented the report to the CEO only to find out a day later that the numbers were wrong due to an ETL pipeline mistake. Recovering from this was difficult and embarrassing.
- Data definitions are often unclear or ambiguous: Year-end bonuses for everyone come out of a single pot of money based on how many ‘customer contacts’ they’ve logged. However, your bonus is calculated using one database and the newly acquired company is calculated using another. Moreover, their definition of ‘customer contact’ includes phone calls whereas yours requires in-person conversations. You feel that money is being unfairly allocated based on mismatching definitions of the same goal.
- Environments change and processes break: An important ETL pipeline has been in place for a long time. Last week your company moved over to a new database system that required reloading the data and rewriting many of the calculations. Values are no longer adding up.
Some (kind of) specific ideas
As we reflected on these challenges, mitigation strategies converged on aspects of improved communication: better capturing domain knowledge; providing more transparency over the ETL pipeline; helping all different stakeholders talk to each other more; and being proactive about anticipating issues.
Inspired by these findings and scenarios, we conceived and articulated seven different solution themes for possible trust-engendering tools, broken out into three different categories: data overview, data details, and data community. These ideas address the trust barriers above and have relatively straightforward paths to implementation.
Data overview
- Data and pipeline tests: These tests are comparable to software regression tests; the author describes some boolean test condition and the data either passes that condition or it doesn’t. If a test doesn't pass, everyone consuming the visualization is made aware that there may be an underlying data quality issue at play—for example, if percentages don't sum to 100%, or if there is a "February 30" date lurking in the time series. Tests could also be highly contextual and domain-relevant, like temperatures for a particular location falling outside of pre-established ranges. These tests could address any part of the data pipeline from ETL to visualization, providing structured means to capture domain knowledge and deterministic ways to assess data quality.
- Data quality agent: A data quality agent is an agent that proactively tells you if something "smells." This could point out data outliers, unusual amounts of null values, strangely high numbers, and other nuanced, experience-based assessments that are not as easily captured by boolean tests. Some numbers may be very domain-specific (how many classes should a student have every day?) and some may be mostly common sense (a public school probably never deals in billions of anything). Other things may be categorically wrong, e.g. two calculations with the same name that do different things. This could use AI/ML, but a simple and transparent algorithm might be a better choice.
- Data and pipeline update alerts: Upstream data changes can often break downstream values. If a user doesn’t know that those changes happened, they might wake up one morning and find that their data world is broken. Or worse, the changes are subtle enough that no one notices and everyone charges forward with incorrect data. Big idea: Let people know if something upstream changes. There may be conditions that you don’t know about that are critically important for downstream users.
Data details
- Explanation and status: People are unlikely to trust a visualization or dashboard if they have only a superficial understanding of what it’s doing. Dashboard explanation and status is about efficiently onboarding someone to a new dashboard. Communicate its purpose, what it's trying to show, how it’s handling the data, and what the dashboard can—and cannot—tell them. Tell them when and under what conditions it was made. Give them references to resources and people that can help them with any questions.
- Data traces: Data traces is about uncovering and understanding the provenance of a data point or data slice. Where did that outlier come from? Why are these values null? Why does this data smell funny? One way to communicate this is via an abstracted narrative of the ETL pipeline: or in other words, what transformations did the data go through before it reached the viz endpoint.
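The "abstracted narrative of the ETL pipeline" idea above can be sketched as a pipeline that records what each step did to the data. This is a minimal hypothetical example—the step names and transformations are invented for illustration:

```python
# Each pipeline step logs its effect, so a consumer can later ask
# how a value (or a missing value) reached the viz endpoint.
trace = []

def step(name, fn, rows):
    """Run one pipeline stage and record its effect on row counts."""
    out = fn(rows)
    trace.append({"step": name, "rows_in": len(rows), "rows_out": len(out)})
    return out

rows = [{"amount": a} for a in (120, None, 95, 4000, 88)]
rows = step("drop_nulls", lambda rs: [r for r in rs if r["amount"] is not None], rows)
rows = step("cap_outliers", lambda rs: [r for r in rs if r["amount"] < 1000], rows)

# The abstracted narrative a dashboard could surface next to a data point:
for t in trace:
    print(f'{t["step"]}: {t["rows_in"]} rows in, {t["rows_out"]} rows out')
```

Surfacing even this coarse a trace answers questions like "why is this row missing?" without requiring the viewer to read the pipeline code itself.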
Data community
- Stamp of approval: This is the oldest one in the book: if someone you trust feels that some data is trustworthy, you will in turn feel a lot better about trusting it. While certified data sources are not new per se, it turns out that a lot of organizations simply don’t use them, which causes a lot of downstream consternation. Surfacing the individuals who are responsible for aspects of the ETL pipeline—and making it easier to reach them (or their orgs) when something goes wrong—would be a way to harness some of that interpersonal trust back into the data and pipelines.
- Crowd wisdom: We don’t always have a single trusted certifier. But we sometimes have a community of people in the viz endpoints, such as dashboard users, who can collectively weigh in and help us establish trust. The idea is capturing that wisdom somehow—in the form of comments, annotations, and other documented traces. If all the senior people in your team or group think something is trustworthy, it’s a pretty good bet you can trust it. And if you do find something questionable, you have a persistent and shareable forum in which to discuss it.
Survey says…
We reached back out to 10 of our contributors for feedback on these seven ideas. In general, people liked them! The figure below shows the seven ideas and how each contributor (C01, C02, etc.) stack ranked them. The ideas are ordered here by the most #1 votes, then the most #2 votes, and so on.
A couple of takeaways were:
- Every idea was voted into the top two by at least one person. There were no air balls here.
- Stamp of approval was the most popular. People want to know that a trusted expert has signed off on a data set, relieving them of the responsibility of constant vigilance. It shows how important interpersonal trust still is. Or perhaps it reflects how overlooked interpersonal trust is as a trust need, given that there is virtually no tool support to mediate these interchanges.
Summary
We can’t just throw data over the visualization fence and call it a day. The last mile between data presentation and decision making is trust, and decision making is why people enlist data visualization in the first place. To close the deal, we need to help them trust what we are showing them.
If you’re able to join us (virtually) at the IEEE VIS 2024 Conference, we’ll be presenting this work along with several other Tableau Research projects. Please connect with us, we’d love to hear from you!