Why Data Visualization Is One of the Hardest but Most Important Tasks
Data visualization is an important part of a data scientist’s work. A picture is worth a thousand words! Even though we live in an era of big data and data-driven decisions, without data visualizations we’d be largely lost, scrolling through endless rows of numbers in spreadsheets. After all, less than 20% of the world’s population was literate only 200 years ago, but images have been powerful communication tools for tens of thousands of years.
Even in today’s highly literate world, most people still struggle to grasp the relationships hidden in tables full of numbers. In this year’s State of Data Science survey, only 52% of respondents said that their organization’s decision-makers were mostly data literate. Against this backdrop, it makes sense that data visualizations can play a key role in communicating important concepts and trends. Even though data visualizations are a relatively recent development, history has shown their potential for significant impact on our lives. Perhaps that’s why data scientists today say they spend 15% of their time on visualizations.
At the same time, data visualizations aren’t a one-stop, silver-bullet solution for communicating the findings of data science work. Even for experienced practitioners, they can be tricky to do well. And because charts and graphics lodge so effectively in our brains, a visualization that doesn’t accurately or effectively represent the underlying data can have negative consequences.
To better understand where things can go wrong with data visualizations—and how to avoid those pitfalls—it’s helpful to break data viz down into two main types. The first is exploratory visualization: this is when data scientists use visualization tools in order to better understand their data, prior to drawing any conclusions or conducting a full analysis. The second is narrative visualization: this is what data scientists use to present or communicate their findings to a wider audience. Each type has its own potential challenges.
Exploratory data visualizations
With the advent of big data, exploratory data visualization became a key part of the data science process. When dealing with hundreds of thousands, millions, or billions of data points, it becomes impossible for an individual to find patterns and distributions in the dataset by simply staring at the individual values or by computing simple statistics. As a result, practitioners turn to data viz in order to grapple with their data and explore further analytical avenues. Unfortunately, traditional viz methods require careful manual tweaking to avoid problems like overplotting, undersaturation, and oversaturation, and manual tweaks are difficult to do safely when your only way to understand the data is through the visualization itself! If you’re using traditional viz techniques for exploratory analysis of big data, plotting problems that are just annoying for small data can lead you to completely incorrect conclusions for big data.
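To make the overplotting problem concrete, here is a minimal sketch in plain Python, with no plotting library involved. It scatters points onto a small pixel grid and compares what fully opaque markers would show (a pixel is either inked or not) with the per-pixel counts. The function names, grid size, and point counts are all hypothetical choices for illustration.

```python
import random


def rasterize(points, width, height):
    """Count how many points land in each pixel of a width x height grid.

    Points are (x, y) pairs with both coordinates in [0, 1].
    """
    counts = [[0] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)   # clamp x == 1.0 into the last column
        row = min(int(y * height), height - 1)
        counts[row][col] += 1
    return counts


def opaque_view(counts):
    """What fully opaque markers show: 1 if any point hit the pixel, else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


rng = random.Random(0)
# Hypothetical dataset: the right half is 50 times denser than the left half.
sparse = [(rng.uniform(0.0, 0.5), rng.random()) for _ in range(4_000)]
dense = [(rng.uniform(0.5, 1.0), rng.random()) for _ in range(200_000)]

counts = rasterize(sparse + dense, width=20, height=20)
opaque = opaque_view(counts)

# Compare the two halves of the grid: inked pixels versus actual point totals.
left_ink = sum(row[c] for row in opaque for c in range(10))
right_ink = sum(row[c] for row in opaque for c in range(10, 20))
left_total = sum(row[c] for row in counts for c in range(10))
right_total = sum(row[c] for row in counts for c in range(10, 20))

print("inked pixels (left, right):", left_ink, right_ink)
print("point totals (left, right):", left_total, right_total)
```

With this many points per pixel, both halves should come out fully inked, so the opaque view hides a 50-fold difference in density that the counts make obvious. Lowering the marker opacity by hand can reveal the difference, but only if you already know what range of densities to tune for.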
To avoid these issues, it’s important to be cognizant of the limitations of any tools you’re using with exploratory data visualizations and to think critically about how the display of the information is impacting your assessment of it. One of the reasons why we created Datashader several years ago was to help tackle this challenge. Datashader is an open-source library that automatically creates accurate representations of datasets of any size, without the need for manually tuning parameters like transparency or point sizes. Our goal with Datashader is to let practitioners explore their large datasets as they truly are, warts (well, outliers!) and all, without having to know what they’re looking for at the start.
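As a rough illustration of why automatic scaling matters, the sketch below contrasts two ways of turning per-pixel counts into shades. This is not Datashader's API or implementation; it is a simplified, rank-based stand-in for the general idea of histogram equalization, and the function names `linear_shades` and `rank_shades` are hypothetical.

```python
def linear_shades(counts, levels=256):
    """Scale counts linearly so the largest count maps to the brightest shade."""
    peak = max(max(row) for row in counts)
    if peak == 0:
        return [[0] * len(row) for row in counts]
    return [[round(c * (levels - 1) / peak) for c in row] for row in counts]


def rank_shades(counts, levels=256):
    """Shade each nonzero count by its rank among the distinct nonzero counts.

    Empty pixels stay at 0; occupied pixels are spread over 1..levels-1, so a
    single extreme value cannot push every other pixel down to zero.
    """
    distinct = sorted({c for row in counts for c in row if c > 0})
    if not distinct:
        return [[0] * len(row) for row in counts]
    if len(distinct) == 1:
        # Only one occupied level: give it the brightest shade.
        return [[levels - 1 if c > 0 else 0 for c in row] for row in counts]
    rank = {c: i for i, c in enumerate(distinct)}
    top = len(distinct) - 1

    def shade(c):
        return 0 if c == 0 else 1 + round(rank[c] * (levels - 2) / top)

    return [[shade(c) for c in row] for row in counts]


# Hypothetical pixel counts: a handful of small values plus one huge outlier.
counts = [[0, 1, 2],
          [3, 1_000_000, 5]]

print("linear:", linear_shades(counts))
print("ranked:", rank_shades(counts))
```

With linear scaling, the outlier pixel takes the full brightness and every small count rounds down to the same shade as an empty pixel. The rank-based mapping keeps the small counts visible and ordered without anyone choosing a threshold, which is the property that matters when you do not yet know what the data looks like.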
Whether you use Datashader or another tool, the key to avoiding problems with exploratory data visualization is to make sure that you’re not squeezing the texture and nuance out of your data as you subconsciously try to make it match a preconceived hypothesis.
Narrative data visualizations
These types of data visualizations are what non-practitioners would be most familiar with, as they include the charts and graphs we see regularly to explain data-based trends and patterns. Data visualizations like these are incredibly powerful tools for communicating information, which is what makes it so important to do them well.
A great narrative data visualization communicates the author’s understanding of the data precisely, while also conveying any uncertainties or limitations in the data. It’s not the sole output or end product of a data discussion, but rather a part of the larger conversation. This type of data viz should indicate the analysis performed, the assumptions made, the data considered, and the data not included. When constructing a narrative data visualization, it’s important to think not just about your own experience but also about the end user. For instance, they may be colorblind and unable to distinguish red from green. Consider, too, the connotations that colors carry for your audience: in the U.S., many viewers will associate green with “good.”
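One simple sanity check for the red-versus-green problem is to ask whether two chart colors differ in lightness as well as hue. The sketch below uses the relative luminance and contrast ratio formulas from the WCAG accessibility guidelines. Luminance contrast is only a rough proxy, not a simulation of color vision deficiency, and the helper names and the 1.5 threshold are hypothetical choices for illustration.

```python
from itertools import combinations


def relative_luminance(hex_color):
    """WCAG relative luminance of an sRGB color given as '#RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding before weighting the channels.
    r, g, b = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio: 1.0 for identical colors, about 21 for black on white."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def low_contrast_pairs(palette, min_ratio=1.5):
    """Return pairs of series names whose colors are close in luminance.

    min_ratio is an arbitrary illustrative threshold, not a standard.
    """
    return [
        (name_a, name_b)
        for (name_a, col_a), (name_b, col_b) in combinations(palette.items(), 2)
        if contrast_ratio(col_a, col_b) < min_ratio
    ]


# A common red/green pairing, and a blue/orange pairing from the Okabe-Ito palette.
red_green = {"good": "#008000", "bad": "#FF0000"}
blue_orange = {"good": "#0072B2", "bad": "#E69F00"}

print("red/green flagged:", low_contrast_pairs(red_green))
print("blue/orange flagged:", low_contrast_pairs(blue_orange))
```

A pair that gets flagged relies almost entirely on hue to separate the two series, which is exactly what fails for a red-green colorblind viewer. A check like this complements, rather than replaces, testing the chart with a proper color vision simulator or with real users.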
Being intentional with narrative data visualizations is especially important given the myriad ways to make today’s graphs look visually appealing, such as with fancy graphics or animations. It’s essential not to skip over a quality data-science process in the quest to create these visualizations. There are countless ways today to put lipstick on a pig when it comes to data viz, but if the core data or assumptions are flawed, then it doesn’t matter how good the chart looks. Accuracy shouldn’t be sacrificed for the sake of a clear, visual story.
Data visualization as a component of ethics
As data scientists, we have the power to help shape business decisions, public policy, medical research, and other essential areas of daily life. It’s incumbent on us to practice our craft responsibly and ethically, and that includes the data visualization process. To the best of our ability, we need to ensure our visualizations make clear any assumptions or biases that might be baked into our results, and that they support viewers in asking further questions, rather than serving as a “period” on any discussion. Whether exploratory or narrative in purpose, data visualizations will fundamentally anchor the way the data and topic are viewed, so if it’s worth making a chart in the first place, it’s worth taking the time to do it right.