5 Routes for Going from Zero to Viz in Data Science



Exploration is a big part of my maker journey in visual analytics. When it comes to data, I ask a lot of what, when, where, and how questions. In this post, you’re invited to join me in exploring how to leverage the Anaconda data science platform as a playground for visualizing data with Python, from your installation on day one to an extensible toolbox on day 565 and beyond. This tour is more meandering than linear, but it is deliberately plotted around the opportunities that come with including visual tooling in your work.

Visual analysis is a powerful necessity for human advancement in a world abuzz with data, and Anaconda is a doorway through which many students and professionals access these tools. You may be thinking either, “What more is there to visualizing data that I don’t already know,” or maybe you’re in the camp of, “Yes, please! I’m curious about open-source visual analytics tools.” Either way, I’ve been thinking about you, and how you can be more visual in your analytical work (and spend more time getting high fives for your work and less time reading numbers).

Read on to learn:

 

  1. Where does visualization fit in? What kinds of questions are fit for visual analysis?

  2. How we teach and support students to continue learning.

  3. Community resources to help you find visual analytics tools.

  4. Quick demos of a handful of open-source tools.

 

Each cited package was handpicked from the many Python and R options that are rich in blueprints for visual analysis. For each one, you will see how to do something exceptional. My intention is to encourage you to explore these and more tools yourself. Think of this as the campus tour you never had, led by a dual major in the arts and industry schools.

 

You succeed if your curiosity is stoked enough to pause and try something new. You may find just what you’ve always needed, or learn to appreciate your favorite tool in a new light. Celebrate your Anaconda birthdays by trying new tools to visualize data!

Where Does Visualization Fit In? And When?

Short answer? Everywhere. Often.

Data visualization is useful for your own learning, to show confidence levels for the results of tests and experiments, and for communicating those results to other people. Before you load your first or next Python or R programming package, ask yourself what the data look like. Today, more people in more roles interact with data science. The old saying, “show, don’t tell,” is more useful than ever for effective data-informed learning, with more eyes on the data and more hands stirring the pot. To learn with data, communicate as teams, and find and reduce bias[1] as a larger society, we need tooling to decode data in every part of a closed-loop system. Data, and the instruments and models that produce it, belong in our common field of vision (and outside black boxes where bias can grow unchecked in the shadows).
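Asking what the data look like doesn’t even require a plotting library. As an illustration (with made-up numbers standing in for your data), a few lines of standard-library Python can sketch a quick text histogram before you commit to any tooling:

```python
from collections import Counter

# Hypothetical daily counts -- stand-in data for illustration only
observations = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 9]

# A quick text histogram: one bar of '#' marks per distinct value
counts = Counter(observations)
for value in sorted(counts):
    print(f"{value:>3} | {'#' * counts[value]}")
```

Even this crude picture surfaces the shape of the distribution, where values cluster, and the stray high value at a glance, which is exactly the kind of first look the richer tools below make effortless.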

This is where visualization fits in. It’s of critical value with machine learning model selection, interpretability, explainability, and observability. Technologies like computer vision have shown the potential to outperform humans in tasks like image identification since as long ago as 2015[17], but the value of any intelligence to us is still bound by our understanding and trust. Alongside the pursuit of artificial cognition, perhaps one that combines the best of purely neural and symbolic systems, we continue to develop ways to help more people understand how AI works. Often simple charts help us see into the most complex network models.

“In the end, perhaps the greatest lesson historical plaster mathematical models can teach us today is to remind us that in our race to computerize our world, sometimes the digital world is not always better than the analog one it replaces.”

Kalev Leetaru, “Scientific Visualizations For Teaching Used To Mean Plaster Models” (2019)[19]

Big talk aside, wherever you work along the arc of analytical study and experimentation from business to genomics, you can channel what we know about the processing superpowers of the human visual system. With each new research study[9], we learn more about how our physiological systems function for perception and cognition, but there is persistent recognition that looking at data has intrinsic value in any analytical process.

Anaconda’s team has echoed the point that the human brain processes images quickly and accurately, and it’s no surprise you’ll find a slew of packages with plotting capabilities baked into the thousands you get access to on the Anaconda platform. From essential comparative charts to statistical analyses, predictive analytics, machine learning, and artificial intelligence, a toolkit is neither mature nor lean without graphs or plots of some kind. Prove me wrong? Selfishly, I’m confident we’ll both learn either way.

Visual Questions to Ask

To apply this superpower more concretely to data use cases, data visualization can pinpoint the next question worth spending your time on and raise certainty around findings faster than wading through bodies of text or culling through numbers. There are so many ways to probe data visually.

  1. Exploratory: What attributes correlate to the best outcomes that you want to replicate? What does the distribution of the dataset look like?

  2. Compare and correlate: What performed better? Where—or when—did a feature change? Did anything else change in one direction or the other? Is there a linear pattern to the relationship that’s as easy to draw a line from as a connect-the-dots coloring book?

  3. Spatial, temporal, or both: Where did it start? How far will it go? When, and how frequently?

  4. Meta: Are missing values bundled or spread throughout a column? Are there significant gaps in data collection that could cause trends to be misinterpreted?

  5. Statistical: What level of confidence does my work merit? How large is the spread? Are mean or median values more illustrative of what occurs in each particular dataset, on average? How does the data skew, and if you threw a net around it where would you bag roughly half of it? What outliers are kept out of the loop?

Start your own list!
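Several of the statistical questions above can be answered in a few lines before any chart is drawn. Here is a minimal sketch using only Python’s standard-library `statistics` module, with hypothetical response times standing in for your data:

```python
import statistics

# Hypothetical response times in ms -- illustrative data with one outlier
data = [12, 14, 15, 15, 16, 18, 21, 24, 95]

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # robust to the outlier
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles: the "net" around the middle half

print(f"mean={mean:.1f} median={median} middle-half net=({q1}, {q3})")
```

Comparing the mean against the median reveals the skew, and the interquartile range (q1 to q3) is exactly the net that bags roughly half of the data; a box plot is simply this summary drawn as a picture.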

People are quite good at using shapes, clocks, and geographic maps to answer everyday questions. These expressions and projections of data have the potential to span cultures and languages with fairly small degrees of variation compared to words and numbers. With consideration for your audience and appropriate sensitivity to differences, such as with how colors or directionality on the screen are interpreted, you can limit confusion and increase the chances they find the answers they need.

Using Anaconda already? Read on ahead!

Before you start:

To run the code examples as you read along, I have presumed that you have installed Anaconda Distribution. For nostalgia’s sake, this is the option formerly known as Individual Edition. It’s easy to install and completely free of cost. If you have any questions while getting set up, don’t be discouraged—there’s a community to support you!

If you’re a minimalist and don’t want all the many packages on your machine just yet, or prefer a more custom, modular approach to installing packages, start instead by installing Miniconda. I applaud your discerning taste. With Miniconda, you’ll need to install more of the packages used in these examples as you go.

 

Most of the demos that accompany this article are shown in a Jupyter Notebook. An interactive computing notebook suits my stream-of-consciousness learning style and a hoarding tendency with notes. The code itself is portable to your command-line interface (CLI) or integrated development environment (IDE), for the most part.

Managing environments with conda is another topic I suggest you explore, but I’ve left it outside the scope of going from zero to viz to avoid unnecessary friction. Virtual environments specific to each task, or a sandbox for example, separate your work, help limit conflicts among package dependencies, and even keep versions of Python separate so your archived projects don’t break when you upgrade.

Setting up an environment doesn’t have to be difficult though, and can make it easy to scrap everything and start fresh if you decide you won’t use a package you try. The basic commands can be run in a Shell/Terminal or command-line interface. For an in-depth guide, see the link above.

`conda create --name vizenv`

Then, after confirming with `y` to create the environment, activate it:

`conda activate vizenv`

However you decide to start, get set up first to get the most from what I’ve made for you. 💌

Eyes on How We Learn

Now that you have some direction, where should you start? There is no best-in-class visualization package. Not in Python, or R, nor is there a one-size-fits-all tool in the broader visual analytics landscape. What’s important is finding what works for your needs: tools that are close at hand to your process, are flexible and fast enough to embed in your workflow without friction, and which in the context of Anaconda, integrate with existing programming libraries you rely on.

Data science throws a wide net around many disciplines. For each problem to solve, there are many suitable projects built around plotting data in the Python ecosystem alone. The Anaconda platform offers access to many open-source software projects with visualization and other imaging capabilities, but often as students when we install our first distribution of Python or R, not enough time and attention are spent on exploring all the options included to visualize data in common or distinctive circumstances. This is your chance to take another look at a graphing library you haven’t used or try a component built for a specific use case that could have potential in other areas.

Learning as a Graph, Not a Linear Path

If you’re teaching visualization, how can you encourage time for exploration? Sure, sufficient time in one tool is important to put concepts to code, and Matplotlib and ggplot2 are great tools. Learning concepts and code simultaneously is tough, and a lot of additional code-switching can slow you down.

With a consistently structured interpreted language like Python, however, it’s also attainable to learn from exposure to more than one project. In debugging code, for example, it’s invaluable to have a broad understanding of how projects are structured from top to bottom, what’s unique and what’s common, what tools perform some tasks very well, and what the others do better. This kind of rigorous curiosity comes with trying many packages and tools.

“That there is no such thing as the scientific method, one might easily discover by asking several scientists to define it. One would find, I am sure, that no two of them would exactly agree. Indeed, no two scientists work and think in just the same ways.”

Joel Henry Hildebrand (1985) “Science in the Making”, Praeger Pub Text [2,3]

It might seem odd to quote a chemist here, especially since the death of linear plots[4] like the Benesi–Hildebrand plot that shares his name, which fitted data just fine in their time. Even if he had not been called “a genius in finding ways to present data so that they fall on a straight line,”[3] and even if his roots and education had not been in and around Philadelphia (I am local and partial to Philly), Dr. Hildebrand’s scientific approach and teaching are remarkable here for his relentlessly questioning mind. That is just the sort of scrutiny to apply when selecting tools, not just when analyzing data.

Do you have a set of tests for what makes a choice tool for you? Is it available as an extension to a scientific computing library you use? Will it be a remedy to dizzying tables of numbers (or perhaps it excels at styling tabular displays of information for faster reading? Tables still have a tried-and-true place in the visual canon!)? If you don’t find something you need in this post, you may find ideas or a cookie-cutter to help you build up your corner of the open-source maker community. After all, invention grows out of need and questioning minds.

 

This post is continued on Anaconda Nucleus.

 

References

(Please continue to Anaconda Nucleus for additional resources.)

[1] Dougherty, Jack, and Ilyankou, Ilya, “Hands-On Data Visualization: Interactive Storytelling from Spreadsheets to Code,” O’Reilly, Apr 4, 2022, https://handsondataviz.org/

[2] Hildebrand, Joel Henry, “Science in the Making”, Praeger Pub Text, March 5, 1985.

[3] Pitzer, K. S., “Joel Henry Hildebrand 1881–1983: A Biographical Memoir by Kenneth S. Pitzer”, National Academy of Sciences, 1993, https://www.nasonline.org/publications/biographical-memoirs/memoir-pdfs/hildebrand-joel.pdf

[4] Hibbert, D. Brynn, Thordarson, Pall, The death of the Job plot, transparency, open science and online tools, uncertainty estimation methods and other developments in supramolecular chemistry data analysis, Royal Society of Chemistry, Aug 25, 2016, https://pubs.rsc.org/en/content/articlehtml/2016/cc/c6cc03888c

[9] Various authors, “perception and cognition” articles since 2022 search results, via Google Scholar, accessed May 9, 2022, https://scholar.google.com/scholar?as_ylo=2022&q=%22perception+and+cognition%22

[17] Pohl, Margit, Wallner, G., Kriglstein, S., “Using lag-sequential analysis for understanding interaction sequences in visualizations”, Science Direct, Aug 11, 2016, https://www.sciencedirect.com/science/article/abs/pii/S1071581916300829

[19] Leetaru, Kalev, Scientific Visualizations For Teaching Used To Mean Plaster Models, Forbes, Apr 20, 2019, https://www.forbes.com/sites/kalevleetaru/2019/04/20/scientific-visualizations-for-teaching-used-to-mean-plaster-models/?sh=23626a1f7ae2


About the Author

Kathryn Hurchla is a data developer and designer at home shaping human experiences as an Analytics Lead with F Λ N T Λ S Y, a design agency like no other. She has a master’s degree in data analytics and visualization and enjoys building end-to-end analytic applications and writing about visual data science. You can find her lost in exploratory data analysis. She contributes to open-source technology communities as a Plotly Dash Ambassador, by leading hands-on learning, and by publishing content independently and with Data Visualization Society’s Nightingale Editorial Committee. Her own enterprise Data Design Dimension may one day be just what her daughters need to make the world as they see it too. Her words are not a reflection of her employer.

About the Maker Blog Series

Anaconda is amplifying the voices of some of its most active and cherished community members in a monthly blog series. If you’re a Maker who has been looking for a chance to tell your story, elaborate on a favorite project, educate your peers, and build your personal brand, consider submitting an abstract. For more details and to access a wealth of educational data science resources and forums, visit Anaconda Nucleus.
