Behind the Code of Dask and pandas: Q&A with Tom Augspurger
Nov 13, 2020 | By Kelly Davis-Felner
Data science and related fields have been born in and pushed forward by open-source projects. Open-source communities allow people to work together to solve larger problems. As stewards of the data science community, we believe it is important to go behind the lines of code to shine a light on those doing the work in open source. In a series of blogs, we’ll highlight several Anaconda employees, the open-source projects they work on, and how their work is making an impact on the larger field.
Tom Augspurger has been a data scientist at Anaconda for almost four years. He maintains two open-source projects: Dask and pandas.
Q: What projects do you currently work on?
I focus on Dask and pandas.
pandas provides a DataFrame for working with tabular data, like you might find in an Excel spreadsheet or a database table. This is the first thing people turn to when they’re doing any kind of tabular data analysis. It’s a very popular project with tons of users and contributors. Maintenance of this project involves balancing between responding to users’ needs, doing the things we’d like to change and improve, and creating a welcoming space for new users who would like to get involved with open source.
Dask is a parallel programming library. The main reason Dask exists is to take libraries like pandas and NumPy, which are mainly focused on data sets that fit on a single machine, and scale them out to bigger problems. Dask has a Dask DataFrame component that lets you have a DataFrame on a cluster of machines. It lets you build things like a parallel distributed DataFrame or array.
Q: How did you get involved with each of these?
For pandas, I was in graduate school studying economics and had to do data analysis for a couple of research papers. They got us started on MATLAB, another programming language. I enjoyed the programming side of things, but I didn’t particularly like using MATLAB. I started looking around and found Python. I found pandas in particular.
I started to use those libraries, and 6-12 months later, I started to contribute. My first contributions were very small, and I messed them up horribly. The maintainers were very helpful in getting me up to speed; I gradually became more involved with contributing. I left grad school and got a job where I used pandas quite a bit and was still contributing on the side.
That’s about the time when Dask was first getting started. It was started by Matt Rocklin, Jim Crist, and a few others at Anaconda. I started using and contributing to Dask — it was exactly what I needed to scale out my workflows. About 3.5 years ago, I was hired by Anaconda, so I get to do pandas and Dask maintenance as a full-time job.
Q: What is your role within these projects?
pandas doesn’t have much funded support. My most important goal right now is to find funding. For example, another maintainer and I wrote a proposal to the Chan Zuckerberg Initiative, which has an open-source sustainability program. That got us funding for a year. Enabling others to work on pandas is the most valuable thing I am doing on the project. Outside of that, I help keep the project running: reviewing contributions, ensuring tests are passing, and driving discussion about and working on bigger changes. We have a lot of people involved with pandas who help maintain the community.
Dask has had better funding since the beginning because the maintainers have prioritized it. My main role focuses on the day-to-day maintenance, specifically of Dask-ML.
Q: What contribution are you most proud of?
For pandas, it would be the extension array interface. This came out of Anaconda; we had a client who wanted to store IP addresses in DataFrames. You can’t do that in pandas; we don’t have a data type for IP addresses. We had a hacky, internal way of storing non-NumPy data types in DataFrames and Series that we used for more exotic data types. So, I wrote up a proposal outlining what the client wanted to do to see what the pandas community thought. The feedback was that this wasn’t appropriate, but everyone was open to the idea of defining an interface so that another package could define this new data type for storing IP addresses, and that could go inside of a DataFrame. That led to two things: the interface letting third-party packages do this and then a package called Cyberpandas, which lets you store IP addresses inside of DataFrames.
With Dask, I am most proud of Dask-ML, which is a library for parallel and distributed machine learning. Dask DataFrame gives you large, pandas-like DataFrames. Dask-ML allows you to do scikit-learn type machine learning scaled out to larger datasets on a cluster of machines. It fills a useful place in the ecosystem, though it is used far less than Dask or pandas. It’s been fun for me to work on. It’s the type of thing I wish I had had 5-10 years ago in previous jobs.
Q: What are you working on now that you are most excited to release?
pandas has a longstanding issue with duplicate labels. If you have bad data with duplicates, that can completely blow up downstream operations. Duplicate labels will often change the size or result type of certain operations. So in the past, you had to make an assertion at every stage that you didn’t have any duplicates. In this next release, we’ll have a way to say that a DataFrame just can’t have duplicate labels. Then, pandas will make sure that every time you do an operation, we carry through that flag. If duplicates are introduced for whatever reason, it will inform you and raise an exception, rather than silently proceeding with the potentially incorrect results.
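A minimal sketch of how this can look, assuming the feature described here is the one that shipped in pandas 1.2 as `set_flags(allows_duplicate_labels=False)`. The data and labels are made up for illustration.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# Opt in: this Series is not allowed to have duplicate index labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"]).set_flags(
    allows_duplicate_labels=False
)

# The flag is carried through to the result of an operation.
flag_kept = s.head(2).flags.allows_duplicate_labels is False

# Renaming "b" to "a" would create a duplicate label, so pandas raises
# instead of silently returning a result with duplicates.
try:
    s.rename({"b": "a"})
    raised = False
except DuplicateLabelError:
    raised = True

print(flag_kept, raised)
```

The point of the flag is that the check happens once, where the data is declared, rather than as a manual assertion after every step.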
For Dask, there’s a big effort underway to improve the performance. The original motivation was to work with large data sets. We’re getting to the point where people have large enough workloads on these data sets that Dask itself can be the bottleneck. There’s a big effort to improve the performance of the scheduler on these large workloads.
Q: What do you envision for this project a year from now?
pandas is in the middle of a long transition. Extension types are going to become the default. They started out as a way to work around weird issues, but they’re becoming the core of the library. This is primarily because they offer better ways of working with missing data. Extension arrays could be everywhere in a couple years.
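To illustrate the missing-data point, here is a small sketch comparing a NumPy-backed column with the nullable `Int64` extension dtype. The values are made up for illustration.

```python
import pandas as pd

# With the default NumPy-backed storage, a missing value forces the
# integers to be stored as floats, with NaN marking the gap.
numpy_backed = pd.Series([1, 2, None])

# The nullable "Int64" extension dtype keeps the values as integers and
# marks the gap with pd.NA.
extension_backed = pd.Series([1, 2, None], dtype="Int64")

print(numpy_backed.dtype)
print(extension_backed.dtype)
print(extension_backed.isna().tolist())
```

In the extension-backed Series the integer type survives the missing value, which is the kind of behavior that makes these types attractive as defaults.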
With Dask, we have a collaboration with Pangeo, a community of geoscientists who want to do analysis of large data sets on the cloud and on High Performance Computing systems. This group is able to do things that others aren’t because they’ve done the work to deeply integrate with Dask. I’m interested in replicating that collaboration and experience in other domains, like life sciences and physics. I’d like to get a group of like-minded people together to discuss similar problems and come up with solutions for them. Dask also won a CZI grant, so we’re hiring someone to do some of this work in life sciences.
Q: In your mind, what is the value of open-source projects?
By contributing to these libraries, you become a better user. The better you understand a library, the better you can put it to good use. It’s also a valuable way to show real-world experience when looking for a job.
Q: Why should companies provide opportunities for employees to be involved with open-source projects?
From a company perspective, you’re rarely going to be able to build something better on your own than what the community as a whole can achieve. What makes the Python ecosystem so powerful is the interoperability between the major packages. NumPy could stand on its own, but Xarray builds heavily on top of pandas and NumPy. Plotting libraries like hvPlot build on top of all of these. Achieving this interoperability is not something you could do on your own. You would need to rebuild the entire stack.
Open-source projects can be fragile. Projects that are maintained by only one person or a small group of people are used by millions of users — that feels fragile. The best way for companies to support these projects is by hiring existing maintainers or hiring people who can become maintainers and build up trust with a community over time.
At Anaconda, we’re proud to support our employees’ involvement in open-source initiatives. To learn more about Dask and pandas and how we contribute to other open-source projects, visit our Open Source page.