By Tom Augspurger, Data Scientist at Anaconda
Over the past couple of months, Anaconda has supported a major internal refactoring of pandas. The outcome is a new extension array interface that will enable an ecosystem of rich array types that meet the needs of pandas’ diverse user base. Using the new interface, we’ve built a library called cyberpandas: a high-performance container for IP address data that can be stored inside a DataFrame.
Roughly speaking, you can think of a pandas DataFrame as a dictionary of NumPy arrays. NumPy provides all the basic types like floats, ints, datetimes. These are often sufficient, but pandas users sometimes need richer types like datetimes with time zones, or Categoricals.
Historically, pandas has supported these richer types by hacking up our internals to special case things. Let’s consider Categorical, which is a way to represent data that comes from a fixed set of discrete values called categories. Internally, this is actually two arrays. So the categorical
>>> Categorical(['a', 'a', 'b', 'b'])
would be stored as:
- The categories: ['a', 'b']
- The codes: array([0, 0, 1, 1])
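The two-array layout described above can be inspected directly. The following is a small sketch using the public `categories` and `codes` attributes of a pandas Categorical; the variable names are only illustrative.

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'b', 'b'])

# The distinct values, stored once.
categories = list(cat.categories)

# One small integer per element, pointing into the categories.
codes = cat.codes.tolist()

# The original values can be rebuilt by looking each code up.
restored = [categories[code] for code in codes]

print(categories)
print(codes)
print(restored)
```

Storing each distinct value once and repeating only the integer codes is what makes this representation compact when a column has few unique values.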
In the past, pandas had a list of these “extension” array types that we had implemented internally. Whenever it encountered a Categorical, say, pandas sent things down a special code path just for that type. The maintenance burden for each extension array was extremely high. We had to be very picky about which types we deemed worthy of inclusion in pandas, which limited pandas’ use in “niche” fields that would like to use data types not included in NumPy or pandas.
We were able to take the time to define a proper interface for what pandas considers an “array”. This is the new ExtensionArray interface that’s included in pandas 0.23.0. When pandas encounters one of these arrays, it’ll happily store the array as-is, rather than coercing it to a NumPy array.
One of these “niche” fields (which isn’t all that niche) is cyber-security. For this community, or even just your general web developer or sys admin, it’s common to have datasets that include IP Addresses.
In the past, IP Addresses would probably be stored as strings. But this is error prone (not all strings are IP Addresses) and slow. Building on the ExtensionArray interface, cyberpandas provides two new types: one for IP Address data and one for MAC Address data.
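To make the problem with plain strings concrete, here is a rough sketch that uses only Python’s standard-library `ipaddress` module. It is not how cyberpandas works internally (the post does not describe that), and `parse_ip` is a hypothetical helper introduced just for this illustration.

```python
import ipaddress


def parse_ip(text):
    """Return an address object, or None if text is not a valid IP address."""
    try:
        return ipaddress.ip_address(text)
    except ValueError:
        return None


raw = [
    '192.168.1.1',
    '2001:0db8:85a3:0000:0000:8a2e:0370:7334',
    '192.168.1.999',   # looks plausible, but 999 is not a valid octet
    'not an ip',
]

parsed = [parse_ip(text) for text in raw]
valid = [ip for ip in parsed if ip is not None]

# Each valid address knows its version and has a canonical text form.
versions = [ip.version for ip in valid]
canonical = [str(ip) for ip in valid]

# An address is ultimately just an integer, which is far more compact
# to store and compare than a string.
as_int = int(valid[0])

print(versions)
print(canonical)
print(as_int)
```

A column of strings accepts all four inputs without complaint, so every operation on it would have to re-validate and re-parse the values; a dedicated array type can do that work once, when the data is created.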
In : from cyberpandas import IPArray
In : import pandas as pd
In : arr = IPArray(['192.168.1.1',
...:                '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
In : ser = pd.Series(arr)
In : ser
Out:
0                     192.168.1.1
1    2001:db8:85a3::8a2e:370:7334
dtype: ip
Note the ip dtype in the output. The data are still stored as an IPArray. This, combined with a custom accessor, enables a high-performance workflow that will feel natural to pandas users:
In : ser.ip.is_ipv6
Out:
0    False
1     True
dtype: bool
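For readers unfamiliar with accessors, the sketch below shows the general mechanism using `pd.api.extensions.register_series_accessor`, which is part of pandas’ public extension API. It is an illustration under stated assumptions, not cyberpandas’ implementation: the accessor name `ipdemo` and the class `IPDemoAccessor` are hypothetical, and the data here is a plain Series of strings parsed with the standard-library `ipaddress` module, so it has none of the performance benefits of an IPArray.

```python
import ipaddress

import pandas as pd


# "ipdemo" is a made-up namespace for this example; cyberpandas
# registers its own accessor under the name "ip".
@pd.api.extensions.register_series_accessor("ipdemo")
class IPDemoAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor is attached to.
        self._series = series

    @property
    def is_ipv6(self):
        # Parse every string on each call: simple, but slow compared
        # with an array type that stores parsed addresses.
        return self._series.map(
            lambda text: ipaddress.ip_address(text).version == 6
        )


ser = pd.Series(['192.168.1.1', '2001:db8:85a3::8a2e:370:7334'])
result = ser.ipdemo.is_ipv6
print(result.tolist())
```

The accessor gives domain-specific methods their own namespace on a Series, in the same way that `.str` and `.dt` do for strings and datetimes.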
cyberpandas can be installed today from conda-forge and PyPI.
The Broader Picture
This is an exciting development in pandas’ history. It will solve some of the longest-standing issues, like the lack of an integer type that supports missing values (integer-NA). We’ll be able to prototype Apache Arrow-backed DataFrames within pandas, side-by-side with NumPy-backed versions.
More importantly, it will enable developers outside of pandas to define their own custom array types. I’ve already spoken to medical researchers who like to use pandas for their data analysis, but struggle to link their tabular data with their MRI data. The MRI data could now be stored in an MRIArray that satisfies the new extension array interface, and stored in a DataFrame just like any other column. Other examples would be DataFrames backed by GPU memory, or columns for storing nested JSON data.