CyberPandas: Extending Pandas with Richer Types

 

Over the past couple of months, Anaconda has supported a major internal refactoring of pandas. The outcome is a new extension array interface that will enable an ecosystem of rich array types that meet the needs of pandas’ diverse user base. Using the new interface, we’ve built a library called cyberpandas: a high-performance container for IP address data that can be stored inside a DataFrame.

Some Background

Roughly speaking, you can think of a pandas DataFrame as a dictionary of NumPy arrays. NumPy provides the basic types, such as floats, ints, and datetimes. These are often sufficient, but pandas users sometimes need richer types like datetimes with time zones, or Categoricals.

Historically, pandas has supported these richer types by hacking up our internals to special case things. Let’s consider Categorical, which is a way to represent data that comes from a fixed set of discrete values called categories. Internally, this is actually two arrays. So the categorical

>>> import pandas as pd
>>> pd.Categorical(['a', 'a', 'b', 'b'])

would be stored as:

  1. The categories:
    Index(['a', 'b'])
  2. The codes:
    array([0, 0, 1, 1])
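
As a small illustration of that two-array layout, the sketch below builds the same categorical and looks at both pieces through the public `categories` and `codes` attributes. The variable names are only for illustration.

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'b', 'b'])

# The distinct values, stored once.
categories = cat.categories

# One small integer per element, pointing into `categories`.
codes = cat.codes

# Indexing the categories with the codes recovers the original values.
rebuilt = categories[codes]
```

Storing each distinct value once and referring to it by integer code is what makes the representation compact when the same values repeat many times.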

In the past, pandas had a list of these “extension” array types that we had implemented internally. Whenever it encountered a Categorical, say, pandas sent things down a special code path just for it. The maintenance burden for each extension array was extremely high. We had to be very picky about which types we deemed worthy of inclusion in pandas, which limited pandas’ use in “niche” fields that would like to use data types not included in NumPy or pandas.

We were able to take the time to define a proper interface for what pandas considers an “array”. This is the new ExtensionArray interface that’s included in pandas 0.23.0. When pandas encounters one of these arrays, it’ll happily store the array as-is, rather than coercing it to a NumPy array.
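
To make "store the array as-is" concrete, here is a minimal sketch that uses Categorical, which is itself an extension array in pandas 0.23.0 and later. It assumes `pandas.api.extensions.ExtensionArray` is importable, which is the public location of the base class.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

cat = pd.Categorical(['a', 'a', 'b', 'b'])
ser = pd.Series(cat)

# The Series hands back the Categorical itself, not a NumPy array.
stored = ser.values

# Coercion to NumPy only happens when we explicitly ask for it.
coerced = np.asarray(ser)
```

Here `stored` is still a Categorical, while `coerced` is a plain NumPy array holding the four values.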

Introducing Cyberpandas

One of these “niche” fields (which isn’t all that niche) is cyber-security. For this community, or even just your general web developer or sys admin, it’s common to have datasets that include IP Addresses.

In the past, IP Addresses would probably be stored as strings. But this is error prone (not all strings are IP Addresses) and slow. Building on the ExtensionArray interface, cyberpandas provides two new types: one for IP Address data and one for MAC Address data.
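
To see why plain strings are fragile, consider the sketch below. It uses only Python's standard-library `ipaddress` module rather than cyberpandas, and `is_valid_ip` is a hypothetical helper written for this example.

```python
import ipaddress


def is_valid_ip(text):
    """Return True if `text` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


candidates = ['192.168.1.1', '192.168.1.256', 'not an address']
valid = [is_valid_ip(c) for c in candidates]

# An address is really a fixed-width integer:
# 32 bits for IPv4, 128 bits for IPv6.
as_int = int(ipaddress.ip_address('192.168.1.1'))
```

A column of strings will accept all three candidates without complaint, whereas a dedicated type can reject the invalid ones up front.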

In [1]: from cyberpandas import IPArray

In [2]: import pandas as pd

In [3]: arr = IPArray(['192.168.1.1',
   ...:                '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
   
In [4]: ser = pd.Series(arr)

In [5]: ser
Out[5]:
0                     192.168.1.1
1    2001:db8:85a3::8a2e:370:7334
dtype: ip

Notice the dtype. The data are still stored as an IPArray. This, combined with a custom accessor, enables a high-performance workflow that will feel natural to pandas users:

In [6]: ser.ip.is_ipv6
Out[6]:
0    False
1     True
dtype: bool
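
For readers curious how an accessor like `.ip` hooks into a Series, here is a rough sketch using pandas' public `register_series_accessor` decorator. This is not cyberpandas' implementation: the accessor name `toy_ip` and the class `ToyIPAccessor` are made up for this example, and it parses strings one at a time with the standard-library `ipaddress` module, so it shows the mechanism but not the performance.

```python
import ipaddress

import pandas as pd


@pd.api.extensions.register_series_accessor("toy_ip")
class ToyIPAccessor:
    """Hypothetical accessor for a Series of IP address strings."""

    def __init__(self, series):
        self._series = series

    @property
    def is_ipv6(self):
        # Parse each string and check the protocol version.
        return self._series.map(
            lambda s: ipaddress.ip_address(s).version == 6
        )


ser = pd.Series(['192.168.1.1',
                 '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
result = ser.toy_ip.is_ipv6
```

Once registered, `toy_ip` is available on every Series, which is the same pattern that lets `ser.ip` feel like a built-in.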

Cyberpandas can be installed today from conda-forge and PyPI.

The Broader Picture

This is an exciting development in pandas’ history. It will solve some of the longest-standing issues, like the lack of integer-NA (integer columns that can hold missing values). We’ll also be able to prototype Apache Arrow-backed DataFrames within pandas, side-by-side with NumPy-backed versions.
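
The integer-NA issue is easy to reproduce. The sketch below shows only the problem with NumPy-backed columns, not any particular fix: NumPy integers have no way to mark a missing value, so pandas falls back to floats.

```python
import pandas as pd

# All values present: stored as integers.
ints = pd.Series([1, 2, 3])

# One missing value: the whole column is upcast to float64
# so that NaN can stand in for the missing entry.
with_missing = pd.Series([1, 2, None])

missing_mask = with_missing.isna()
```

An extension array is free to track missing values separately from the integer data, which is why the new interface opens the door to solving this.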

More importantly, it will enable developers outside of pandas to define their own custom array types. I’ve already spoken to medical researchers who like to use pandas for their data analysis, but struggle with linking their tabular data with their MRI data. The MRI data could now be stored in an MRIArray that satisfies the new extension array interface, and stored in a DataFrame just like any other column. Other examples would be DataFrames backed by GPU-memory, or columns for storing nested JSON data.
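
To give a feel for what satisfying the interface involves, here is a deliberately tiny sketch of a third-party array whose elements are images. It is not a real MRIArray: `ToyScanDtype` and `ToyScanArray` are hypothetical names, the methods shown are a minimal subset chosen for this example, and `take` skips bounds validation. The interface has also grown since 0.23.0, so this assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class ToyScanDtype(ExtensionDtype):
    """Hypothetical dtype: each element is one scan (an ndarray)."""
    name = "toy_scan"
    type = np.ndarray
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return ToyScanArray


class ToyScanArray(ExtensionArray):
    """Hypothetical 1-D container of scans; None marks a missing scan."""

    def __init__(self, scans):
        # An object ndarray lets each slot hold an image of any shape.
        data = np.empty(len(scans), dtype=object)
        for i, scan in enumerate(scans):
            data[i] = scan
        self._data = data

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls([scan for arr in to_concat for scan in arr._data])

    @property
    def dtype(self):
        return ToyScanDtype()

    @property
    def nbytes(self):
        return sum(s.nbytes for s in self._data if s is not None)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]          # a single scan
        return type(self)(self._data[item])  # slice or mask: a new array

    def isna(self):
        return np.array([s is None for s in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        # With allow_fill, -1 means "missing" and is filled with fill_value.
        out = [fill_value if (allow_fill and i == -1) else self._data[i]
               for i in np.asarray(indices, dtype=np.intp)]
        return type(self)(out)

    def copy(self):
        return type(self)(self._data.copy())


scans = ToyScanArray([np.zeros((2, 2)), None, np.ones((3, 3))])
df = pd.DataFrame({"patient": ["a", "b", "c"], "scan": scans})
```

The scans sit in the DataFrame next to the ordinary `patient` column, and pandas operations such as `df["scan"].isna()` are routed to the array's own methods.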

