CyberPandas: Extending Pandas with Richer Types

 

Over the past couple of months, Anaconda has supported a major internal refactoring of pandas. The outcome is a new extension array interface that will enable an ecosystem of rich array types that meet the needs of pandas’ diverse user base. Using the new interface, we’ve built a library called cyberpandas: a high-performance container for IP address data that can be stored inside a DataFrame.

Some Background

Roughly speaking, you can think of a pandas DataFrame as a dictionary of NumPy arrays. NumPy provides all the basic types like floats, ints, and datetimes. These are often sufficient, but pandas users sometimes need richer types like datetimes with time zones, or Categoricals.

Historically, pandas has supported these richer types by hacking up our internals to special case things. Let’s consider Categorical, which is a way to represent data that comes from a fixed set of discrete values called categories. Internally, this is actually two arrays. So the categorical

>>> Categorical(['a', 'a', 'b', 'b'])

would be stored as:

  1. The categories: Index(['a', 'b'])
  2. The codes: array([0, 0, 1, 1])
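
The two-array layout can be inspected directly. This is a small sketch using the public `categories` and `codes` attributes of `pd.Categorical`; the variable names are only for illustration.

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'b', 'b'])

# The distinct values, stored once.
categories = list(cat.categories)

# One small integer per element, pointing into the categories.
codes = cat.codes.tolist()

# The original values can be rebuilt from the two arrays.
rebuilt = [categories[i] for i in codes]

print(categories)
print(codes)
print(rebuilt)
```

Each element costs one small integer instead of a full Python object, which is why Categorical is attractive for columns with many repeated values.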

In the past, pandas had a list of these “extension” array types that we had implemented internally. Whenever it encountered a Categorical, say, pandas sent things down a special code path just for them. The maintenance burden for each extension array was extremely high. We had to be very picky about which types we deemed worthy of inclusion in pandas, which limited pandas’ use in “niche” fields that would like to use data types not included in NumPy or pandas.

We were able to take the time to define a proper interface for what pandas considers an “array”. This is the new ExtensionArray interface that’s included in pandas 0.23.0. When pandas encounters one of these arrays, it’ll happily store the array as-is, rather than coercing it to a NumPy array.
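
As a rough illustration of "stored as-is", the sketch below assumes pandas 0.23 or later and uses Categorical as the example, since it is an extension array that ships with pandas. It only checks what kind of object backs each Series; it is not a guide to implementing the interface.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# A column backed by a plain NumPy array.
plain = pd.Series([1, 2, 3])
plain_is_ndarray = isinstance(plain.values, np.ndarray)

# A column backed by an extension array (Categorical here).
ser = pd.Series(pd.Categorical(['a', 'a', 'b', 'b']))
stored = ser.values

# The Categorical is kept as an extension array, not coerced to an ndarray.
stored_is_extension = isinstance(stored, ExtensionArray)
stored_is_ndarray = isinstance(stored, np.ndarray)

print(plain_is_ndarray, stored_is_extension, stored_is_ndarray)
```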

Introducing Cyberpandas

One of these “niche” fields (which isn’t all that niche) is cyber-security. For this community, or even just your general web developer or sys admin, it’s common to have datasets that include IP Addresses.

In the past, IP Addresses would probably be stored as strings. But this is error prone (not all strings are IP Addresses) and slow. Building on the ExtensionArray interface, cyberpandas provides two new types: one for IP Address data and one for MAC Address data.
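
To make the "not all strings are IP addresses" point concrete, here is a small sketch using Python's standard-library `ipaddress` module. It is only an illustration of the validation problem, not a description of how cyberpandas stores its data, and `is_ip_address` is a hypothetical helper written for this example.

```python
import ipaddress


def is_ip_address(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


candidates = [
    '192.168.1.1',
    '2001:0db8:85a3:0000:0000:8a2e:0370:7334',
    '192.168.1.256',   # last octet is out of range
    'not an address',
]

# A column of plain strings gives no guarantee that every entry is valid.
results = [is_ip_address(c) for c in candidates]
print(results)
```

With a string column, a check like this has to be repeated by every consumer of the data; a dedicated array type can validate once, at construction.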

In [1]: from cyberpandas import IPArray

In [2]: import pandas as pd

In [3]: arr = IPArray(['192.168.1.1',
   ...:                '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
   
In [4]: ser = pd.Series(arr)

In [5]: ser
Out[5]: 
0                     192.168.1.1
1    2001:db8:85a3::8a2e:370:7334
dtype: ip

Notice the dtype. The data are still stored as an IPArray. This, combined with a custom accessor, enables a high-performance workflow that will feel natural to pandas users:

In [6]: ser.ip.is_ipv6
Out[6]: 
0    False
1     True
dtype: bool

Cyberpandas can be installed today from conda-forge and PyPI.

The Broader Picture

This is an exciting development in pandas history. It will solve some of the longest-standing issues, like the lack of integer-NA support. We’ll be able to prototype Apache Arrow-backed DataFrames within pandas, side-by-side with NumPy-backed versions.
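
The integer-NA issue is easy to reproduce with NumPy-backed columns. The sketch below shows only the problem, not any particular fix: NumPy integers have no missing-value marker, so a single missing entry pushes the whole column to floating point.

```python
import pandas as pd

# With no missing values, the column keeps an integer dtype.
complete = pd.Series([1, 2, 3])

# One missing value forces the column to float so NaN can stand in for it.
with_missing = pd.Series([1, 2, None])

print(complete.dtype.kind)      # 'i' for integer
print(with_missing.dtype.kind)  # 'f' for float
```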

More importantly, it will enable developers to define custom array types outside of pandas. I’ve already spoken to medical researchers who like to use pandas for their data analysis, but struggle with linking their tabular data with their MRI data. The MRI data could now be stored in an MRIArray that satisfies the new extension array interface, and stored in a DataFrame just like any other column. Other examples would be DataFrames backed by GPU memory, or columns for storing nested JSON data.

