CyberPandas: Extending Pandas with Richer Types

 

Over the past couple of months, Anaconda has supported a major internal refactoring of pandas. The outcome is a new extension array interface that will enable an ecosystem of rich array types that meet the needs of pandas’ diverse user base. Using the new interface, we’ve built a library called cyberpandas: a high-performance container for IP Address data that can be stored inside a DataFrame.

Some Background

Roughly speaking, you can think of a pandas DataFrame as a dictionary of NumPy arrays. NumPy provides all the basic types like floats, ints, and datetimes. These are often sufficient, but pandas users sometimes need richer types like datetimes with time zones, or Categoricals.

Historically, pandas has supported these richer types by hacking up our internals to special case things. Let’s consider Categorical, which is a way to represent data that comes from a fixed set of discrete values called categories. Internally, this is actually two arrays. So the categorical

>>> Categorical(['a', 'a', 'b', 'b'])

would be stored as:

  1. The categories: Index(['a', 'b'])
  2. The codes: array([0, 0, 1, 1])
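
The two-array layout can be inspected directly. The snippet below is a small sketch using the public `categories` and `codes` attributes of `pd.Categorical`; the variable names are only for illustration.

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'b', 'b'])

# The distinct values, stored once.
categories = list(cat.categories)

# One small integer per element, pointing into the categories.
codes = cat.codes.tolist()

# The original values can be rebuilt by looking each code up in the categories.
rebuilt = [categories[code] for code in codes]
```

Each element costs one small integer no matter how long the category labels are, which is where the memory savings of Categorical come from.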

In the past, pandas had a list of these “extension” array types that we had implemented internally. Whenever it encountered a Categorical, say, pandas sent things down a special code path just for them. The maintenance burden for each extension array was extremely high. We had to be very picky about which types we deemed worthy of inclusion in pandas, which limited pandas’ use in “niche” fields that would like to use data types not included in NumPy or pandas.

We were able to take the time to define a proper interface for what pandas considers an “array”. This is the new ExtensionArray interface that’s included in pandas 0.23.0. When pandas encounters one of these arrays, it’ll happily store the array as-is, rather than coercing it to a NumPy array.
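
As a quick check of the "stored as-is" behavior, here is a sketch that assumes pandas 0.23.0 or later, where Categorical is itself implemented on top of the interface. It uses Categorical as a stand-in for any extension array; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

ser = pd.Series(pd.Categorical(['a', 'a', 'b', 'b']))

# The values behind the Series are still the Categorical,
# not a NumPy object array of strings.
stored = ser.values
is_extension = isinstance(stored, ExtensionArray)
kept_type = isinstance(stored, pd.Categorical)
```

Third-party arrays that subclass `ExtensionArray` and implement its required methods get the same treatment.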

Introducing Cyberpandas

One of these “niche” fields (which isn’t all that niche) is cyber-security. For this community, or even just your general web developer or sys admin, it’s common to have datasets that include IP Addresses.

In the past, IP Addresses would probably be stored as strings. But this is error-prone (not all strings are IP Addresses) and slow. Building on the ExtensionArray interface, cyberpandas provides two new types: one for IP Address data and one for MAC Address data.

In [1]: from cyberpandas import IPArray

In [2]: import pandas as pd

In [3]: arr = IPArray(['192.168.1.1',
   ...:                '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
   
In [4]: ser = pd.Series(arr)

In [5]: ser
Out[5]: 
0                     192.168.1.1
1    2001:db8:85a3::8a2e:370:7334
dtype: ip

Notice the dtype. The data are still stored as an IPArray. This, combined with a custom accessor, enables a high-performance workflow that will feel natural to pandas users:

In [6]: ser.ip.is_ipv6
Out[6]: 
0    False
1     True
dtype: bool
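
For intuition about why plain strings are error-prone, the sketch below does the same kind of validation and IPv6 check one element at a time with Python's standard-library `ipaddress` module. It is not how cyberpandas is implemented, and the helper `parse_or_none` and the sample value `'not-an-ip'` are made up for illustration.

```python
import ipaddress

raw = ['192.168.1.1',
       '2001:0db8:85a3:0000:0000:8a2e:0370:7334',
       'not-an-ip']


def parse_or_none(text):
    """Return a parsed address object, or None if the string is not an IP."""
    try:
        return ipaddress.ip_address(text)
    except ValueError:
        return None


parsed = [parse_or_none(text) for text in raw]

# Invalid strings are caught at parse time instead of lingering in the data.
is_valid = [addr is not None for addr in parsed]

# Element-wise equivalent of the is_ipv6 check.
is_ipv6 = [addr is not None and addr.version == 6 for addr in parsed]

# Parsed addresses print in a normalized (compressed) form.
normalized = [str(addr) if addr is not None else None for addr in parsed]
```

A Python-level loop like this is the slow path; a dedicated array type can validate once on construction and then answer such questions for the whole column at once.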

Cyberpandas can be installed today from conda-forge and PyPI.

The Broader Picture

This is an exciting development in pandas’ history. It will solve some of the longest-standing issues, like the lack of an integer NA. We’ll be able to prototype Apache Arrow-backed DataFrames within pandas, side-by-side with NumPy-backed versions.
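
To make the integer-NA issue concrete, here is a small sketch of the default NumPy-backed behavior: an integer column that picks up a single missing value is silently converted to floating point, because NumPy integers have no way to represent "missing".

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])

# 'i' for an integer dtype: no missing values, so the data stay integers.
ints_kind = ints.dtype.kind

# 'f' for a floating-point dtype: the missing value became NaN,
# and the whole column was upcast to make room for it.
missing_kind = with_missing.dtype.kind
```

An extension array is free to track missing values separately from the data, so the integers would not need to change type.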

More importantly, it will enable developers to define custom array types outside of pandas. I’ve already spoken to medical researchers who like to use pandas for their data analysis, but struggle with linking their tabular data with their MRI data. The MRI data could now be stored in an MRIArray that satisfies the new extension array interface, and stored in a DataFrame just like any other column. Other examples would be DataFrames backed by GPU memory, or columns for storing nested JSON data.

