CyberPandas: Extending Pandas with Richer Types

 

Over the past couple of months, Anaconda has supported a major internal refactoring of pandas. The outcome is a new extension array interface that will enable an ecosystem of rich array types that meet the needs of pandas’ diverse user base. Using the new interface, we’ve built a library called cyberpandas: a high-performance container for IP address data that can be stored inside a DataFrame.

Some Background

Roughly speaking, you can think of a pandas DataFrame as a dictionary of NumPy arrays. NumPy provides all the basic types, like floats, ints, and datetimes. These are often sufficient, but pandas users sometimes need richer types, like datetimes with time zones or Categoricals.

Historically, pandas has supported these richer types by special-casing them throughout our internals. Let’s consider Categorical, which is a way to represent data that comes from a fixed set of discrete values called categories. Internally, a Categorical is actually two arrays. So the categorical

>>> import pandas as pd
>>> pd.Categorical(['a', 'a', 'b', 'b'])

would be stored as:

  1. The categories: Index(['a', 'b'])
  2. The codes: array([0, 0, 1, 1])
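
The two-array layout can be sketched in plain Python. This is only an illustration of the idea, not how pandas implements Categorical; the helper names `encode` and `decode` are hypothetical, and sorting the categories is an assumption made for the example.

```python
def encode(values):
    """Split values into (categories, codes), mirroring the layout above."""
    categories = sorted(set(values))                 # each distinct value once
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]            # one small integer per element
    return categories, codes


def decode(categories, codes):
    """Rebuild the original values by looking each code up in categories."""
    return [categories[c] for c in codes]


categories, codes = encode(['a', 'a', 'b', 'b'])
print(categories)  # ['a', 'b']
print(codes)       # [0, 0, 1, 1]
```

Each distinct value is stored once, and every element of the column is just an integer pointing into the categories.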

In the past, pandas had a list of these “extension” array types that we had implemented internally. Whenever it encountered a Categorical, say, pandas sent things down a special code path just for that type. The maintenance burden for each extension array was extremely high. We had to be very picky about which types we deemed worthy of inclusion in pandas, which limited pandas’ use in “niche” fields that would like to use data types not included in NumPy or pandas.

We have now taken the time to define a proper interface for what pandas considers an “array”. This is the new ExtensionArray interface that’s included in pandas 0.23.0. When pandas encounters one of these arrays, it’ll happily store the array as-is, rather than coercing it to a NumPy array.

Introducing Cyberpandas

One of these “niche” fields (which isn’t all that niche) is cyber-security. For this community, or even just your general web developer or sys admin, it’s common to have datasets that include IP Addresses.

In the past, IP Addresses would probably be stored as strings. But this is error-prone (not all strings are IP Addresses) and slow. Building on the ExtensionArray interface, cyberpandas provides two new types: one for IP Address data and one for MAC Address data.

In [1]: from cyberpandas import IPArray

In [2]: import pandas as pd

In [3]: arr = IPArray(['192.168.1.1',
   ...:                '2001:0db8:85a3:0000:0000:8a2e:0370:7334'])
   
In [4]: ser = pd.Series(arr)

In [5]: ser
Out[5]: 
0                     192.168.1.1
1    2001:db8:85a3::8a2e:370:7334
dtype: ip

Notice the dtype. The data are still stored as an IPArray. This, combined with a custom accessor, enables a high-performance workflow that will feel natural to pandas users:

In [6]: ser.ip.is_ipv6
Out[6]: 
0    False
1     True
dtype: bool
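
For contrast, here is a sketch of the string-based approach, using only Python’s standard-library `ipaddress` module rather than cyberpandas. The helper `parse_or_none` and the sample value `'not-an-ip'` are made up for illustration.

```python
import ipaddress

raw = ['192.168.1.1',
       '2001:0db8:85a3:0000:0000:8a2e:0370:7334',
       'not-an-ip']  # nothing stops a bad string from entering the column


def parse_or_none(text):
    """Return an ipaddress object, or None if the string is not a valid IP."""
    try:
        return ipaddress.ip_address(text)
    except ValueError:
        return None


# Every string has to be parsed individually before it can be inspected
parsed = [parse_or_none(s) for s in raw]

# Element-by-element counterpart of the `ser.ip.is_ipv6` call above
is_ipv6 = [p.version == 6 if p is not None else None for p in parsed]
print(is_ipv6)         # [False, True, None]
print(str(parsed[1]))  # 2001:db8:85a3::8a2e:370:7334
```

With strings, validation and every per-address check happen one Python object at a time, which is the overhead a dedicated array type is meant to avoid.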

Cyberpandas can be installed today from conda-forge and PyPI.

The Broader Picture

This is an exciting development in pandas’ history. It will solve some of the longest-standing issues, like the lack of integer-NA. We’ll be able to prototype Apache Arrow-backed DataFrames within pandas, side-by-side with NumPy-backed versions.

More importantly, it will enable developers to define custom array types outside of pandas. I’ve already spoken to medical researchers who like to use pandas for their data analysis, but struggle with linking their tabular data with their MRI data. The MRI data could now be stored in an MRIArray that satisfies the new extension array interface, and stored in a DataFrame just like any other column. Other examples would be DataFrames backed by GPU memory, or columns for storing nested JSON data.

