Advanced IOPro

In previous posts here and here, my colleague Ben has done a great job of exploring the performance characteristics of IOPro. A fast, memory efficient csv parser is a great and much needed tool in the Python world, but what else can it do?

Shrinking Your Data

It has been mentioned previously that IOPro can decompress gzip data on the fly, like so:

adapter = IOPro.text_adapter('data.gz', compression='gzip')
array = adapter[:]

Aside from the obvious advantage of being able to store and work with your compressed data without having to decompress it first, you also sacrifice very little performance in doing so. To illustrate this, I have a 419 MB csv file of numerical data, and a 105 MB file of the same data compressed with gzip. Here are the “best of three” run times for loading the entire contents of each file into a NumPy array:

uncompressed: 13.38 sec
gzip compressed: 14.54 sec

The compressed file takes slightly longer, but consider having to decompress the file to disk before loading with IOPro:

uncompressed: 13.38 sec
gzip compressed: 14.54 sec
gzip compressed (decompress to disk, then load): 21.56 sec

Storing your data in a quarter of the space while keeping load times virtually the same is clearly the way to go.
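To make the comparison concrete, here is a minimal sketch of the same kind of measurement using only the Python standard library. It is not IOPro: gzip.open and the csv module stand in for the text adapter, and the sample data, file names and the load helper are all made up for illustration. It shows records being streamed straight out of a gzip file, and a "best of three" timing taken with timeit.

```python
import csv
import gzip
import os
import tempfile
import timeit

# Hypothetical sample data: 1000 records of three numeric fields.
rows = [[i, i * 0.5, i * 2] for i in range(1000)]
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "data.csv")
gzip_path = os.path.join(tmpdir, "data.csv.gz")

# Write the plain csv file, then a gzip-compressed copy of the same bytes.
with open(plain_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)
with open(plain_path, "rb") as src, gzip.open(gzip_path, "wb") as dst:
    dst.write(src.read())


def load(path, opener):
    # Stream records from the file object. In the gzip case the data is
    # decompressed as it is read; no decompressed copy is written to disk.
    with opener(path, "rt", newline="") as f:
        return [[float(x) for x in rec] for rec in csv.reader(f)]


# "Best of three": run each load three times and keep the fastest run.
best_plain = min(timeit.repeat(lambda: load(plain_path, open), repeat=3, number=1))
best_gzip = min(timeit.repeat(lambda: load(gzip_path, gzip.open), repeat=3, number=1))

print(f"uncompressed: {best_plain:.4f} sec")
print(f"gzip compressed: {best_gzip:.4f} sec")
```

The absolute numbers from a toy file like this say nothing about IOPro's performance. The point is only the shape of the experiment: same data, two storage formats, fastest of three runs.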

But I Want it Now!

One of the most useful features of IOPro is the ability to index data to allow for fast random lookup. Let’s take an extreme example and try to retrieve the last record of the compressed dataset we used above:

adapter = IOPro.text_adapter('data.gz', compression='gzip')
array = adapter[-1]

Retrieving the last record into a NumPy array takes 14.82 sec. This is about the same as the time we saw to read the entire file, because we have to read through the entire dataset to get to the last record. Compressed data presents an additional difficulty, because we can’t just seek close to the end of compressed data and start parsing records until we get to the last one. Let’s build an index that will help us out:

index = adapter.create_index()

The above method creates an index in memory and returns it as a NumPy array, taking 9.48 sec. Now when we try seeking to and reading the last record again, it takes a mere 0.02 sec. Since the index is returned as a NumPy array, it’s easy to save it somewhere and reload it later in a new session:

adapter.set_index(index)

Reloading the index only takes 0.18 sec. Build an index once, and get near instant random access to your data forever. Indexing improvements are in heavy development right now, so expect even more interesting features and optimizations in the near future.
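For readers who want to see the idea behind indexing, here is a simplified sketch in plain Python. It is not how IOPro builds its index. It assumes an uncompressed file with one record per line, which sidesteps the extra difficulty of compressed data, and the helper names build_index and read_record are hypothetical. One full pass records where each record starts, and after that any record is a single seek away.

```python
import os
import pickle
import tempfile

# Hypothetical sample file: 10000 records, one per line.
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "data.csv")
with open(data_path, "w", newline="") as f:
    for i in range(10000):
        f.write(f"{i},{i * 0.5}\n")


def build_index(path):
    # One full pass over the file: remember the byte offset of each record.
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_record(path, offsets, n):
    # Seek directly to record n instead of scanning from the top.
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode().rstrip("\n").split(",")


index = build_index(data_path)
last = read_record(data_path, index, -1)
print(last)

# Persist the index so a later session can skip the full pass.
index_path = os.path.join(tmpdir, "data.idx")
with open(index_path, "wb") as f:
    pickle.dump(index, f)
with open(index_path, "rb") as f:
    reloaded = pickle.load(f)
```

The trade-off is the same one described above: building the index costs one full read, and every lookup after that is cheap.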

Let’s Get Our Hands Dirty

What about messy data? IOPro is a set of data adapters, not a miracle worker, but let’s see what we can do to tame less than ideally formatted csv data. Take for example the following snippet of actual NASDAQ stock data found on the Internet:

Apple,AAPL,NasdaqNM,363.32 - 705.07
Google,GOOG,NasdaqNM,523.20 - 774.38
Microsoft,MSFT,NasdaqNM,24.30 - 32.95

The first three fields are easy enough: name, symbol, and exchange. The fourth field presents a bit of a problem. The string represents the range of prices for the past 52 weeks. I really want those values stored in two float fields rather than a single string field, but what can I do? IOPro doesn’t (yet) have support for multiple delimiter characters, but even if you could specify both a comma and space as delimiters, it wouldn’t help us much here. Let’s forget about delimiters and try IOPro’s regular expression based parser:

regex_string = r'([A-Za-z]+),([A-Z]{4}),([A-Za-z]+),([0-9]+\.[0-9]{2})\s-\s([0-9]+\.[0-9]{2})'
adapter = IOPro.text_adapter('data.csv', parser='regex', regex_string=regex_string)
array = adapter[:]

Regular expressions can admittedly get pretty ugly, but they can also be very powerful. By using the above regular expression with the grouping operators ‘(’ and ‘)’, we can define exactly how each record should be parsed into fields. Let’s break it down into individual fields:

([A-Za-z]+) defines the first field (company name) in our output array,

([A-Z]{4}) defines the second (stock symbol),

([A-Za-z]+) defines the third (exchange),

([0-9]+\.[0-9]{2}) defines the fourth field (low price), and

([0-9]+\.[0-9]{2}) defines the fifth field (high price)

The output array contains five fields: three string fields and two float fields. Exactly what we want.
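If you want to check the pattern before handing it to IOPro, you can try it with Python’s own re module. The sketch below is only that: it assumes the dot and the whitespace around the range separator are escaped (\. and \s), that the separator is a plain hyphen, and the parse helper is a hypothetical name. It does not go through IOPro at all; it just shows each group landing in its own field, with the two prices converted to floats.

```python
import re

# The pattern from the text, with the dot and whitespace escaped.
pattern = re.compile(
    r"([A-Za-z]+),([A-Z]{4}),([A-Za-z]+),"
    r"([0-9]+\.[0-9]{2})\s-\s([0-9]+\.[0-9]{2})"
)

lines = [
    "Apple,AAPL,NasdaqNM,363.32 - 705.07",
    "Google,GOOG,NasdaqNM,523.20 - 774.38",
    "Microsoft,MSFT,NasdaqNM,24.30 - 32.95",
]


def parse(line):
    # Each pair of parentheses in the pattern becomes one output field.
    match = pattern.fullmatch(line)
    if match is None:
        raise ValueError(f"record does not match pattern: {line!r}")
    name, symbol, exchange, low, high = match.groups()
    return name, symbol, exchange, float(low), float(high)


records = [parse(line) for line in lines]
for record in records:
    print(record)
```

Note that ([A-Z]{4}) only matches symbols of exactly four letters, which happens to fit these three records. Real data with shorter or longer symbols would need a looser group.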

That’s Not All

IOPro also comes with a fork of the PyODBC module that stores query results in NumPy arrays instead of tuples of Python objects. In addition to being a more memory efficient way to store query results, using NumPy arrays results in far fewer Python objects and memory copy operations. This allows IOPro’s PyODBC module to perform up to 7x faster for whole table queries and up to 12x faster for single column queries.
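The difference between the two result layouts can be illustrated without a database driver or NumPy. The sketch below is a stand-in under loud assumptions: it uses the standard library’s sqlite3 instead of PyODBC and array.array instead of NumPy arrays, and the table and column names are invented. It only contrasts a list of row tuples with one typed buffer per column.

```python
import sqlite3
from array import array

# Hypothetical in-memory table standing in for a real ODBC data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (low REAL, high REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(363.32, 705.07), (523.20, 774.38), (24.30, 32.95)],
)

query = "SELECT low, high FROM prices ORDER BY rowid"

# Row-oriented result: a list of tuples, one Python float object per value.
rows = conn.execute(query).fetchall()

# Column-oriented result: one contiguous buffer of C doubles per column.
low, high = array("d"), array("d")
for low_value, high_value in conn.execute(query):
    low.append(low_value)
    high.append(high_value)

print(rows)
print(low.tolist(), high.tolist())
conn.close()
```

In the columnar layout each value occupies a fixed 8 bytes in a single buffer, rather than living in its own Python object inside a tuple, which is the general reason array-backed results can be lighter on memory.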

To learn more about IOPro, visit our documentation.

On Deck

Continuum Analytics is filled with people who are passionate about data and software development, and we’re just getting started. For IOPro in particular, the features highlighted here are just the tip of the iceberg. Expect to see significant performance optimizations, support for data in the cloud, and more in our next iteration of IOPro coming soon. For a preview, how about some new benchmarks? The following is a comparison of IOPro 1.1, the upcoming IOPro 1.2, and a great looking new csv reader for Pandas released on GitHub recently. The benchmarks were performed using the 700 MB astro dataset used in a recent comparison by Pandas author Wes McKinney.


