Today, Continuum Analytics is releasing a new product, IOPro. This module provides fast reading of data into NumPy arrays. The engine is written in C, which ensures that data can be parsed as fast as it can be read from the source. Additionally, IOPro optimizes memory usage by allowing dtype specification of fields as well as fast indexing for large files which cannot fit into memory. In today’s post I want to introduce basic parsing with IOPro — and for that we’ll need some data.
I’ve mentioned Amazon’s S3 public data sets in previous posts. Large data sets can be difficult to find or even to generate, so I often dip into the S3 public data bucket. In this example, I’m using Wikipedia page traffic statistics collected from 1/1/2011 to 3/31/2011. The full data set is 2160 files (one per hour) totaling ~150GB.
Each line has 4 fields: projectcode, pagename, pageviews, and byte size of the page. Below we see that one person visited the Catalan Wikipedia and viewed the Casino de Manresa page, and 6 people viewed the Castell d’Eramprunyà page. Perhaps the visitors were interested in medieval Andalusian defense systems?
ca Casino_de_Manresa 1 8334
ca Caspar_David_Friedri 2 20242
ca Casquet_Glacial_Pata 1 8640
ca Castell_d%27Eramprun 6 11885
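To make the record layout concrete, here is a minimal sketch in plain Python (not IOPro) that splits the four sample records into typed fields. The helper name `parse_line` is made up for illustration, and the percent-decoding step assumes page names are URL-encoded, as the `%27` in the sample suggests.

```python
from urllib.parse import unquote

SAMPLE = """\
ca Casino_de_Manresa 1 8334
ca Caspar_David_Friedri 2 20242
ca Casquet_Glacial_Pata 1 8640
ca Castell_d%27Eramprun 6 11885
"""

def parse_line(line):
    """Split one record into (projectcode, pagename, pageviews, bytes)."""
    project, page, views, size = line.split(" ")
    # %27 is the percent-encoded form of an apostrophe.
    return project, unquote(page), int(views), int(size)

records = [parse_line(line) for line in SAMPLE.splitlines()]
for record in records:
    print(record)
```

The last record comes back with the page name `Castell_d'Eramprun` and the two numeric fields as integers.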
The files are originally gzipped; with IOPro there are two options for parsing:
- uncompress the file then parse
- parse file without uncompressing
Let’s first work with an uncompressed file:
Above, we are loading the fully parsed file into memory. While this provides faster data retrieval, sometimes we don’t want to, or can’t, store the fully parsed file. IOPro parses text in small chunks, so we can access records deep in the file without loading all of its contents. For example, we can retrieve a record near the six-millionth line.
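The general idea of reaching a record deep in a file without parsing everything before it can be sketched in plain Python. This is not IOPro's implementation, which this post does not show. It is one common technique, a sparse index of byte offsets, and the names `build_line_index` and `read_record` are hypothetical.

```python
import os
import tempfile

def build_line_index(path, every=1000):
    """Record the byte offset of every `every`-th line in one sequential pass."""
    offsets = []
    with open(path, "rb") as f:
        line_no = 0
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line_no % every == 0:
                offsets.append(pos)
            line_no += 1
    return offsets

def read_record(path, offsets, n, every=1000):
    """Seek to the nearest indexed line at or before n, then scan forward."""
    with open(path, "rb") as f:
        f.seek(offsets[n // every])
        for _ in range(n % every):
            f.readline()
        return f.readline().decode("utf-8").rstrip("\n")

# Synthetic stand-in for a page-count file.
with tempfile.NamedTemporaryFile("w", newline="\n", suffix=".txt", delete=False) as tmp:
    for i in range(10_000):
        tmp.write(f"ca Page_{i} {i % 5} {8000 + i}\n")
    path = tmp.name

index = build_line_index(path)
record = read_record(path, index, 7_321)
print(record)  # ca Page_7321 1 15321
os.remove(path)
```

The index costs one pass over the file, after which any lookup reads at most `every` lines instead of millions.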
The arguments of text_adapter define the file we are working with and the way fields within the file are separated. We specify that there are no field names in the header. Next, we define the dtypes of the four fields. Fields 0 and 1 are defined to be of type object, and fields 2 and 3 are defined to be 32-bit and 64-bit integers, respectively.
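For readers less familiar with NumPy dtypes, the same four-field layout can be written as a NumPy structured dtype. This is a NumPy-only sketch, not an IOPro call. The field names are borrowed from the description of the data, and the choice of unsigned integers is an assumption, since the text only gives the widths.

```python
import numpy as np

# Assumption: unsigned integers; the post only states 32-bit and 64-bit.
page_dtype = np.dtype([
    ("projectcode", "O"),  # Python object, any string length
    ("pagename", "O"),     # Python object, any string length
    ("pageviews", "u4"),   # 32-bit unsigned integer
    ("bytes", "u8"),       # 64-bit unsigned integer
])

rows = [
    ("ca", "Casino_de_Manresa", 1, 8334),
    ("ca", "Caspar_David_Friedri", 2, 20242),
]
table = np.array(rows, dtype=page_dtype)

print(table["pageviews"].dtype)  # uint32
print(table["bytes"].sum())      # 28576
```

Each column can then be pulled out by name as an ordinary NumPy array.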
Why use type object? In the exploratory phase of data analysis we don’t know what our data looks like. Variable string lengths are a notorious pain point, and there is no obvious way to handle the outliers of extremely long strings. If we size the dtype for the longest string, we can easily exhaust all the memory we have. For example, the data set we’re using contains strings longer than 1500 characters. With 6 million lines in any given file, a fixed-width field would use nearly 9 GB (6 million × 1500 bytes) just to accommodate a few lengthy strings. But if we parse each string as a Python object, then each string is sized to its own content.
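The arithmetic behind that figure is easy to check. The sketch below compares a fixed-width column sized for the longest string with the raw character count of a hypothetical length distribution. The "ten outliers, everything else 20 characters" split is invented for illustration, and one byte per character is assumed.

```python
ROWS = 6_000_000   # lines per file, from the text above
LONGEST = 1_500    # longest string, one byte per character assumed

# Fixed-width storage reserves LONGEST bytes for every row.
fixed_width_bytes = ROWS * LONGEST
print(fixed_width_bytes / 1e9)  # 9.0 (GB)

# Hypothetical distribution: ten outliers at the maximum, the rest 20 characters.
OUTLIERS = 10
variable_bytes = (ROWS - OUTLIERS) * 20 + OUTLIERS * LONGEST
print(variable_bytes / 1e9)     # about 0.12 (GB)
```

Python string objects also carry per-object overhead, so real object storage is larger than the raw character count, but it grows with the content rather than with the single longest outlier.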
We also have the option of truncating strings. If most of the information in a string is located in the first few characters, we can direct IOPro to convert a string of length x to a smaller length y as it parses each string.
Above, I’ve decided that 20 characters is all I need. In fact, I can do even better. The first field (0) is the Wikipedia project code, and the longest code, zh-classical, is 12 characters, so I could set the dtype for field 0 to |S12.
Truncating, in other words, lets us save even more memory.
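The effect of a fixed-width string dtype can be seen with NumPy alone. This sketch does not use IOPro. It only shows that casting to `S20` or `S12` keeps the leading bytes and fixes the per-element size. The full page name `Caspar_David_Friedrich` is an assumed expansion of the truncated sample shown earlier.

```python
import numpy as np

names = [b"Casino_de_Manresa", b"Caspar_David_Friedrich"]
codes = [b"ca", b"zh-classical"]

# Casting to a fixed-width bytes dtype drops anything past the width.
short_names = np.array(names, dtype="S20")
short_codes = np.array(codes, dtype="S12")

print(short_names[1])        # b'Caspar_David_Friedri'
print(short_names.itemsize)  # 20
print(short_codes[1])        # b'zh-classical'
print(short_codes.nbytes)    # 24
```

Every element now costs exactly 20 (or 12) bytes regardless of the original length, which is where the memory savings come from.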
How does IOPro stack up against other tools? With many variables (hardware, data sets, etc.), performance can be difficult to measure, and the results are often misleading. We can do some basic benchmarking in a pretty straightforward way with Python, but we always need to keep in mind that accurate, reproducible benchmarking is a subtle and difficult task.
Aside from NumPy, the only other tool I use for parsing and processing CSV files is Pandas. It is the go-to library for data analysis in Python, and I think it’s a great tool. I’m excited to see how our product compares to it. So let’s run some comparisons.
All tests were run with the same data set from the public Wikipedia page statistics: pagecounts-20110331-220000. Compressed, the file is 72MB; uncompressed, it’s 246MB. I used a Lenovo X200, 2.4GHz Core 2 Duo, with 4GB of RAM, running a 64-bit version of Debian with Linux kernel 3.2.
To begin with, let’s look at memory. As data sets grow, individual files can grow as well. Talk to anyone involved with processing big data and they will invariably mention a primary concern: can I process the data in memory, or is an out-of-core solution needed? IOPro does extremely well with memory optimization. Memory benchmarks were measured using Massif, Valgrind’s heap profiler.
As you can see above, with IOPro we use significantly less memory than Pandas. Because Pandas stores all non-numeric fields as objects, we expect some gains when using truncated strings. Additionally, because Pandas is more than a data adapter and provides many views of the data, we also expect IOPro's storage of the same file to occupy less memory.
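Massif requires Valgrind. For a quick in-process look at memory use, the standard library's tracemalloc module is a lighter substitute. It is not what produced the numbers in this post, and it only sees allocations made through Python's memory allocators, so its figures will not match Massif's. The sketch below measures a plain-Python parse of synthetic records.

```python
import tracemalloc

def parse(lines):
    """Parse space-delimited records into typed tuples."""
    rows = []
    for line in lines:
        project, page, views, size = line.split(" ")
        rows.append((project, page, int(views), int(size)))
    return rows

# Synthetic stand-in data, not the Wikipedia file.
lines = [f"ca Page_{i} {i % 7} {8000 + i}" for i in range(50_000)]

tracemalloc.start()
rows = parse(lines)
current, peak = tracemalloc.get_traced_memory()  # both in bytes
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

Swapping the body of `parse` for another loader gives a rough side-by-side comparison on the same input.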
Next, let’s measure parsing speed. I used the Python module timeit. Below we see that with this data set, when parsing an uncompressed file with string fields as objects, Pandas and IOPro are nearly matched. We see longer times for parsing files with truncated fields because the convert function:
is an additional function call which slows parsing down. However, we do see some efficiency gains by parsing the gzipped file directly with IOPro. Pandas, at this time, doesn’t allow for compressed file parsing.
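A timing harness along these lines is easy to set up with timeit. The sketch below uses only the standard library and synthetic data. It does not reproduce the IOPro or Pandas measurements, and the function names are illustrative. It times a plain-text parse against parsing a gzipped copy directly.

```python
import gzip
import os
import tempfile
import timeit

def parse(fileobj):
    """Parse space-delimited records into typed tuples."""
    rows = []
    for line in fileobj:
        project, page, views, size = line.rstrip("\n").split(" ")
        rows.append((project, page, int(views), int(size)))
    return rows

def parse_plain(path):
    with open(path, "rt", encoding="utf-8") as f:
        return parse(f)

def parse_gzip(path):
    # Text-mode gzip.open decompresses on the fly; no temporary file is written.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return parse(f)

# Synthetic stand-in data; the real file has roughly six million lines.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "pagecounts.txt")
gz_path = plain_path + ".gz"
lines = [f"ca Page_{i} {i % 7} {8000 + i}\n" for i in range(20_000)]
with open(plain_path, "w", encoding="utf-8", newline="\n") as f:
    f.writelines(lines)
with gzip.open(gz_path, "wt", encoding="utf-8", newline="\n") as f:
    f.writelines(lines)

# Take the best of several runs to reduce noise from other processes.
t_plain = min(timeit.repeat(lambda: parse_plain(plain_path), number=1, repeat=3))
t_gzip = min(timeit.repeat(lambda: parse_gzip(gz_path), number=1, repeat=3))
print(f"plain: {t_plain:.4f}s  gzip: {t_gzip:.4f}s")
```

Which variant wins depends on disk speed, CPU, and how well the data compresses, so the numbers only mean something for the machine and file they were measured on.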
With fast and efficient text parsing, memory-optimized objects, and a simple interface, IOPro is the tool that gets out of your way and lets you get on with your data analysis.
IOPro Documentation can be found here