In my last blog post on IOPro I promised new benchmarks that show off IOPro’s improved csv text parsing performance. The latest version of IOPro (version 1.2.3) has a vastly improved memory footprint and run time performance, especially for string data. The latest version can be obtained from within Anaconda by running the command ‘conda update IOPro’. If you haven’t tried out Anaconda yet, what are you waiting for? Learn more about it here, and then download it here.
Let’s start with IOPro’s memory footprint. A recent benchmark comparison showed that IOPro was effectively loading two copies of the data set into memory. This was surprising to me, since IOPro’s csv parsing engine is designed from the ground up to use as little memory as possible and to avoid making multiple copies of data. After a bit of playing around, I found the unnecessary copy and was able to cut memory usage in half, to where it should be. For comparison, I’ll use IOPro 1.1 as well as Pandas’ fast new csv parser. As far as I know, Pandas has the only csv parser available for Python that compares with IOPro’s performance. Memory usage was measured with valgrind using the same large numerical astronomy data set used in our previous benchmarks:
Now this is more like it! IOPro also avoids making multiple copies of data without compromising on speed:
Keep in mind IOPro also checks for NA values by default just like Pandas.
What about string heavy data sets? Previous versions of IOPro were not particularly impressive when it came to converting csv data to string dtypes. Memory usage for string types in IOPro 1.2.3 has dramatically improved, partly by using string object types as the default string type instead of fixed length strings, but also by avoiding storing multiple copies of the same string object. The following shows off parsing the 151 MB FEC data set used in previous benchmarks:
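As an aside on the string handling described above, the idea of storing each distinct string only once can be sketched in plain Python. This is not IOPro's implementation, just a minimal illustration using the standard library; the function name `parse_with_shared_strings` and the sample data are hypothetical.

```python
import csv
import io


def parse_with_shared_strings(text):
    """Parse csv text, reusing one string object per distinct field value."""
    cache = {}  # maps each distinct value to the single object kept for it
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # setdefault returns the cached object if this value was seen before,
        # so repeated values share one object instead of being stored again.
        rows.append([cache.setdefault(value, value) for value in record])
    return rows, cache


# Hypothetical sample: state and surname columns with many repeated values.
sample = "NY,Smith\nNY,Jones\nCA,Smith\n"
rows, cache = parse_with_shared_strings(sample)
print(len(cache), "distinct strings for", sum(len(r) for r in rows), "fields")
```

For columns with few distinct values (states, party names, and so on), the number of string objects grows with the number of distinct values rather than with the number of records.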
Don’t forget that you don’t have to modify your existing Python code for IOPro to take advantage of its fast csv text parsing abilities. IOPro comes with optimized versions of NumPy’s loadtxt and genfromtxt functions built on top of IOPro that give you the same functionality as the NumPy versions.
IOPro uses just about the bare minimum amount of memory needed for parsing and loading csv data into NumPy arrays, but can the already blazing fast run time performance be improved even more? Expect more performance improvements in the near future.
New to IOPro 1.2.3 is experimental integration with NumbaPro, the amazing NumPy-aware Python compiler also available in Anaconda. Previously, when parsing messy csv data, you had to either use a very slow custom Python converter function to convert the string data to the target data type, or use a complex regular expression to define the fields in each record string. The regular expression feature of IOPro will certainly remain a useful and valid option for certain types of data, but it would be nice if custom Python converter functions weren’t so slow as to be almost unusable. Numba solves this problem by compiling your converter functions on the fly without any action on your part. Simply set the converter function with a call to set_converter() as before, and IOPro + NumbaPro will handle the rest. To illustrate, I’ll show a trivial example using the sdss data set again. Take the following converter function, which converts the input string to a floating point value and rounds it to the nearest integer, returning the integer value:
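A minimal sketch of a converter matching that description is shown here; the name `convert_value` is an assumption, not necessarily what the original listing used, and the IOPro call that registers it with an adapter is left out.

```python
def convert_value(input_str):
    """Convert a string field to a float, then round to the nearest integer."""
    float_value = float(input_str)
    # Note: Python 3's round() sends exact halves to the nearest even integer
    # (2.5 -> 2); for typical measurement data this rarely matters.
    return int(round(float_value))


# Plain-Python usage, independent of IOPro:
print(convert_value("3.7"))
```

Because the function is ordinary Python, it can be tried on a few sample field values before it is handed to the text adapter.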
We’ll use it to convert field 1 from the sdss dataset to an integer. By calling the set_converter method with the use_numba parameter set to either True or False (the default is True), we can run the converter function both as interpreted Python and as code compiled by Numba through LLVM. In this case, compiling the converter function with NumbaPro gives us a 5x improvement in run time performance. To put that in perspective, the Numba compiled converter function takes about the same time as converting field 1 to a float value using IOPro’s built-in C compiled float converter function. That isn’t quite an “apples to apples” comparison, but it does show that NumbaPro enables user defined Python converter functions to achieve speeds in the same league as compiled C code.
Data in the Cloud
Also new to IOPro is the ability to parse csv data stored in Amazon’s S3 cloud storage service. The S3 text adapter constructor looks slightly different than the normal text adapter constructor:
The first two parameters are your AWS access key and secret key, followed by the S3 bucket name and key name. The S3 csv data is downloaded in 128K chunks and parsed directly from memory, bypassing the need to save the entire S3 data set to disk first. IOPro can also build an index for S3 data just as with disk based csv data, and use the index for fast random access lookup. If an index file is created with IOPro and stored with the S3 dataset in the cloud, IOPro can use this remote index to download and parse just the subset of records requested. This allows you to generate an index file once and share it on the cloud along with the data set, and does not require others to download the entire index file to use it.
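The two ideas in this paragraph, parsing records from fixed-size chunks held in memory and using an index of record offsets to fetch only a subset, can be sketched with the standard library alone. This is an illustration under stated assumptions, not IOPro's code: an `io.BytesIO` object stands in for the S3 download stream, the function names are hypothetical, and fields are split naively on commas with no quoting support.

```python
import io

CHUNK_SIZE = 128 * 1024  # the 128K chunk size mentioned above


def iter_records(stream, chunk_size=CHUNK_SIZE):
    """Yield records (lists of str) from a binary stream, one chunk at a time."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split(b"\n")
        leftover = lines.pop()  # the last piece may be an incomplete record
        for line in lines:
            if line.strip():
                yield line.decode("utf-8").rstrip("\r").split(",")
    if leftover.strip():
        yield leftover.decode("utf-8").rstrip("\r").split(",")


def build_index(stream):
    """Return the byte offset at which each non-empty record starts."""
    stream.seek(0)
    offsets = []
    position = 0
    for line in stream:
        if line.strip():
            offsets.append(position)
        position += len(line)
    return offsets


def read_subset(stream, offsets, start, stop):
    """Read and parse only records start..stop-1, using the offset index."""
    stream.seek(offsets[start])
    if stop < len(offsets):
        data = stream.read(offsets[stop] - offsets[start])
    else:
        data = stream.read()  # the subset runs to the end of the data
    return [line.decode("utf-8").split(",") for line in data.splitlines() if line]


# Hypothetical sample data standing in for a remote csv object.
data = io.BytesIO(b"id,ra,dec\n1,10.5,-3.2\n2,11.0,-2.9\n3,12.25,0.4\n")
offsets = build_index(data)
print(read_subset(data, offsets, 2, 4))
```

With remote storage, the `seek` and `read` pair would presumably become a byte-range request, so only the bytes for the requested records cross the network.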
To see the new S3 text adapter in action, take a look at Wakari, our web-based Python data analysis environment, where several public S3 data sets are provided.
Whether analyzing csv and relational database data locally in Anaconda, or S3 data in the cloud on Wakari, there simply isn’t an easier or more efficient solution available for loading your data into NumPy arrays than IOPro. In addition, IOPro 1.2.3 can load S3 CSV data directly into NumPy arrays, create and load local and remote index files for your data, and can use NumbaPro to speed up user defined converter functions. Look forward to optimized ODBC drivers and support for NoSQL databases in the near future.