Interacting Spiral Galaxies. Courtesy of SDSS DR7
At the end of the 16th century, Tycho Brahe achieved a scientist’s dream: he received funding! As the lead principal investigator (PI) on an ambitious data collection project, Brahe sought to index the heavens. He became famous for his extensive and highly accurate measurements of the positions and movements of celestial bodies. Like many PIs, he needed help, and he soon secured the employment of Johannes Kepler.
There is no denying Kepler’s mathematical prowess and ingenuity — but Brahe’s data was critical in developing Kepler’s eponymous Laws of Planetary Motion. In fact, Brahe guarded his astronomical data with a keen eye and refused Kepler’s request to copy the data for personal use.
Contemporary Heavens Gazing
With Queens, Kings, and the nobility of the world gone, contemporary science rests at the foot of the People. Public scientific support leads to public commons growth; often, this results in large data stores shared across the Internet. The Sloan Digital Sky Survey (SDSS) is one such example. Perhaps the most ambitious astronomical data collection project to date, the SDSS project’s stated goals are to “map the Milky Way, search for extra solar planets and solve the mystery of dark energy.”
SDSS houses hundreds of terabytes of data relating to its various goals. Raw and partially analyzed image data are released as phases of the project complete; so far there have been 9 “Data Releases”. Data Release 7 mapped one quarter of the entire sky above the Apache Point Observatory in New Mexico. In addition to the raw data, the SDSS project has made the database of identified objects available. The database allows direct access through SQL queries, but with limits: each query may run for at most 90 seconds and return at most 100,000 rows, and output is available only as CSV, HTML, or XML.
We could issue many small queries and build up a CSV. However, Prof. Robert Sedgewick and Prof. Kevin Wayne of Princeton have very nicely collected small samples [0.1% (20MB), 1% (201MB), 4% (804MB)] of the galaxy objects, so the files come prepared in a variety of sizes. Not only have I found an avenue to developing new physics, but I’ve also found a great test set for IOPro, Continuum’s flexible and fast data processing tool.
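The “many small queries” route could be sketched roughly as follows. This only illustrates the batching logic: `fetch_batch` is a hypothetical stand-in that fabricates rows locally rather than contacting SDSS, and the column names and total row count are made up. How each batch would actually be expressed in SQL is left out.

```python
import csv
import io

ROW_LIMIT = 100_000      # per-query row cap quoted above
TOTAL_ROWS = 250_000     # made-up size of the full result set


def fetch_batch(offset, limit):
    """Hypothetical stand-in for one small query.

    Fabricates rows locally instead of contacting a real database, and
    returns at most `limit` rows starting at row `offset`.
    """
    stop = min(offset + limit, TOTAL_ROWS)
    return [(i, i * 0.001, i * 0.002) for i in range(offset, stop)]


def build_csv(out, limit=ROW_LIMIT):
    """Issue batches until one comes back short, appending each to `out`."""
    writer = csv.writer(out)
    writer.writerow(["objid", "ra", "dec"])  # illustrative column names
    offset = 0
    while True:
        batch = fetch_batch(offset, limit)
        writer.writerows(batch)
        offset += len(batch)
        if len(batch) < limit:  # a short batch means nothing is left
            return offset


buffer = io.StringIO()
rows_written = build_csv(buffer)
print(rows_written, "rows collected")
```

With the made-up sizes here, the loop issues three batches: two full ones and a final short one that ends it.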
Numerical Data Loading with IOPro
Unlike last week’s post, which dealt with a lot of text data, these files contain sky coordinates, magnitudes of wavebands from ultraviolet to infrared, and object IDs. These data values are all represented numerically.
Numerical data often parses faster, and into a smaller memory footprint, than an equally sized file of text data. Just as in last week’s post, I’m interested in how IOPro compares to other Python-based options for processing and loading CSV files into NumPy arrays.
Setup: Lenovo X200, 2.4GHz Core 2 Duo, with 4GB of RAM, running a 64-bit version of Debian with Linux kernel 3.2.
IOPro can both infer NumPy dtypes and allow users to define dtypes, and I wanted to compare both of these approaches. I also wanted to compare how IOPro managed memory and processing time versus NumPy and Pandas. Lastly, I tried a trick of mixing IOPro and Pandas together, which is discussed in detail below. Originally, I started with the largest data set provided by the Princeton group: 804MB of SDSS galaxy objects. Unfortunately, only IOPro was efficient enough with memory to avoid grinding my machine to a halt. So, I downgraded to the medium-sized data set: 201MB of SDSS galaxy objects.
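To make the difference between inferred and user-defined dtypes concrete, here is a minimal sketch. It uses NumPy’s own `genfromtxt` as a stand-in rather than IOPro, and the rows and column names are invented for illustration; they are not taken from the Princeton files.

```python
import io

import numpy as np

# Made-up rows in the spirit of the galaxy files: an ID, two sky
# coordinates, and one magnitude. Column names are illustrative.
csv_text = """objid,ra,dec,r
1001,10.684,41.269,17.2
1002,83.822,-5.391,15.9
1003,201.365,-43.019,16.4
"""

# Option 1: let the loader infer a dtype for every column.
inferred = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                         names=True, dtype=None, encoding="utf-8")

# Option 2: declare the dtypes up front (header row skipped manually).
declared_dtype = [("objid", "u8"), ("ra", "f8"), ("dec", "f8"), ("r", "f4")]
declared = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                         skip_header=1, dtype=declared_dtype)

print(inferred.dtype)
print(declared.dtype)
```

Declaring types lets you choose narrower ones (here a 4-byte float for the magnitude), while inference saves typing when a file has many columns.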
Because memory was a constraint on my machine, let’s begin there. (Memory benchmarks were measured using Massif, Valgrind’s heap profiler.)
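If Valgrind is not at hand, a rough in-process peak can be had from the standard library’s `tracemalloc`. This is a different technique from the Massif measurements used for this post: it only sees allocations made through Python’s allocator, so its numbers are not comparable with the plots. The loader and the data below are synthetic placeholders.

```python
import csv
import io
import tracemalloc


def load_rows(text):
    """Parse CSV text into a list of float tuples (placeholder loader)."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [tuple(float(v) for v in row) for row in reader]


# Synthetic numeric data; the values carry no astronomical meaning.
csv_text = "ra,dec,r\n" + "\n".join(
    f"{i * 0.001:.3f},{i * 0.002:.3f},{15 + i % 7}" for i in range(50_000)
)

tracemalloc.start()
rows = load_rows(csv_text)
current, peak = tracemalloc.get_traced_memory()  # both in bytes
tracemalloc.stop()

print(f"input size: {len(csv_text) / 1e6:.2f} MB")
print(f"held after load: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```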
As you can see, IOPro does an excellent job. Remember, the original data file is 201MB and IOPro keeps overhead to a bare minimum.
To measure processing speed I used Python’s timeit module, repeating calls several times to ensure accurate statistics. Below you can see that the processing times for automatically inferred and user-defined dtypes are nearly identical. This becomes extremely important when datasets have a large number of columns and it becomes unwieldy to define a type for each one.
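A timing harness of this kind might look like the sketch below. The loader being timed is a placeholder built on the standard `csv` module, and the data is synthetic; neither is the benchmark code behind the plots.

```python
import csv
import io
import timeit

# Synthetic numeric CSV held in memory so that disk speed is not measured.
csv_text = "ra,dec,r\n" + "\n".join(
    f"{i * 0.001:.3f},{i * 0.002:.3f},{15 + i % 7}" for i in range(10_000)
)


def load_with_csv_module():
    """Placeholder loader: parse every field to float with the csv module."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [[float(v) for v in row] for row in reader]


CALLS_PER_RUN = 3
# Five independent runs of three calls each; the minimum is the run least
# disturbed by other activity on the machine.
timings = timeit.repeat(load_with_csv_module, repeat=5, number=CALLS_PER_RUN)
best_per_call = min(timings) / CALLS_PER_RUN

print(f"best of {len(timings)} runs: {best_per_call * 1000:.1f} ms per load")
```

Swapping another loader into the `timeit.repeat` call gives a like-for-like comparison on the same input.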
We also observe that compared to other data loaders, IOPro is the most efficient.
What these benchmarks tell me is that IOPro is the way to go for fast and efficient numerical parsing. But why do I have to give up Pandas’s great DataFrame interface? Wouldn’t it be ideal to parse files with IOPro and load those NumPy arrays into Pandas? Fortunately, this is not hard to do.
The black bars in the memory and speed plots above show the memory footprint and processing time of loading data with IOPro, then passing the resulting NumPy array to Pandas. The results are very encouraging. I pay a small penalty for the additional Pandas processing, but not much: the memory overhead is only slightly larger than with pure IOPro, and processing time increases by 14%, a price worth considering for the ease of use of pandas.DataFrame.
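The hand-off itself is short. In this sketch the structured array is built by hand with made-up values and column names; in the workflow described above it would come from the IOPro parse instead.

```python
import numpy as np
import pandas as pd

# A small structured array standing in for the parsed file. Values and
# field names are invented for illustration.
records = np.array(
    [(1001, 10.684, 41.269, 17.2),
     (1002, 83.822, -5.391, 15.9),
     (1003, 201.365, -43.019, 16.4)],
    dtype=[("objid", "u8"), ("ra", "f8"), ("dec", "f8"), ("r", "f4")],
)

# pandas takes the field names of a structured array as column names.
df = pd.DataFrame(records)

# From here the usual DataFrame conveniences apply, e.g. boolean filtering.
bright = df[df["r"] < 17.0]
print(bright)
```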
The Kepler Among Us
Big Data and Human Learning in the 17th century laid the foundations of classical mechanics. Although Kepler was pushing a force-based law prior to his work with Brahe, he needed access to Brahe’s data to complete his theory. In modern times, we have public data resources and open access, and the doors are now wide open to PI and amateur alike to discern Nature’s patterns. However, scientists still need great tools to help them manage their data and their workflows: tools like IOPro and Pandas, which get out of the way and let domain experts focus on their problems.
What new physics will the 21st century Human-Machine Learning yield? I don’t know, but at Continuum, we are developing the tools to help you find out.