One of IOPro’s more advanced features is the ability to work with csv data directly from Amazon’s S3 service. Combining this with Wakari’s notebook sharing feature gives us a powerful way to collaborate on data analysis using data stored in the cloud. To illustrate this, let’s use a dataset available on the US Federal Election Commission’s website that lists every donation made to the US presidential candidates in the 2012 election. I’ve used this dataset extensively to test and benchmark IOPro, but have never explored it just for fun. The code examples in this post are available in an IPython notebook which can be imported into Wakari.
First off, let’s upload our data to S3 so others can use it too. Before doing that though, we’ll compress our csv file to save on storage space and bandwidth. Using gzip, we can get our csv file from 941MB down to 155MB. Also, we’ll go ahead and generate an index file that we’ll upload along with the csv file:
This creates a 27MB index file that will allow us to do fast random lookups in the FEC data. By naming the index file the same as the csv file with a ‘.idx’ extension added, IOPro can find the index file automatically once uploaded to S3.
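To give a feel for why an index makes random lookups fast, here is a minimal standard-library sketch of the general idea: record the byte offset of every line once, then seek straight to the record you want. This is not IOPro's index format or API. The helper names `build_offset_index` and `read_record` are hypothetical, the rows are made up, and the sketch assumes an uncompressed file with no newlines embedded in quoted fields.

```python
import os
import tempfile

def build_offset_index(csv_path):
    """Return a list holding the starting byte offset of each line."""
    offsets = []
    with open(csv_path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets

def read_record(csv_path, offsets, n):
    """Seek directly to line n (0-based, header is line 0) and return its text."""
    with open(csv_path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode("utf-8").rstrip("\r\n")

# Toy file with made-up rows (not FEC data), just to exercise the helpers.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("state,amount\nAA,10\nBB,20\nCC,30\n")
    demo_path = tmp.name

demo_offsets = build_offset_index(demo_path)
third_row = read_record(demo_path, demo_offsets, 3)  # jumps past the earlier rows
os.remove(demo_path)
```

Building the offsets costs one pass over the file, but every lookup after that is a single seek and read, no matter how deep into the file the record sits.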
After uploading the csv file and index file to S3, let’s try reading a few records from an IPython notebook in Wakari. In order to try this yourself, you’ll need to use your own S3 access key and secret key:
Since we’ve created an index file, we can do fast lookups in the middle or end of the dataset without downloading and parsing a bunch of extra records:
Now let’s go ahead and retrieve only the fields we need:
Upgrading to one of Wakari’s premium plans will give you more features and more memory to play with, but the 512MB available in the free trial plan should be adequate for this simple example.
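Keeping only the columns you need is what lets a dataset like this fit in a small memory budget. As a rough, library-agnostic illustration (using Python's `csv` module rather than IOPro), the sketch below streams rows and keeps just two fields. The helper `read_columns` is hypothetical, the column names are assumed to match the FEC file's header, and the sample rows are made up.

```python
import csv
import io

def read_columns(lines, wanted):
    """Yield one tuple per row containing only the wanted columns, in order."""
    reader = csv.DictReader(lines)
    for row in reader:
        yield tuple(row[name] for name in wanted)

# Made-up sample rows; the header names are assumptions about the FEC layout.
sample = io.StringIO(
    "cand_nm,contbr_nm,contbr_st,contb_receipt_amt\n"
    "Candidate A,Donor One,AA,250.00\n"
    "Candidate B,Donor Two,BB,100.50\n"
)

selected = list(read_columns(sample, ["contbr_st", "contb_receipt_amt"]))
```

Because `read_columns` is a generator, only one full row is held in memory at a time; the discarded fields never accumulate.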
Now we can start playing around with our dataset. How about finding out which states donate the most money per capita to presidential campaigns? First we’ll need some data on state populations. The census.gov website provides a csv file of state populations, which we can read straight into a NumPy array with IOPro.
Since the census data uses state names as keys, and our FEC dataset uses state abbreviations as keys, we need a list of states with matching abbreviations.
Now let’s calculate the total sum of campaign contributions for each state.
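The aggregation itself is a plain group-and-sum. Here is a minimal pure-Python sketch of that step, assuming each record is a `(state abbreviation, amount)` pair; the function name `total_by_state` and the sample values are illustrative only.

```python
from collections import defaultdict

def total_by_state(records):
    """Sum contribution amounts per state abbreviation."""
    totals = defaultdict(float)
    for state, amount in records:
        totals[state] += float(amount)
    return dict(totals)

# Made-up records, not real FEC figures.
demo_records = [("AA", "250.00"), ("BB", "100.50"), ("AA", "50.00")]
demo_totals = total_by_state(demo_records)
```

With the data already in a NumPy array, the same result could be obtained with vectorized operations; the loop above just makes the logic explicit.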
Finally we’ll calculate per capita campaign contribution amounts using the census data, and plot the results.
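The per capita step boils down to joining the two datasets through the name-to-abbreviation list and dividing. The sketch below shows that join under stated assumptions: `per_capita` is a hypothetical helper, all three dictionaries contain toy values, and plotting is left out.

```python
def per_capita(totals_by_abbrev, population_by_name, abbrev_by_name):
    """Divide each state's total by its population, keyed by abbreviation.

    States missing from either the abbreviation list or the totals are skipped.
    """
    result = {}
    for name, population in population_by_name.items():
        abbrev = abbrev_by_name.get(name)
        if abbrev is None or abbrev not in totals_by_abbrev:
            continue
        result[abbrev] = totals_by_abbrev[abbrev] / population
    return result

# Toy values only; none of these are real totals or populations.
toy_totals = {"AA": 300.0, "BB": 100.0}
toy_population = {"Alpha": 100, "Beta": 400, "Gamma": 50}
toy_abbrev = {"Alpha": "AA", "Beta": "BB"}

toy_per_capita = per_capita(toy_totals, toy_population, toy_abbrev)

# Sort from highest to lowest, ready to feed into a bar chart.
ranked = sorted(toy_per_capita.items(), key=lambda kv: kv[1], reverse=True)
```

Skipping unmatched entries keeps the join from failing on rows such as regional totals in the census file that have no state abbreviation.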
Wow, the District of Columbia blows away the competition, which isn’t too surprising given that it’s the center of American politics. The state with the second most contributions, Illinois, is a bit of a surprise, especially since it is clearly ahead of the relatively wealthy state of Massachusetts. Illinois does contain the city of Chicago, where US president Barack Obama maintains a home away from the White House. Could hometown enthusiasm for Obama have helped open the wallets of Illinois residents?
This is just a quick and dirty example, and it’s only the tip of the iceberg. Import the IPython notebook into your Wakari account today to explore this dataset further!