One of IOPro’s more advanced features is the ability to work with csv data directly from Amazon’s S3 service. Combining this with Wakari’s notebook sharing feature gives us a powerful way to collaborate on data analysis using data stored in the cloud. To illustrate this, let’s use a dataset available on the US Federal Election Commission’s website that lists every donation made to the US presidential candidates in the 2012 election. I’ve used this dataset extensively to test and benchmark IOPro, but have never explored it just for fun. The code examples in this post are available in an IPython notebook which can be imported into Wakari.

First off, let’s upload our data to S3 so others can use it too. Before doing that though, we’ll compress our csv file to save on storage space and bandwidth. Using gzip, we can get our csv file from 941MB down to 155MB. Also, we’ll go ahead and generate an index file that we’ll upload along with the csv file:

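import IOPro
# Assumed completion; the rest of this snippet was cut off in the source.
adapter = IOPro.text_adapter('P00000001-ALL.csv.gz', compression='gzip')
adapter.create_index('P00000001-ALL.csv.gz.idx')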



This creates a 27MB index file that will allow us to do fast random lookups in the FEC data. By naming the index file the same as the csv file with a ‘.idx’ extension added, IOPro can find the index file automatically once uploaded to S3.
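If you'd rather do the compression step from Python than from the command line, here is a minimal sketch using only the standard library. The helper names `gzip_file` and `index_name_for` are made up for this example and are not part of IOPro. The second helper simply spells out the naming convention described above.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path with gzip and return the path of the compressed file."""
    if dst_path is None:
        dst_path = src_path + '.gz'
    # Stream the file in chunks so a ~1GB csv never has to fit in memory
    with open(src_path, 'rb') as src, gzip.open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


def index_name_for(csv_path):
    """Name of the index file: the data file name with '.idx' appended."""
    return csv_path + '.idx'
```

For example, `index_name_for('P00000001-ALL.csv.gz')` gives `'P00000001-ALL.csv.gz.idx'`, which is the name to use for the index file that gets uploaded next to the data.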

After uploading the csv file and index file to S3, let’s try reading a few records from an IPython notebook in Wakari. To try this yourself, you’ll need to use your own S3 access key and secret key:

access_key = '' # PUT YOUR ACCESS KEY HERE
secret_key = '' # PUT YOUR SECRET KEY HERE
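# compression='gzip' is an assumed completion; the call was cut off in the source.
adapter = IOPro.s3_text_adapter(access_key, secret_key,
    'dev-wakari-public', '/FEC/P00000001-ALL.csv.gz',
    compression='gzip')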



Since we’ve created an index file, we can do fast lookups in the middle or end of the dataset without downloading and parsing a bunch of extra records:
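To get a feel for why an index makes this cheap, here is a rough sketch of the general idea on a plain, uncompressed file: remember the byte offset of every Nth record, then seek straight to the nearest one. This is not IOPro's actual index format, and it ignores the extra work a real index has to do for gzip-compressed data. The function names and the `density` parameter are assumptions made up for this illustration.

```python
import io


def build_offset_index(f, density=1000):
    """Record the byte offset of every `density`-th line of a binary file."""
    offsets = []
    f.seek(0)
    line_no = 0
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line_no % density == 0:
            offsets.append(pos)
        line_no += 1
    return offsets


def read_record(f, offsets, n, density=1000):
    """Return line n by seeking to the closest indexed offset at or before it."""
    f.seek(offsets[n // density])
    # At most density - 1 lines are read and thrown away
    for _ in range(n % density):
        f.readline()
    return f.readline()


# Usage on an in-memory "file" of 5000 small records
f = io.BytesIO(b''.join(b'row%d\n' % i for i in range(5000)))
offsets = build_offset_index(f)
record = read_record(f, offsets, 2500)
```

The trade-off is the usual one: a denser index means a bigger index file but less data to skip over on each lookup.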




Now let’s go ahead and retrieve only the fields we need:

data = adapter[["cand_nm", "contbr_city", "contbr_st",
                "contb_receipt_amt", "contb_receipt_dt"],][:]



Upgrading to one of Wakari’s premium plans will give you more features and more memory to play with, but the 512MB available in the free trial plan should be adequate for this simple example.

Now we can start playing around with our dataset. How about finding out what states donate the most money per capita to presidential campaigns? First we’ll need some data on state populations. The website provides a csv file of state populations, which we can read straight into a NumPy array with IOPro.

census2010_url = '' # PUT THE CENSUS CSV URL HERE
census_adapter = IOPro.text_adapter(census2010_url)
state_populations = census_adapter[["Name","CENSUS2010POP"],][:]



Since the census data uses state names as keys, and our FEC dataset uses state abbreviations as keys, we need a list of states with matching abbreviations.

state_abbrev = IOPro.s3_text_adapter(access_key, secret_key,
    'dev-wakari-public', '/FEC/states.csv')[:]



Now let’s calculate the total sum of campaign contributions for each state.

import numpy as np
sums = {}
for state in state_abbrev:
    state_mask = data["contbr_st"] == state["Abbreviation"]
    sums[state["State"]] = data[state_mask]["contb_receipt_amt"].sum()
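The loop above scans the whole contributions array once per state. That's fine for a one-off, but if it feels slow, the same totals can be computed in a single pass with plain NumPy. This is just a sketch of an alternative, not what the notebook does. The helper name `sum_by_key` and the toy arrays are made up for illustration.

```python
import numpy as np


def sum_by_key(keys, amounts):
    """Return (unique_keys, totals), where totals[i] sums amounts for unique_keys[i]."""
    # inverse maps every row to the position of its key in unique_keys
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    totals = np.bincount(inverse.ravel(), weights=amounts,
                         minlength=len(unique_keys))
    return unique_keys, totals


# Toy data standing in for data["contbr_st"] and data["contb_receipt_amt"]
toy_states = np.array(['TX', 'CA', 'TX', 'NY', 'CA'])
toy_amounts = np.array([100.0, 250.0, 50.0, 75.0, 25.0])
unique_states, totals = sum_by_key(toy_states, toy_amounts)
sums_by_abbrev = dict(zip(unique_states, totals))
```

Note that this version keys the totals by abbreviation rather than by state name, so you'd still use the abbreviation table to translate before joining with the census data.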



Finally we’ll calculate per capita campaign contribution amounts using the census data, and plot the results.

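import pylab
per_capita = np.array([],
    dtype=[('Name', 'O'),('Abbrev', 'O'),('Donations Per Capita', 'f4')])
for state in state_abbrev:
    state_mask = state_populations["Name"] == state["State"]
    # Append an empty record, then fill it in. The population lookup is an
    # assumed completion; the original expression was cut off.
    per_capita = np.append(per_capita, np.zeros(1, dtype=per_capita.dtype))
    per_capita[-1] = (state["State"], state["Abbreviation"],
        sums[state["State"]] / state_populations[state_mask]["CENSUS2010POP"][0])
# Sort ascending by per capita amount, breaking ties by name
per_capita.sort(order=['Donations Per Capita', 'Name', 'Abbrev'])
x = np.arange(len(per_capita))
fig = pylab.figure(figsize=(20,6))
ax = fig.add_subplot(111)
ax.bar(x, per_capita['Donations Per Capita'])
ax.set_xticks(x + 0.5)
ax.set_xticklabels(per_capita['Abbrev'])  # assumed: label bars by abbreviation
ax.set_title('Campaign Contributions Per Capita in US Dollars');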
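If you want to try the per capita step without the real data, here is a small self-contained sketch of the same join-and-sort logic. The function name `per_capita_table`, the fixed-width string fields, and the toy numbers are all invented for illustration. None of them come from the FEC or census files.

```python
import numpy as np


def per_capita_table(totals_by_name, population_by_name, abbrev_by_name):
    """Join totals and populations on state name and sort by per capita amount."""
    dtype = [('Name', 'U32'), ('Abbrev', 'U2'), ('Donations Per Capita', 'f4')]
    # Build all rows first, then create the array once instead of growing it
    rows = [(name, abbrev_by_name[name],
             totals_by_name[name] / population_by_name[name])
            for name in abbrev_by_name
            if name in totals_by_name and name in population_by_name]
    table = np.array(rows, dtype=dtype)
    table.sort(order=['Donations Per Capita', 'Name', 'Abbrev'])
    return table


# Toy inputs with made-up names and numbers
toy_totals = {'Alpha': 300.0, 'Beta': 100.0}
toy_populations = {'Alpha': 100, 'Beta': 200}
toy_abbrevs = {'Alpha': 'AL', 'Beta': 'BE'}
toy_table = per_capita_table(toy_totals, toy_populations, toy_abbrevs)
```

States missing from either input are skipped rather than raising an error, which is handy when the names in the two sources don't line up perfectly.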



Wow, the District of Columbia blows away the competition, which isn’t too surprising given that it’s the center of American politics. The state with the second-highest per capita contributions, Illinois, is a bit of a surprise, especially since it is clearly ahead of the relatively wealthy state of Massachusetts. Illinois does contain the city of Chicago, where US president Barack Obama maintains a home away from the White House. Could hometown enthusiasm for Obama have helped open the wallets of Illinois residents?

This is just a quick and dirty example, and it only scratches the surface of what this dataset has to offer. Import the IPython notebook into your Wakari account today to explore it further!

About the Author

Q. What is your superpower(s)?

A. Developer

Q. What is your technical specialty or area of research?

A. Software developer for IOPro, NumPy and Numba
