Several Continuum developers and I were invited to speak at the 93rd American Meteorological Society conference. I took the opportunity to learn a bit about climate science and set out to reproduce the Global Temperature Anomaly, also known as Global Warming. I’m going to take you through the promises and pitfalls of doing climate science with public data using Disco.
Caveat: I am not a climate scientist, and this post should only serve as an illustration of the benefits of Disco. However, if any climate scientists are interested in helping to improve the science, please contact me!
The National Oceanic and Atmospheric Administration (NOAA) helps to maintain more than 10,000 weather stations and delivers world meteorological data. The data is accessible over FTP and includes global surface summary of day (GSOD) data: average temperature, pressure, wind speed, max/min temperature, etc. The data for each station is stored in a gzipped csv file, and the files are organized by year. For example, if we wanted to get the data for 2002 from station 725826-99999, which is in Austin, Texas, we’d download 725826-99999-2002.op.gz from the 2002 directory. Luckily, IOPro can process gzip files extremely efficiently in both time and memory footprint.
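As a minimal sketch of that naming scheme, the snippet below builds the relative path for a station and year and decompresses a gzipped file entirely in memory. It uses Python's standard gzip module rather than IOPro, the helper names are my own, and the payload is a placeholder rather than real GSOD content.

```python
import gzip
import io


def gsod_relative_path(station_id, year):
    """Return '<year>/<station_id>-<year>.op.gz', the layout described above."""
    return "{year}/{station}-{year}.op.gz".format(year=year, station=station_id)


def read_gzipped_lines(fileobj):
    """Decompress a gzipped text file object and return its lines as strings."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as handle:
        return handle.read().decode("ascii").splitlines()


# The Austin, Texas example from the text.
path = gsod_relative_path("725826-99999", 2002)
print(path)

# Round-trip a placeholder payload in memory (NOT real GSOD content),
# just to show that decompression needs no temporary file on disk.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as handle:
    handle.write(b"placeholder header\nplaceholder row\n")
buffer.seek(0)
lines = read_gzipped_lines(buffer)
print(lines)
```

In a real job the file object would come from the FTP download instead of an in-memory buffer.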
After exploring the data, I quickly became aware that, like all data, the NOAA dataset is messy. Stations have missing dates and missing data, and some stations don’t persist through time. So, instead of downloading the full dataset (20GB+), we’re going to write a chained MapReduce solution which can navigate the messy data by filtering and downloading only the necessary files. Again, I want to use the NOAA GSOD data set to demonstrate a rise in global average temperatures per year. I’m (naively) going to calculate an average temperature per year across several hundred stations spread out across the globe.
MapReduce Job Layout
In a past blog post, I’ve discussed the benefits of chaining Disco jobs one after another and how to do it. Below I’m leveraging a chain of MapReduce jobs, passing the results of successive filters to a simple calculation that averages a list of temperatures for the year. Many data processing jobs can be parallelized in a MapReduce scheme for more efficient computing, and I encourage you to rethink how cleansing and filtering steps can be distributed across a cluster.
Remember, the goal of MapReduce is to spread the work across all nodes in a cluster and, if possible, push code to the data. You can generally view the above as a series of data munging tasks: distribute the list of stations and filter, distribute the filtered list and filter again … and finally perform the calculation.
Note: I read that in 1973 there was a significant shift in the number of reporting stations. So, that’s where I decided to start my calculations, and I decided to end them in the last full year of data, 2012.
- Map 1: Map list of Files from 1973 to be downloaded
- Reduce 1: Filter list to only include stations with good coverage
- Map 2: No-Op
- Reduce 2: Take union of sets from Reduce 1
- Map 3: No-Op
- Reduce 3: Filter list to find stations which persist until the present
- Map 4: No-Op
- Reduce 4: Take intersection of sets from Reduce 3
- Map 5: No-Op
- Reduce 5: Filter list to find stations which have good coverage for each year
- Map 6: No-Op
- Reduce 6: Take intersection of sets from Reduce 5
- Map 7: Map list of stations for year. Key = Year, Value = Station ID
- Reduce 7: Calculate Average and Standard Deviation for all stations in a year
The total number of MapReduce steps could be reduced but this would obfuscate the straightforward flow as described above. I also wanted to emphasize that this particular job spends the majority of time filtering and cleaning data and not doing any calculation. MapReduce lets me manage the larger job by breaking it up into smaller tasks, and, if we’re lucky, re-usable code.
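To make the flow concrete, here is a rough sketch of the same filter-then-average logic in plain Python, run on a single machine. It is not the Disco job itself: the function names, the record layout, and the `MIN_DAYS` coverage threshold are all assumptions of mine, and averaging the per-station annual means is only one plausible reading of the final step. The temperatures are made-up numbers.

```python
import statistics

MIN_DAYS = 3  # hypothetical coverage threshold; the post does not state one


def has_good_coverage(records, station, year, min_days=MIN_DAYS):
    """True if the station reported at least min_days readings that year."""
    return len(records.get(station, {}).get(year, [])) >= min_days


def select_stations(records, first_year, last_year):
    """Emulate steps 1-6: successive filters combined with set operations."""
    years = range(first_year, last_year + 1)
    # Steps 1-2: stations with good coverage in the first year.
    candidates = {s for s in records if has_good_coverage(records, s, first_year)}
    # Steps 3-4: keep only stations that still report in the last year.
    persistent = candidates & {s for s in records if last_year in records[s]}
    # Steps 5-6: intersect the sets of well-covered stations for each year.
    per_year = [
        {s for s in persistent if has_good_coverage(records, s, y)} for y in years
    ]
    return set.intersection(*per_year) if per_year else set()


def yearly_statistics(records, stations, first_year, last_year):
    """Emulate step 7: key = year, value = (mean, std) across stations."""
    if not stations:
        return {}
    stats = {}
    for year in range(first_year, last_year + 1):
        station_means = [statistics.mean(records[s][year]) for s in sorted(stations)]
        stats[year] = (
            statistics.mean(station_means),
            statistics.pstdev(station_means),
        )
    return stats


# Synthetic readings: {station: {year: [daily temperatures]}}.
records = {
    "A": {1973: [10, 12, 14], 1974: [11, 13, 15]},
    "B": {1973: [20, 22, 24], 1974: [21, 23, 25]},
    "C": {1973: [5, 6, 7]},             # stops reporting after 1973
    "D": {1973: [1], 1974: [1, 2, 3]},  # poor coverage in 1973
}
stations = select_stations(records, 1973, 1974)
stats = yearly_statistics(records, stations, 1973, 1974)
print(sorted(stations))
print(stats)
```

In the Disco version each filter runs as its own reduce across the cluster, and the set operations are what stitch the partial results back together.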
Again, I’m not a climate scientist, and while the plot above indicates a rising trend not too dissimilar from the accepted Global Temperature Anomaly, it’s incorrect. The above predicts an increase of 0.3°C per decade since 1973. NOAA reports that the global annual temperature has increased at a rate of 0.16°C per decade since 1970. This is a significant difference when it comes to global warming rates. Why do we have such a discrepancy?
Weather Calculations are Hard!
It turns out that we cannot simply filter NOAA and calculate averages. NOAA provides a detailed FAQ outlining many of the problems with calculating average global temperatures. They indicate the data is difficult to compile, for many of the reasons I illustrated above, and that stations are not equally dispersed throughout the globe.
Above is the set of 722 stations used in my calculations of average global temperatures. Notice the lack of reporting stations from Africa, Antarctica, Eastern Europe and Russia, South America, etc. And this is only land coverage! We are also missing the ocean, which covers 71% of the Earth’s surface. We have a clear problem of over- and under-sampling large swaths of land and ocean.
We can make a number of adjustments to properly calculate average global temperatures:
- Gather data from the sea as reported by ICOADS
- Support the creation of CICERO, a new approach to gathering global climate data.
- Use statistical techniques to combine sparse data.
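One simple example of such a statistical technique is to average stations within latitude/longitude grid cells first and then weight each cell by the cosine of its latitude, so that a dense cluster of stations counts only once. The sketch below is an illustration under my own assumptions (5-degree cells, cosine weights, made-up values). It is not the method NOAA uses, and it does nothing for regions that have no stations at all.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=5.0):
    """Map a coordinate to integer (lat, lon) cell indices."""
    # Clamp so that lat == 90 falls in the northernmost cell.
    max_lat_idx = int(90 / cell_deg) - 1
    lat_idx = min(math.floor(lat / cell_deg), max_lat_idx)
    return (lat_idx, math.floor(lon / cell_deg))


def area_weighted_mean(stations, cell_deg=5.0):
    """Average (lat, lon, value) triples per cell, then weight by cos(latitude)."""
    cells = defaultdict(list)
    for lat, lon, value in stations:
        cells[grid_cell(lat, lon, cell_deg)].append(value)
    if not cells:
        raise ValueError("no stations supplied")
    weighted_sum = 0.0
    total_weight = 0.0
    for (lat_idx, _lon_idx), values in cells.items():
        centre_lat = (lat_idx + 0.5) * cell_deg
        weight = math.cos(math.radians(centre_lat))  # proxy for cell area
        weighted_sum += weight * (sum(values) / len(values))
        total_weight += weight
    return weighted_sum / total_weight


# Synthetic values: three clustered stations and one isolated station
# in the same latitude band.
stations = [
    (30.1, -97.7, 10.0),
    (30.2, -97.6, 10.0),
    (30.3, -97.5, 10.0),
    (31.0, 10.0, 20.0),
]
naive = sum(value for _, _, value in stations) / len(stations)
weighted = area_weighted_mean(stations)
print(naive, weighted)
```

The naive mean is pulled toward the cluster, while the gridded mean treats the two cells equally because they sit at the same latitude.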
Conclusion and A Plea
Often when I set out to do any data analysis I quickly run into the molasses of data munging and get bogged down in an ever-sprawling amount of code. What I tried to illustrate is that Disco and MapReduce push me to organize my thoughts, reduce the bulk of a job into modular code pieces, and offer some measure of reduced computing time. When I presented this as a talk at the American Meteorological Society conference, a number of climate scientists conveyed the difficulties in performing global anomaly calculations. Aside from the Smith et al., 2008 paper, I have had difficulty finding any detailed examples, or preferably coding examples, of calculating the average global temperature anomaly time-series.
As a parting thought, I wanted to mention something which famed Python Hacker and Geneticist Titus Brown pointed me to recently. In addition to being a fellow advocate for reproducible research, he linked to a blog specifically discussing the lack of published code within the climate science community. The author summarizes the reasons for not sharing code as:
- Scientists will look foolish because of poor coding and/or bugs
- Code has leveraged new techniques which can advance an individual’s career
I have discussed the issue of reproducible research in the past and I have indicated that there are problems with my work on properly calculating the global temperature anomaly.
At Continuum, we are trying to tackle this problem by building Wakari, our cloud-based data analysis environment. Not only does it place all of the necessary tools for powerful data processing and visualization in your browser, the fact that both code and data are hosted in the cloud means that it’s extremely easy to share your work with others in a reproducible way.