This past March, over 50 people came to the PyData Workshop at Google’s offices in Mountain View, CA, and over 120 people came to the evening sprint event. Last weekend, amidst the lights and bustle of midtown Manhattan, the PyData NYC conference debuted to a sold-out crowd. With more space and a little more advance notice, over 225 scientists, quants, data scientists, engineers, students, and Python enthusiasts gathered just ahead of Superstorm Sandy to learn about some of the latest advances in using Python for processing, visualizing, and interpreting data of all kinds.
As someone heavily involved in creating and promoting the SciPy community, I am very excited to see the rapid growth of the PyData community. A very talented group of people is coming together to advance and improve Python and its growing set of tools for getting value out of the data deluge that surrounds us. Easily connecting ideas to data via computational processes that include visualization is the key to managing that deluge. As an accessible programming language with a full stable of advanced calculation libraries, Python empowers the experts who know what to do with data and gives them full control over their solutions. In the sea of data-photons around us, those who develop eyes will be best positioned to respond effectively to the information reflected by data. Python is a nutrient-rich substrate on which these eyes can be built. PyData gives people a conference and a community to show off both their current eyes and the new occipital structures that enable even better eyes on data.
I started the conference off with a quick introduction that emphasized the need in Python for out-of-core solutions to large-scale data problems. Just as NumPy and SciPy (and lately pandas) are extremely useful for processing data that fits in memory, the PyData community is pushing forward with data structures and algorithms that work on out-of-core datasets that can ultimately reach the size of the internet. I was thrilled that we were able to introduce the beginnings of Blaze, a foundational open-source data structure intended to make large-scale data analysis even easier with Python. With Blaze alongside Numba (our Python/NumPy compiler, which can give speed-ups of as much as 1000x) and Bokeh (web-based plotting and visualization), those paying attention can begin to see some of our vision for large-scale data analysis with Python.
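For readers new to the term, "out-of-core" simply means working on data that is never loaded into memory all at once. The following is a minimal sketch of the idea, not Blaze and not anything shown at the conference: it computes the mean of a file of raw float64 values one chunk at a time, using only the Python standard library. The function names `chunked_mean` and `write_demo_file`, the chunk size, and the file layout are all assumptions made for illustration. NumPy users would more likely reach for `numpy.memmap` to get a similar effect.

```python
import os
import tempfile
from array import array


def chunked_mean(path, chunk_items=4096):
    """Mean of a file of raw float64 values, reading one chunk at a time.

    Only `chunk_items` values are held in memory at once, so the file
    can be far larger than available RAM. (Hypothetical helper.)
    """
    total = 0.0
    count = 0
    item_size = array("d").itemsize  # bytes per float64 value
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break  # end of file
            chunk = array("d")
            chunk.frombytes(raw)
            total += sum(chunk)
            count += len(chunk)
    if count == 0:
        raise ValueError("file contains no values")
    return total / count


def write_demo_file(n_values):
    """Write 0.0, 1.0, ..., n_values - 1 as raw float64; return the path.

    (Hypothetical helper that stands in for a large dataset on disk.)
    """
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        array("d", (float(i) for i in range(n_values))).tofile(f)
    return path
```

Calling `chunked_mean(write_demo_file(10000), chunk_items=1000)` reads the file in ten passes of 1000 values each. The same accumulate-per-chunk pattern extends to sums, counts, minima, and other reductions that can be combined across chunks.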
While Continuum Analytics presents the PyData conferences, the PyData community goes far beyond our company. So many great speakers and sponsors stepped up to make PyData NYC 2012 a success. DE Shaw and AppNexus were both Gold sponsors of the event, and AppNexus also sent Dave Himrod and Steve Kannon to give a great keynote talk about how Python is used to handle 14 TB of daily log data to serve up relevant ads. JPMorgan was incredibly generous and donated space for 200 of us to gather for the PyData sprint and extended hallway track. The views from their office building were amazing, and although the turnout for the sprints was small because of Hurricane Sandy, those who came enjoyed one-on-one time with developers of NumPy, scikit-learn, Numba, Blaze, and statsmodels.
The Python Software Foundation (PSF) was a Silver-level sponsor and also sent the current chairman of its board, Van Lindberg, to give a very interesting keynote address that discussed lessons learned in scaling an IT migration, a natural-language patent-search tool, and now the PSF itself. It was very impressive to see a lawyer presenting to a highly technical audience and still managing to teach everyone. NumFOCUS, the foundation behind NumPy, SciPy, IPython, Matplotlib, PyTables, and many other PyData technologies, was a student sponsor and enabled the conference to pay for travel and expenses for several students, including Skipper Seabold (statsmodels) and Jake VanderPlas (scikit-learn). The foundation suffered a setback two months ago with the loss of John Hunter, so it was very nice to see Leah Silen promoting the foundation at the conference and letting everyone know that (thanks to her tireless efforts) NumFOCUS has finally received recognition as a 501(c)(3) public charity. Donations to NumFOCUS are now fully tax-deductible. If you use the scientific Python stack and have been wondering how to give back, your donations will support technical fellowships, equipment grants, development sprints, and continuous integration for all the projects you use regularly.
It was amazing to reconnect with old friends and make new ones at the conference. I was genuinely happy to see all of the great things that are happening in this space. While all of the talks were educational and interesting, my favorite part of the conference was definitely the panel discussions. It was great to hear about real-world experience integrating Hadoop and Python and to hear two clear messages from real users: 1) Disco is a better-performing alternative for a wide variety of scaling needs and integrates much more easily with Python, and 2) the first thing to do with any scaling problem is to step back and see how much you can solve with fewer nodes that have more memory and more disk space, and with better algorithms. One anecdote described a problem that took 2 hours to solve with Hadoop on a cluster (not including setup time) but took 2 minutes on a single machine with Python. It was also clear that Hadoop has a place when you have a tremendous number of log files or other unstructured data and don’t really care about your compute-usage footprint. In addition, very good solutions for integrating Python with Hadoop are emerging (e.g. Mortar Data).
The other panel discussion I really enjoyed was the Parallel Python panel. Andy Terrel, Thomas Wiecki, Brian Granger, and Andreas Klöckner provided a very insightful discussion of the current state of the art and the future of multi-core processing, parallel patterns, and GPU-based computing. The message seems to be that parallel computing is difficult, that future hardware will demand it, and that Python, with its large collection of libraries, is well poised to empower developers to take advantage of all of their available hardware.
I missed many of the talks because of the two tracks, conference-organizing duties, and discussions with the many people who were present. Fortunately, the talks were recorded by Match Point Productions, who were also student sponsors of the conference. The talks should be available online by the end of this week, and we will let you know when they are. Thanks to our amazing sponsors and speakers, the future of PyData is very bright, and we are looking forward to our next conference in Silicon Valley, March 18-20. Take advantage of our early registration price and register today.