In 1987 Martin H. Krieger wrote an article for the American Journal of Physics titled “The Physicist’s Toolkit”. Krieger develops the concept of a toolkit for physicists and discusses how thinking about doing science as a craft leads to a “more concrete approach.” He writes:
"We may think of scientific work as a craft with an associated toolkit, practiced in guild. The number and kind of tools and practices can be seen as small, teachable, and specifiable."
Similarly, Greg Wilson has taken this idea quite literally; he and his team help scientists to “be more productive by teaching them basic computing skills” in the Software Carpentry classes they offer.
Last week, Continuum released a toolkit of our own: Anaconda Pro. Its tools, libraries, and modules are selected specifically to help domain experts and data scientists manage data and design workflows that efficiently process large collections of data.
In addition to the bundled open-source tools found in Anaconda Community Edition (Anaconda CE), we’ve added tools developed internally at Continuum: IOPro and NumbaPro. We have also partnered with wise.io to bring speed- and memory-optimized machine learning algorithms to Anaconda Pro.
In the previous two posts we explored IOPro, Continuum’s fast and memory-efficient data parsing tool. A critical step in the exploratory phase of data science is munging, or cleaning, the data. Reliable tools like IOPro, which can infer types and keep memory overhead to a minimum, are key to data exploration. Once exploration ends and a path toward analysis is established, IOPro remains useful as the data grows and the analysis must scale with it. Parsing thousands, tens of thousands, or even millions of files (each containing millions of lines) shouldn’t be part of the Big Data discussion. Users should expect parsing to “just work”: it should be fast and intelligent, a background process that gives the user one less concern in navigating Big Data analytics.
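To make the idea of type inference with low memory overhead concrete, here is a minimal sketch in plain Python. It is not IOPro's API or implementation; the function name `infer_column_types` and the int, float, str widening order are assumptions chosen for illustration. The point is only that types can be inferred in a single streaming pass while holding one row in memory at a time.

```python
import csv
import io


def infer_column_types(lines, delimiter=","):
    """Stream delimited text once and infer the narrowest type per column.

    Hypothetical helper for illustration only. Types widen in the order
    int -> float -> str, and only one row is held in memory at a time.
    """
    order = {"int": 0, "float": 1, "str": 2}

    def classify(token):
        # Try the narrowest type first and fall back to wider ones.
        for name, cast in (("int", int), ("float", float)):
            try:
                cast(token)
                return name
            except ValueError:
                continue
        return "str"

    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    types = ["int"] * len(header)  # start narrow, widen as rows arrive
    for row in reader:
        for i, token in enumerate(row[:len(header)]):
            kind = classify(token.strip())
            if order[kind] > order[types[i]]:
                types[i] = kind
    return dict(zip(header, types))


sample = io.StringIO("id,price,city\n1,9.5,Austin\n2,12,Boston\n")
print(infer_column_types(sample))
# prints {'id': 'int', 'price': 'float', 'city': 'str'}
```

Because the reader accepts any iterable of lines, the same function works on an open file handle, so the cost in memory stays roughly constant as the file grows.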
Last week we also released NumbaPro: an Array-oriented Python Compiler for NumPy. NumbaPro is designed for the Big Data Python developer, hacker, data scientist, etc. who wants the speed of C with the ease of Python. NumbaPro compiles, at runtime, pure-Python expressions of vectorized operations. This means that element-wise Python code can achieve C-like performance, while avoiding the low-level bugs that plague numerical C programs. Additionally, NumbaPro can target multi-core architectures and GPUs, and generate code for these automatically from pure Python. NumbaPro uses the open-source Numba and LLVM projects in order to perform dynamic compilation, and works the same across the major platforms (Windows, Mac, Linux).
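The element-wise style described here can be sketched with the open-source Numba project that NumbaPro builds on. This example uses Numba's `vectorize` decorator rather than any NumbaPro-specific API, and the names `logistic` and `make_elementwise` are hypothetical. It assumes Numba and NumPy are installed; if they are not, it falls back to an ordinary interpreted loop so the sketch still runs, just without the compiled speed.

```python
import math


def logistic(x):
    """Scalar kernel written in plain Python: one element in, one out."""
    return 1.0 / (1.0 + math.exp(-x))


def make_elementwise(scalar_fn):
    """Return a function that applies scalar_fn to every element of a sequence.

    Uses Numba's `vectorize` (compiled at runtime, on first call) when Numba
    is available; otherwise falls back to an interpreted loop.
    """
    try:
        import numpy as np
        from numba import vectorize
    except ImportError:
        return lambda values: [scalar_fn(v) for v in values]

    compiled = vectorize(scalar_fn)  # builds a NumPy ufunc, compiled lazily
    return lambda values: compiled(np.asarray(values, dtype=np.float64)).tolist()


logistic_all = make_elementwise(logistic)
print(logistic_all([-1.0, 0.0, 1.0]))
```

The scalar function contains no loops or array indexing; the compiler supplies the loop over elements, which is what lets the same source target different back ends.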
Although Python is a powerful tool for data processing and analysis, one of the headaches that users frequently encounter is the issue of packaging: finding the right version of a package, managing which are installed, and then replicating a developer’s configuration of packages to a production environment.
Anaconda Pro includes a premium Launcher feature for solving this problem. Users can select configurations of various versions of Python, NumPy, SciPy, and other modules, and bundle these into a named “environment”. Each environment behaves as if it were an entirely self-contained installation of Python, with the corresponding packages installed. Whether running cluster jobs or logging in interactively to a single node, you can easily select which environment to use via the “conda” command, or by directly invoking the Python interpreter from the /bin directory of the desired environment.
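Since each environment has its own interpreter, a quick way to confirm which one a job is actually using is to ask Python itself. The sketch below relies only on the standard library; the helper name `describe_environment` is hypothetical, and the `CONDA_DEFAULT_ENV` variable is assumed to be set only when an environment has been activated through conda, so it may be absent.

```python
import os
import sys


def describe_environment():
    """Report which Python installation is running this script."""
    return {
        # Root directory of the installation or environment in use.
        "prefix": sys.prefix,
        # Full path of the interpreter binary that was invoked.
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        # Assumption: set by conda on activation; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print("%-10s %s" % (key, value))
```

Running this with the interpreter from a given environment's /bin directory should report that environment's directory as the prefix, which makes it a handy first line in cluster job logs.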
Of course, the Launcher in Anaconda Pro also includes all of the features from Anaconda Community Edition: a centralized view for a Disco cluster, a convenient location to look up documentation for popular Python packages, and a web-enabled Terminal client to log into any cluster machine running Anaconda.
The wise.io team collectively brings decades of expertise to developing Random Forest (RF) algorithms. RF is typically used to solve classification problems. While there are many implementations of RF, wise.io has developed the fastest RF implementation to date. With WiseRF baked into Anaconda Pro, classification will be faster, cheaper, and more likely to lead to actionable results.
Anaconda Pro — Multi-Tool for Big Data
At Continuum, we want to support the Big Data craftsperson and help them to develop their craft. Half of that support is in the selection of tools — tools like Disco, which provides a Python-based MapReduce Framework.
The other half is, first, recognizing that some tools are deficient or do not yet exist and, second, building those tools with a strong focus on efficiency and usability. There’s a reason the hammer is the most sought-after nail-injecting device: any number of methods and designs could accomplish similar tasks, but a hammer is simple to construct and easy to use. Continuum’s software tools, including those found within Anaconda Pro, are similarly designed with the Big Data craftsperson in mind: easy to use, with powerful results.
At the end of “The Physicist’s Toolkit”, Krieger writes:
When the toolkit does not work or becomes of limited interest, new tools and practices need to be invented, new probes, filters, and models created. The new tools may be seen as adaptations of the old, even if they are used rather differently from them.
It is to this end, adapting old tools into new ones and using a set of “small, teachable, and specifiable” tools, that we created Anaconda Pro. Some of these changes will be incremental, but we hope to build tools that shift the conversation and ideas surrounding data management, so much so that end users no longer think about data management and instead focus on interpreting results and turning them into action.