With data volumes growing, the ability to manage and analyze data sets is increasingly important to many governmental agencies. Recently, we were awarded $3 million to develop Blaze and Bokeh as part of the Defense Advanced Research Projects Agency’s (DARPA) XDATA program. The XDATA program seeks to develop new computational techniques and open-source software tools for processing and analyzing both semi-structured and unstructured data. Continuum’s development of Blaze and Bokeh will help address the scalability, interactivity and extensibility challenges at the core of big data processing and visualization.
As explained in a previous post, Blaze is the next generation of NumPy. It aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures while supporting commonly requested features such as missing values, type heterogeneity, and labeled arrays. Blaze is designed to handle out-of-core computations on large data sets that exceed the system memory capacity, as well as on distributed and streaming data.
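To give a rough sense of what "out-of-core" means here, the sketch below computes a mean over a file on disk while holding only one chunk in memory at a time. It uses plain NumPy (`numpy.memmap`), not Blaze's API; the function name `chunked_mean`, the `chunk_size` parameter, and the sample file are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def chunked_mean(path, dtype, length, chunk_size=1_000_000):
    """Mean of a flat binary file, touching at most chunk_size elements at a time.

    Hypothetical helper for illustration; this is not part of Blaze.
    """
    # memmap maps the file into virtual memory without reading it all at once.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(length,))
    total = 0.0
    for start in range(0, length, chunk_size):
        chunk = data[start:start + chunk_size]
        # Accumulate in float64 so partial sums stay accurate across chunks.
        total += float(chunk.sum(dtype=np.float64))
    return total / length


if __name__ == "__main__":
    values = np.arange(10, dtype=np.float64)  # tiny stand-in for a huge data set
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.bin")
        values.tofile(path)
        print(chunked_mean(path, np.float64, values.size, chunk_size=3))
```

The same pattern, a running aggregate updated chunk by chunk, carries over to any reduction that can be computed from partial results, which is what makes it a reasonable mental model for data that does not fit in memory.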
The development of Blaze contributes to the first part of the XDATA research effort: creating scalable analytics and data processing technology. Blaze will combine multi-dimensional arrays (familiar to scientists) and relational tables (familiar to business analysts) into a single structure, a global N-dimensional table. As an extension of NumPy and SciPy, the system will handle missing data by using masks or special data-types with “not available” bit-patterns and will have a pluggable data-type system allowing any kind of data to be represented and handled by the library. Continuum developers will also explore the use of random-variable data-types and adapted calculations, which allow users to propagate uncertainty throughout the system and reduce the number of passes needed over the data.
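The two approaches to missing data mentioned above, a separate mask versus a reserved "not available" bit-pattern, can be illustrated with tools that already exist in NumPy. This is only a sketch of the idea, not Blaze's design: the sample readings and the use of -1.0 as a raw "missing" marker are made up, and NaN stands in for a dedicated "not available" pattern.

```python
import numpy as np

# Made-up sensor readings; -1.0 marks a missing value in this raw data.
readings = np.array([2.0, 4.0, -1.0, 6.0])

# Approach 1: keep the data as-is and carry a separate boolean mask.
masked = np.ma.masked_array(readings, mask=(readings == -1.0))
mean_with_mask = masked.mean()  # masked entries are skipped

# Approach 2: store a special bit-pattern (here IEEE NaN) in the array itself.
with_nan = np.where(readings == -1.0, np.nan, readings)
mean_with_nan = np.nanmean(with_nan)  # NaN entries are skipped

print(mean_with_mask, mean_with_nan)
```

Both give the mean of the three valid readings. The mask costs extra storage but works for any data-type, while a reserved bit-pattern needs no extra array but requires the data-type to have a spare value, which is one reason a pluggable data-type system is useful.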
In addition to Blaze, our team will develop a suite of linear algebra, statistical (including predictive modeling), image and signal processing, optimization, machine-learning, and interpolation tools that will work on these large-scale global arrays using out-of-core algorithms. This extensive library will allow experts and novices alike to create sophisticated transformations and analyses and then publish their results easily for others to use in downstream analyses.
The second part of DARPA’s XDATA initiative is the creation of a visual user interface technology through which users can interactively explore data and gain a better understanding of unknown activities and associated relationships. To meet this need, the Continuum team is developing Bokeh, a scalable, interactive and easy-to-use visualization system for exploration of large, multidimensional data sets that incorporates an adaptive UI.
The core of our visualization system is a hybrid scene graph/data flow graph that can represent the structure of the input data as well as the data transforms at a very low level, close to the renderer. Using Python, Bokeh will combine the Stencil visualization model for primitive glyphs and simple vector expressions with the easy-to-use, flexible multidimensional mapping of Leland Wilkinson’s Grammar of Graphics. End users will be able to write simple expressions that map dimensions of the input dataset to panels and facets of the layout, while transforming values of the input dataset to aesthetic properties of the graphical geometry for each facet. Bokeh will combine these approaches to make data analysis available on a larger scale and to allow for back-channel feedback from any stage.
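As a loose illustration of the Grammar of Graphics idea described above, the sketch below maps one dimension of a small dataset to facets and maps the remaining values to aesthetic properties of a glyph through simple expressions. It is plain Python and is not Bokeh's API; `build_facets`, the `aesthetics` dictionary, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Made-up input data: one row per observation.
rows = [
    {"region": "east", "year": 2011, "sales": 10.0},
    {"region": "east", "year": 2012, "sales": 18.0},
    {"region": "west", "year": 2011, "sales": 25.0},
]


def build_facets(rows, facet_by, aesthetics):
    """Split rows into facets and turn each row into a dict of glyph properties.

    Hypothetical helper for illustration; not part of Bokeh.
    """
    facets = defaultdict(list)
    for row in rows:
        # Each aesthetic is a small expression evaluated against the row.
        glyph = {prop: expr(row) for prop, expr in aesthetics.items()}
        facets[row[facet_by]].append(glyph)
    return dict(facets)


# Simple mathematical and predicate expressions, one per aesthetic property.
aesthetics = {
    "x": lambda r: r["year"],
    "y": lambda r: r["sales"],
    "color": lambda r: "red" if r["sales"] > 15 else "gray",
}

facets = build_facets(rows, facet_by="region", aesthetics=aesthetics)
for name, glyphs in facets.items():
    print(name, glyphs)
```

Here `region` selects the panel, while `year` and `sales` drive position and color. A real system would hand the resulting glyph descriptions to a renderer; this sketch stops at the mapping step.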
We believe that by offering non-programmer data analysts the ability to customize and control all stages of the graphics pipeline via simple mathematical and predicate expressions, we will unlock tremendous innovation for novel renderings that aid in exploration of large datasets.
DARPA has assembled a large, diverse team of collaborators to tackle many facets of the scalable analytics and visualization problem for Department of Defense purposes but with broad applicability to many businesses. The entire Continuum team is excited to be working in such excellent company, and we are eager to tackle the challenging problems that DARPA will present to us. Stay tuned to the Continuum Blog for more progress reports as the XDATA effort gets underway and we start collaborating with our fellow awardees.