tl;dr Blaze aids exploration by supporting full databases and collections of datasets.
This post was composed using Blaze version 0.6.6. You can follow along with the following conda command.
conda install -c blaze blaze=0.6.6
When we encounter new data we need to explore broadly to find what exists before we can meaningfully perform analyses. The tools for this task are often overlooked.
This post outlines how Blaze explores collections and hierarchies of datasets, cases where one entity like a database or file or directory might hold many tables or arrays. We use examples from HDF5 files and SQL databases. Blaze understands how the underlying libraries work so that you don’t have to.
Motivating problem – Intuitive HDF5 File Navigation
For example, if we want to understand the contents of this set of HDF5 files encoding meteorological data then we need to navigate a hierarchy of arrays. This is common among HDF5 files.
Typically we navigate these files in Python with
HDF5 files organize datasets with an internal file system. The
h5py library accesses this internal file system through successive calls to
.keys() and item access.
This interaction between programmer and interpreter feels like a long and awkward conversation.
Blaze improves the exploration user experience by treating the entire HDF5 file as a single Blaze variable. This allows users to both explore and compute on larger collections of data in the same workflow. This isn’t specific to HDF5 but works anywhere many datasets are bundled together.
Exploring a SQL Database
For example, a SQL database can be viewed as a collection of tables. Blaze makes it easy to to access a single table in a database using a string URI specifying both the database and the table name.
This only works if we know what table we want ahead of time. The approach above assumes that the user is already familiar with their data. To resolve this problem we omit the table name and access the database as a variable instead. We use the same interface to access the entire database as we would a specific table.
db expression points to a SQLAlchemy engine. We print the engine alongside a truncated datashape showing the metadata of the tables in that database. Note that the datashape maps table names to datashapes of those tables, in this case a variable length collection of records with fields for measurements of flowers.
Blaze isn’t doing any work, SQLAlchemy is.
SQLAlchemy is a mature Python library that interacts with a wide variety of SQL databases. It provides both database reflection (as we see above) along with general querying (as we see below). Blaze provides a convenient front-end.
We seamlessly transition from exploration to computation. We query for the shortest and longest sepal per species.
Blaze doesn’t pull data into local memory, instead it generates SQLAlchemy which generates SQL which executes on the foreign database. The (much smaller) result is pulled back into memory and rendered nicely using Pandas.
A Larger Database
Improved metadata discovery on SQL databases overlaps with the excellent work done by yhat on db.py. We steal their example, the Lahman Baseball statistics database, as an example of a richer database with a greater variety of tables.
Seeing at once all the tables in the database, all the columns in those tables, and all the types of those columns provides a clear and comprehensive overview of our data. We represent this information as a datashape, a type system covers everything from numpy arrays to SQL databases and Spark collections.
We use standard Blaze expressions to navigate more deeply. Things like auto-complete work fine.
And we finish by a fun multi-table computation, joining two tables on year, team, and league and then computing the average salary by team name and year
Looks good, we compute and store to CSV file with
(Final result here: salaries.csv)
SQL isn’t unique, many systems hold collections or hierarchies of datasets. Blaze supports navigation over Mongo databases, Blaze servers, HDF5 files, or even just dictionaries of pandas DataFrames or CSV files.
None of this behavior was special-cased in Blaze. The same mechanisms that select a table from a SQL database select a column from a Pandas DataFrame.
Finish with HDF5 example
To conclude we revisit our motivating problem, HDF5 file navigation.
Recall that we previously had a long back-and-forth conversation with the Python interpreter.
H5Py with Blaze expressions
With Blaze we get a quick high-level overview and an API that is shared with countless other systems.
By default we see the data and a truncated datashape.
We ask for the datashape explicitly to see the full picture.
When we see metadata for everything in this dataset all at once the full structure becomes apparent. We have two collections of arrays, all shaped
(1643, 60); the collection in
Data Fields holds what appear to be weather measurements while the collection in
Geolocation Fields holds coordinate information.
Moreover we can quickly navigate this structure to compute relevant information.
Often the greatest challenge is finding what you already have.
Discovery and exploration are just as important as computation. By extending the Blaze’s expression system to hierarchies of datasets we create a smooth user experience from first introductions to data all the way to analytic queries and saving results.
This work was easy. The pluggable architecture of Blaze made it surprisingly simple to extend the Blaze model from tables and arrays to collections of tables and arrays. We wrote about 40 significant lines of code for each supported backend.