What’s new in Blaze

tl;dr: We discuss the latest features and user-facing API changes in Blaze.

Blaze has undergone quite a bit of development over the last year. There are some exciting new features to discuss, along with some important user-facing API changes.

Getting Blaze

The best way to install the latest development build of Blaze is through Conda, from the blaze Anaconda.org channel:

conda install blaze --channel blaze

Blaze is also pip installable, though pip is generally less reliable about bringing in the correct versions of Blaze’s dependencies.

pip install git+http://github.com/ContinuumIO/blaze --upgrade

API changes

Use odo instead of into for new code

The into library has been renamed to odo. Where you previously used into from Blaze, we suggest using odo in new code. Note that the argument order is reversed. For example:

# instead of this
#         target,       source
df = into(pd.DataFrame, '/path/to/file.csv')
 
# use this
#        source,              target
df = odo('/path/to/file.csv', pd.DataFrame)

This makes odo read more like command-line utilities such as cp, where the source comes first and the target second.

Data for interactive use

The fastest, easiest way to get up and running with Blaze is to use the Data constructor. It used to be called Table, which is now deprecated. Data accepts a variety of objects. Here are a few examples:

Database-like things

Blaze supports exploration of database-like objects. For example, if I have a dict of pandas DataFrames, I can use Blaze to explore that dict as if it were a database:

In [14]: people = pd.DataFrame({'id': [1, 2, 3],
   ....:                        'name': ['Karen', 'Joe', 'Bob']})
 
In [15]: cities = pd.DataFrame({'name': ['Karen', 'Joe', 'Bob'],
   ....:                        'city': ['Anchorage', 'New York', 'Austin']})
 
In [16]: db = {'people': people, 'cities': cities}
 
In [17]: d = Data(db)
 
In [18]: d
Out[18]:
Data:       {'cities':         city   name
0  Anchorage  Karen
1   New York    Joe
2     Austin    Bob, 'people':    id   name
0   1  Karen
1   2    Joe
2   3    Bob}
DataShape:  {
  cities: 3 * {city: ?string, name: ?string},
  people: 3 * {id: int64, name: ?string}
  }
 
In [19]: d.cities
Out[19]:
        city   name
0  Anchorage  Karen
1   New York    Joe
2     Austin    Bob
 
In [20]: d.people
Out[20]:
   id   name
0   1  Karen
1   2    Joe
2   3    Bob
 
In [21]: d.people.name
Out[21]:
    name
0  Karen
1    Joe
2    Bob
 
In [22]: type(d.people)
Out[22]: blaze.expr.expressions.Field

Notice that type(d.people) is blaze.expr.expressions.Field. The Field expression is Blaze’s way of saying “attribute of.” In this case, people is an attribute of the d “database” (really just a Python dict). Fields are also used to access individual columns of a table.
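
As a minimal sketch of this (re-creating the people table so it runs standalone; the variable names are illustrative, not from the session above):

import pandas as pd
from blaze import Data

people = pd.DataFrame({'id': [1, 2, 3], 'name': ['Karen', 'Joe', 'Bob']})
d = Data({'people': people})

table = d.people         # Field: the 'people' table of the dict "database"
column = d.people.name   # Field: the 'name' column of the people table
print(type(table), type(column))   # both are Field expressions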

Pandas DataFrames

Data also works with the well-known and well-loved pandas DataFrame:

In [24]: p = Data(people)
 
In [25]: p
Out[25]:
   id   name
0   1  Karen
1   2    Joe
2   3    Bob
 
In [26]: p.id + 1
Out[26]:
   id
0   2
1   3
2   4
 
In [27]: p.id.sum()
Out[27]: 6
 
In [28]: p.id.
p.id.apply             p.id.map               p.id.shape
p.id.count             p.id.max               p.id.sort
p.id.count_values      p.id.mean              p.id.std
p.id.distinct          p.id.min               p.id.sum
p.id.dot               p.id.ndim              p.id.truncate
p.id.dshape            p.id.nelements         p.id.utcfromtimestamp
p.id.fields            p.id.nrows             p.id.var
p.id.head              p.id.nunique           p.id.vnorm
p.id.isidentical       p.id.relabel
p.id.label             p.id.schema
 
In [28]: p.name.
p.name.apply         p.name.head          p.name.min           p.name.schema
p.name.count         p.name.isidentical   p.name.ndim          p.name.shape
p.name.count_values  p.name.label         p.name.nelements     p.name.sort
p.name.distinct      p.name.like          p.name.nrows         p.name.strlen
p.name.dshape        p.name.map           p.name.nunique
p.name.fields        p.name.max           p.name.relabel

Note the tab completion above. Blaze knows that the id field has type int64, and therefore certain operations such as strlen (string length) don’t make sense. It also knows that name is a string and correspondingly operations like mean don’t make sense. These methods are created based on the datashape of the expression, so you will only see methods and properties that are well-defined on the expression’s type.
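
A rough sketch of how to inspect this yourself (continuing with p = Data(people) from above; the exact dshape formatting may vary by version):

print(p.id.dshape)     # 3 * int64    -> numeric reductions such as mean() and sum()
print(p.name.dshape)   # 3 * ?string  -> string operations such as strlen() and like()

# Methods that make no sense for a column's type simply aren't exposed,
# which is what drives the tab completion shown above.
print('strlen' in dir(p.id))    # expected False: no string length on an integer column
print('mean' in dir(p.name))    # expected False: no mean on a string column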

SQLAlchemy URIs

Blaze leverages SQLAlchemy for talking to databases of all kinds. SQLAlchemy provides a nice system for constructing database engines, connections, and tables through URIs (uniform resource identifiers) as strings. Let’s take our people and cities tables and throw them into a SQLite database.

In [2]: from odo import odo
 
In [3]: sql_cities = odo(cities, 'sqlite:///db.db::cities')
 
In [4]: sql_people = odo(people, 'sqlite:///db.db::people')
 
In [5]: db = Data('sqlite:///db.db')
 
In [6]: db.cities
Out[6]:
        city   name
0  Anchorage  Karen
1   New York    Joe
2     Austin    Bob
 
In [7]: db.people
Out[7]:
   id   name
0   1  Karen
1   2    Joe
2   3    Bob

Again, we access tables as fields of the database. Here we use odo for data movement. Odo is a dependency of Blaze.
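
As a sketch of where this leads (the grouped query below is illustrative, not from the original session), whole expressions against a SQL-backed Data object are translated to SQLAlchemy and executed inside the database, and odo can pull the result back into pandas:

import pandas as pd
from blaze import Data, by
from odo import odo

db = Data('sqlite:///db.db')                         # the SQLite database built above
expr = by(db.people.name, n=db.people.id.count())    # grouped count, executed in SQLite
print(odo(expr, pd.DataFrame))                       # materialize the result as a DataFrame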

New features

With these API changes in mind, we can move on to some exciting new features!

HDFS integration

Blaze can now talk to the various databases in the Hadoop and Spark ecosystem.

Cloudera Impala is a database that allows users to run SQL queries on top of HDFS. It is generally much faster than comparable systems like Hive.

To fire it up, you simply pass in a SQLAlchemy URI to the Data constructor:

In [1]: from blaze import Data, by
 
In [2]: data = Data('impala://hostname/default::data')
 
In [3]: by(data.name, avg_amount=data.amount.mean())

Blaze uses the impyla library, which implements a SQLAlchemy dialect for Impala to generate and execute SQL queries against Impala.

As with Impala, we can easily start playing with a Hive database (assuming you have one running). We simply pass a URI referencing the database to the Data constructor:

In [1]: from blaze import Data, by
 
In [2]: data = Data('hive://hostname/default')
 
In [3]: by(data.name, avg_amount=data.amount.mean())

One can also use Data('hive://hostname'), since default is the name of the default database in Hive.

The Hive backend is powered by a Hive SQLAlchemy dialect, created by the nice folks at Dropbox.

Apache Spark integration

Blaze supports both Spark 1.2.0 and the new Spark DataFrame API available in Spark 1.3.0. As with Impala and Hive, we use SQLAlchemy to generate SQL in the Hive dialect and then pass that SQL to the SQLContext.sql method.

The easiest way to install Spark is from the Blaze channel.

For Spark 1.2.0 or 1.3.0:

conda install spark==1.2.0 pyhive -c blaze
# or
conda install spark==1.3.0 pyhive -c blaze

Unfortunately, URI syntax isn’t available for Spark yet, so there’s a bit more setup involved:

In [1]: from pyspark import SparkContext; from pyspark.sql import HiveContext
 
In [2]: sc = SparkContext('local[*]', 'ipython')
Spark assembly has been built with Hive, including Datanucleus jars on classpath
In [3]: sql = HiveContext(sc)
 
In [4]: from odo import odo
 
In [5]: people = pd.DataFrame({'id': [1, 2, 3], 'name': ['Karen', 'Joe', 'Bob']})
 
In [6]: odo(people, sql, name='people')
Out[6]: DataFrame[id: bigint, name: string]
 
In [7]: d = Data(sql)
 
In [8]: d
Out[8]:
Data:       <pyspark.sql.context.HiveContext object at 0x1116a1450>
DataShape:  {people: var * {id: int64, name: ?string}}
 
In [9]: d.people
Out[9]:
   id   name
0   1  Karen
1   2    Joe
2   3    Bob
 
In [10]: d.people.id.sum()
Out[10]: 6
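
Grouped reductions work against the Spark backend as well. Here is a hedged sketch, continuing the session above and mirroring the by examples from the Impala and Hive sections (the query itself is illustrative):

import pandas as pd
from blaze import by
from odo import odo

# Blaze compiles this expression into Hive-dialect SQL and runs it via the HiveContext.
grouped = by(d.people.name, total=d.people.id.sum())

# Pull the grouped result back into pandas (assuming odo's Spark support, used above).
print(odo(grouped, pd.DataFrame))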

Pandas HDFStore integration

Pandas HDFStore is a wrapper around various PyTables data structures. Blaze supports the HDFStore table format:

In [4]: from odo import odo
 
In [5]: hdfstore_people = odo(people, 'hdfstore://db.h5::people')
 
In [6]: d = Data('hdfstore://db.h5')
 
In [7]: d
Out[7]:
Data:       <class 'pandas.io.pytables.HDFStore'>
File path: db.h5
/people            frame_table  (typ->appendable,nrows->3,ncols->2,indexers->[index])
DataShape:  {people: 3 * {id: int64, name: ?string}}
 
In [8]: d.people
Out[8]:
   id   name
0   1  Karen
1   2    Joe
2   3    Bob
 
In [9]: d.people.id.mean()
Out[9]: 2.0

Blaze is under active development

Blaze is still under heavy development and will remain that way for a while. Certain parts of the API are stable, while others are not. We welcome contributions of any kind, whether they are pull requests, bug reports, or simply a comment on what you did or did not like about some aspect of the library.


