With our most recent Anaconda release, there are now two ways to use Anaconda on a Hadoop cluster, including alongside a Cloudera CDH cluster.

  1. The Anaconda parcel for Cloudera CDH, which was described in a previous blog post, allows you to easily install and use Anaconda on your Cloudera CDH cluster. Features of the Anaconda parcel include:

    • Easily bootstrap Anaconda across a CDH cluster

    • Includes 300+ of the most popular Python packages

    • Freely available as part of the Anaconda distribution

  2. Anaconda for cluster management, which is included with the Anaconda Workgroup and Anaconda Enterprise subscriptions, gives you the full power and flexibility of Anaconda on your cluster. Features of Anaconda for cluster management include:

    • Dynamically manage Python, R and other conda packages on your cluster

    • Manage multiple conda environments across cluster nodes

    • Integrate with an on-premises Anaconda Repository

    • Work alongside your bare-metal or cloud-based cluster, including existing Hadoop clusters

Anaconda for cluster management works alongside your existing enterprise Hadoop cluster (Cloudera, Hortonworks, Pivotal, MapR, etc.) to help you manage Python, R and other conda packages and environments. Anaconda for cluster management is certified for use with Cloudera CDH. To get started with Anaconda for cluster management, you can download our cluster cheat sheet.

Whether you’re using your cluster for distributed SQL queries, text and language processing, image analysis with GPUs or machine learning, Anaconda for cluster management gives your data scientists and analysts access to the Python and R packages that they know and love on your existing Hadoop cluster.

One example use case for an Anaconda-powered Hadoop cluster is GPU-accelerated, high-performance image processing, which uses Numba and SciPy along with PySpark.

Another example use case for an Anaconda-powered Hadoop cluster is interactive distributed SQL queries on text data, which uses Bokeh for interactive plotting and Blaze and pandas to explore data using Hive and Impala.

With Anaconda for cluster management, you don’t need to manually install or manage Python and R dependencies for PySpark or SparkR. We make it easy for you to import numpy, scipy, pandas, nltk, numba, bokeh and all of the other packages that you know and love, all on your own cluster!
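One quick way to confirm that those packages really are importable on the worker nodes is to run a small check from PySpark. The following is a minimal sketch rather than part of Anaconda itself: the helper names (`missing_packages`, `report_partition`, `check_cluster`) are hypothetical, and `check_cluster` assumes you already have a `SparkContext` named `sc`. Spark decides where partitions run, so the result may not cover every node.

```python
import importlib.util
import socket

# Packages named in this post; adjust the list for your own environment.
PACKAGES = ["numpy", "scipy", "pandas", "nltk", "numba", "bokeh"]


def missing_packages(names):
    """Return the names in `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report_partition(_rows):
    """Yield (hostname, missing packages) for the node running this partition."""
    yield socket.gethostname(), tuple(missing_packages(PACKAGES))


def check_cluster(sc, num_partitions=32):
    """Run the check on Spark executors.

    `sc` is assumed to be an existing SparkContext; `num_partitions` is an
    illustrative default, not a recommended value.
    """
    rdd = sc.parallelize(range(num_partitions), num_partitions)
    return sorted(set(rdd.mapPartitions(report_partition).collect()))
```

From a PySpark shell, `check_cluster(sc)` would return a list of `(hostname, missing)` pairs, where an empty tuple means every listed package was found on that host.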

In summary, the Anaconda parcel makes it easy to get started with Anaconda on a Cloudera CDH cluster, and Anaconda for cluster management gives you even more power and flexibility to manage Python/R packages and multiple conda environments on a cluster.


For more information about managing Python/R packages and environments on your cluster, or to test drive the on-premises enterprise features of Anaconda, contact [email protected]. View the full list of Anaconda integrations for Amazon EC2, Azure, Docker, Vagrant, and more.