Self-Service Open Data Science: Custom Anaconda Management Packs for Hortonworks HDP and Apache Ambari

As part of our partnership with Hortonworks, we’re excited to announce a new self-service feature of the Anaconda platform that can be used to generate custom Anaconda management packs for the Hortonworks Data Platform (HDP) and Apache Ambari. This functionality is now available in the Anaconda platform as part of the Anaconda Scale and Anaconda Repository platform components.

The ability to generate custom Anaconda management packs makes it easy for system administrators to provide data scientists and analysts with the data science libraries from Anaconda that they already know and love. The custom management packs integrate Anaconda with a Hortonworks HDP cluster, where it can be used with Hadoop, Spark, Jupyter Notebooks, and Apache Zeppelin.

Data scientists working with big data workloads want to use different versions of Anaconda, Python, R, and custom conda packages on their Hortonworks HDP clusters. Using custom management packs to manage and distribute multiple Anaconda installations across a Hortonworks HDP cluster is convenient because they work natively with Hortonworks HDP 2.3, 2.4, and 2.5+ and Ambari 2.2 and 2.4+ without the need to install additional software or services on the HDP cluster nodes.

Deploying multiple custom versions of Anaconda on a Hortonworks HDP cluster with Hadoop and Spark has never been easier! In this blog post, we’ll take a closer look at how we can create and install a custom Anaconda management pack using Anaconda Repository and Ambari, then configure and run PySpark jobs in notebooks, including Jupyter and Zeppelin.

Generating Custom Anaconda Management Packs for Hortonworks HDP

For this example, we’ve installed Anaconda Repository (which is part of the Anaconda Enterprise subscription) and created an on-premises mirror of more than 730 conda packages that are available in the Anaconda distribution and repository. We’ve also installed Hortonworks HDP 2.5.3 along with Ambari 2.4.2, Spark 1.6.2, Zeppelin 0.6.0, and Jupyter 4.3.1 on a cluster.

In Anaconda Repository, we can see the Installers feature, which can be used to generate custom Anaconda management packs for Hortonworks HDP.

The Installers page describes how we can create custom Anaconda management packs for Hortonworks HDP that are served directly by Anaconda Repository from a URL.

After selecting the Create New Installer button, we can then specify the packages that we want to include in our custom Anaconda management pack, which we’ll name anaconda_hdp.

Then, we specify the latest version of Anaconda (4.3.0) and Python 2.7. We’ve added the anaconda package to include all of the conda packages that are included by default in the Anaconda installer. Specifying the anaconda package is optional, but it’s a great way to kickstart your custom Anaconda management pack with more than 200 of the most popular Open Data Science packages, including NumPy, Pandas, SciPy, matplotlib, scikit-learn and more.

In addition to the packages available in Anaconda, other Python and R conda packages can be included in the custom management pack, such as libraries for natural language processing, visualization and data I/O, including azure, bcolz, boto3, datashader, distributed, gensim, hdfs3, holoviews, impyla, seaborn, spacy, tensorflow or xarray.

We could have also included conda packages from other channels in our on-premises installation of Anaconda Repository, including community-built packages from conda-forge or other custom-built conda packages from different users within our organization.

When you’re ready to generate the custom Anaconda management pack, press the Create Management Pack button.

After creating the custom Anaconda management pack, we’ll see a list of files that were generated, including the management pack file that can be used to install Anaconda with Hortonworks HDP and Ambari.

You can install the custom management pack directly from the HDP node running the Ambari server using a URL provided by Anaconda Repository. Alternatively, the anaconda_hdp-mpack-1.0.0.tar.gz file can be manually downloaded and transferred to the Hortonworks HDP cluster for installation.

Now we’re ready to install the newly created custom Anaconda management pack using Ambari.

Installing Custom Anaconda Management Packs Using Ambari

Now that we’ve generated a custom Anaconda management pack, we can install it on our Hortonworks HDP cluster and make it available to all of the HDP cluster users for PySpark and SparkR jobs.

The management pack can be installed into Ambari by running the following command on the machine running the Ambari server:

# ambari-server install-mpack \
    --mpack=https://54.211.228.253:8080/anaconda/installers/anaconda/download/1.0.0/anaconda-mpack-1.0.0.tar.gz
Using python /usr/bin/python
Installing management pack
Ambari Server 'install-mpack' completed successfully.

After installing a management pack, the Ambari server must be restarted:

# ambari-server restart

After the Ambari server restarts, navigate to the Ambari Cluster Dashboard UI in a browser.

Scroll down to the bottom of the list of services on the left sidebar, then click Actions > Add Service.

This will open the Add Service Wizard.

In the Add Service Wizard, you can scroll down in the list of services until you see the name of the custom Anaconda management pack that you installed. Select the custom Anaconda management pack and click the Next button.

On the Assign Slaves and Clients screen, select the Client checkbox for each HDP node that you want to install the custom Anaconda management pack onto, then click the Next button.

On the Review screen, review the proposed configuration changes, then click the Deploy button.

Over the next few minutes, the custom Anaconda management pack will be distributed and installed across the HDP cluster.

And you’re done! The custom Anaconda management pack has installed Anaconda in /opt/continuum/anaconda on each HDP node that you selected, and Anaconda is active and ready to be used by Spark or other distributed frameworks across your Hortonworks HDP cluster.

Refer to the Ambari documentation for more information about using Ambari server with management packs, and refer to the HDP documentation for more information about using and administering your Hortonworks HDP cluster with Ambari.

Using the Custom Anaconda Management Pack with spark-submit

Now that we’ve generated and installed the custom Anaconda management pack, we can use libraries from Anaconda with Spark, PySpark, SparkR or other distributed frameworks.

You can use the spark-submit command along with the PYSPARK_PYTHON environment variable to run Spark jobs that use libraries from Anaconda across the HDP cluster, for example:

$ PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
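
For reference, here is a minimal sketch of what pyspark_script.py might contain. The script name comes from the command above, but its contents are illustrative assumptions: the app name, partition counts and the NumPy-based computation are stand-ins for your own job.

import sys
import numpy as np
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("anaconda-pyspark-example")
sc = SparkContext(conf=conf)

# Confirm that executor tasks run the Anaconda interpreter installed
# by the management pack (expect /opt/continuum/anaconda/bin/python).
print(sc.parallelize(range(4), 4).map(lambda _: sys.executable).distinct().collect())

# Use NumPy from Anaconda inside a distributed computation.
total = sc.parallelize(range(1000), 4) \
          .mapPartitions(lambda it: [np.sum(np.fromiter(it, dtype="int64"))]) \
          .sum()
print("sum of 0..999 = %d" % total)

sc.stop()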


Using the Custom Anaconda Management Pack with Jupyter

To work with Spark jobs interactively on the Hortonworks HDP cluster, you can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaborative features for your data science team and integration with enterprise authentication. Refer to our previous blog post on Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster for more information about configuring Jupyter with PySpark.
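
As a quick illustration, the first cell of such a notebook might look like the following sketch. The SPARK_HOME path assumes a typical HDP 2.5 layout, and findspark is an optional helper package (installable with conda) rather than part of the management pack.

import os

# Point Spark at the Anaconda interpreter installed by the management
# pack, and at the HDP Spark client (path assumed for HDP 2.5).
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"

import findspark
findspark.init()  # adds pyspark to sys.path based on SPARK_HOME

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("jupyter-anaconda").setMaster("yarn-client")
sc = SparkContext(conf=conf)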

Using the Custom Anaconda Management Pack with Zeppelin

You can also use Anaconda with Zeppelin on your HDP cluster. In HDP 2.5 and Zeppelin 0.6, you’ll need to configure Zeppelin to point to the custom version of Anaconda installed on the HDP cluster by navigating to Zeppelin Notebook > Configs > Advanced zeppelin-env in the Ambari Cluster Dashboard UI in your browser.

Scroll down to the zeppelin_env_content property, then uncomment and set the following line to match the location of Anaconda on your HDP cluster nodes:

export PYSPARK_PYTHON="/opt/continuum/anaconda/bin/python"

Then restart the Zeppelin service when prompted.

You should also configure the zeppelin.pyspark.python property in the Zeppelin PySpark interpreter to point to Anaconda (/opt/continuum/anaconda/bin/python).

Then restart the Zeppelin interpreter when prompted. Note that the PySpark interpreter configuration process will be improved and centralized in Zeppelin in a future version.

Once you’ve configured Zeppelin to point to the location of Anaconda on your HDP cluster, data scientists can run interactive Zeppelin notebooks with Anaconda and use all of the data science libraries they know and love in Anaconda with their PySpark and SparkR jobs:
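
For example, a Zeppelin paragraph along the lines of the following sketch (hypothetical code; sc is predefined by Zeppelin’s PySpark interpreter) uses NumPy inside the executor tasks and pandas on the driver:

%pyspark
import numpy as np
import pandas as pd

# NumPy from Anaconda runs inside the executor tasks...
rdd = sc.parallelize(range(1000), 4)
partition_means = rdd.mapPartitions(lambda it: [float(np.mean(list(it)))]).collect()

# ...and pandas from Anaconda summarizes the results on the driver.
print(pd.Series(partition_means).describe())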

Get Started with Custom Anaconda Management Packs for Hortonworks in Your Enterprise

If you’re interested in generating custom Anaconda management packs for Hortonworks HDP and Ambari to empower your data science team, we can help! Get in touch through our Contact Us page for more information about this functionality and our enterprise Anaconda platform subscriptions.

If you’d like to test-drive the enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, please contact us at [email protected].
