Generate Custom Parcels for Cloudera CDH with Anaconda Enterprise 5
Jun 16, 2018
By Anaconda Team
As part of our partnership with Cloudera, we offer a freely available Anaconda Python parcel for Cloudera CDH based on the Anaconda Distribution. The Anaconda parcel has been well received by both Anaconda and Cloudera users because it makes it easier for data scientists and analysts to use the Anaconda libraries they know and love with Hadoop and Spark on Cloudera CDH.
Anaconda Enterprise 5 offers data scientists and administrators self-service generation of custom Anaconda parcels and installers. With this, users can deploy Anaconda with different versions of Python and custom conda packages that are not included in the freely available Anaconda parcel. Using parcels to manage multiple Anaconda installations across a Cloudera CDH cluster is convenient because it works natively with Cloudera Manager without the need to install additional software or services on the cluster nodes.
Deploying multiple custom versions of Python/R with Anaconda on a Cloudera CDH cluster with Hadoop and Spark has never been easier! Let’s take a closer look at how we can create and install a custom Anaconda parcel using Anaconda Enterprise and Cloudera Manager.
Generating Custom Anaconda Parcels
For this example, we’ve installed Anaconda Enterprise 5.1.2 and mirrored more than 1,000 conda packages from the Anaconda Distribution into the on-premises repository. We’ve also installed Cloudera CDH 5.14 with Spark on a separate cluster.
First, we log into our Anaconda Enterprise instance. We then navigate to the Packages section of Anaconda Enterprise and select Advanced.
This shows us the Environments tab, where we can view our existing package environments as well as generate new ones.
To generate a parcel, we can create a new environment. We click the + button to begin building our environment.
The environment page provides an overview of how you can create custom environments. To create a new environment, we first select the anaconda channel in the Select Channels section. Then, under Select Packages, we begin choosing the packages we want to install in our custom Anaconda parcel.
In this example, we’ll create an environment with Anaconda 5.1.0. You can search for packages by name or scroll directly to the package you want. To add packages, simply select the checkbox next to each package name.
For this example, we’ve chosen all of the packages in Anaconda. We also could have included conda packages from other channels in our on-premises installation of Anaconda Enterprise, including R conda packages from MRO, community-built packages from conda-forge, or other custom-built conda packages from different users within our organization.
We name the environment “anaconda_parcel”, then click Resolve and Save to generate our new environment.
Depending on the packages selected, the build process might take a few minutes. Once completed, you should see your new environment displayed on the All Environments page.
If this is the first environment you have created, you will now see a new button labeled CREATE. Select CREATE and choose the appropriate option from the drop-down (installer, parcel, or management pack). In our case, we will select Parcel. Note that this process is identical for generating custom Anaconda installers or management packs.
Once we select Create Parcel, Anaconda Enterprise generates a custom Anaconda parcel from our environment. As before, this takes a few moments. Once it completes, the All Environments page lists the parcel files that were generated, one for each of the Linux distributions supported by Cloudera Manager.
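To make the "one file per Linux distribution" idea concrete, here is a minimal Python sketch that splits a parcel file name into its parts. It assumes the usual Cloudera naming convention of name-version-distro.parcel, where the name contains no hyphen and the distribution suffix (for example el7) is the last hyphen-separated field. The file names below are hypothetical, and the names Anaconda Enterprise actually generates may differ.

```python
from typing import NamedTuple


class ParcelFile(NamedTuple):
    name: str
    version: str
    distro: str


def parse_parcel_filename(filename: str) -> ParcelFile:
    """Split '<name>-<version>-<distro>.parcel' into its three parts."""
    suffix = ".parcel"
    if not filename.endswith(suffix):
        raise ValueError(f"not a parcel file: {filename!r}")
    stem = filename[: -len(suffix)]
    # Assumption: the name has no hyphen and the distro suffix is the last
    # hyphen-separated field, so everything in between is the version.
    name, sep1, rest = stem.partition("-")
    version, sep2, distro = rest.rpartition("-")
    if not (name and sep1 and version and sep2 and distro):
        raise ValueError(f"unexpected parcel file name: {filename!r}")
    return ParcelFile(name, version, distro)


# Hypothetical file names, for illustration only.
for filename in [
    "anaconda_parcel-5.1.0-el6.parcel",
    "anaconda_parcel-5.1.0-el7.parcel",
    "anaconda_parcel-5.1.0-sles12.parcel",
]:
    parcel = parse_parcel_filename(filename)
    print(parcel.name, parcel.version, parcel.distro)
```

The distribution suffix is what lets Cloudera Manager pick the file that matches the operating system of the cluster nodes.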
Additionally, Anaconda Enterprise has already updated the manifest file used by Cloudera Manager with the new parcel information at the existing Remote Parcel Repository URL. Now, we’re ready to install the newly created custom Anaconda parcel using Cloudera Manager.
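If you want to check what the manifest advertises before switching to Cloudera Manager, a short script along these lines could help. This is only a sketch: it assumes the manifest follows the usual Cloudera parcel repository layout (a JSON document with a parcels list whose entries carry parcelName and hash fields), and the sample content, including the timestamp and hash placeholders, is made up for illustration. In practice you would read the manifest.json served at your own Remote Parcel Repository URL.

```python
import json

# Hypothetical manifest content; the real file is generated by Anaconda Enterprise.
SAMPLE_MANIFEST = """
{
  "lastUpdated": 1529107200000,
  "parcels": [
    {"parcelName": "anaconda_parcel-5.1.0-el6.parcel", "hash": "<sha1 of el6 file>"},
    {"parcelName": "anaconda_parcel-5.1.0-el7.parcel", "hash": "<sha1 of el7 file>"},
    {"parcelName": "anaconda_parcel-5.1.0-sles12.parcel", "hash": "<sha1 of sles12 file>"}
  ]
}
"""


def parcels_for_distro(manifest_text: str, distro: str) -> list:
    """Return the parcel file names in the manifest that target one distro suffix."""
    manifest = json.loads(manifest_text)
    suffix = f"-{distro}.parcel"
    return [
        entry["parcelName"]
        for entry in manifest.get("parcels", [])
        if entry.get("parcelName", "").endswith(suffix)
    ]


print(parcels_for_distro(SAMPLE_MANIFEST, "el7"))
```

An empty result for your cluster's distribution would suggest that no matching parcel file was generated for it.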
Installing Custom Anaconda Parcels Using Cloudera Manager
Now that we’ve generated a custom Anaconda parcel, we can install it on our Cloudera CDH cluster and make it available to all of the cluster users for PySpark and SparkR jobs.
From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.
Click the Configuration button on the top right of the Parcels page.
Click the plus symbol in the Remote Parcel Repository URLs section, and add the repository URL that was provided by Anaconda Repository.
Select Distribute to install the parcel across your CDH cluster.
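Once the parcel is installed across the cluster, PySpark jobs still need to be told to use its Python interpreter. The sketch below shows one common way to do that, by setting the PYSPARK_PYTHON environment variable and the matching spark.yarn.appMasterEnv setting for YARN cluster mode. The parcel directory is an assumption: we assume Cloudera Manager's default parcel location of /opt/cloudera/parcels and a directory named after the parcel, so adjust the path to match your cluster. The script name is hypothetical.

```python
import os
import shlex

# Assumption: default Cloudera Manager parcel directory; change if yours differs.
PARCEL_ROOT = "/opt/cloudera/parcels"


def parcel_python(parcel_name: str) -> str:
    """Path to the Python interpreter inside the (assumed) parcel directory."""
    return f"{PARCEL_ROOT}/{parcel_name}/bin/python"


def pyspark_env(parcel_name: str, base_env=None) -> dict:
    """Copy an environment and point PYSPARK_PYTHON at the parcel's Python."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = parcel_python(parcel_name)
    return env


def spark_submit_command(script: str, parcel_name: str) -> list:
    """Build a spark-submit command that also sets the interpreter for the YARN application master."""
    return [
        "spark-submit",
        "--conf",
        f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={parcel_python(parcel_name)}",
        script,
    ]


env = pyspark_env("anaconda_parcel")
cmd = spark_submit_command("pyspark_script.py", "anaconda_parcel")  # hypothetical script
print("PYSPARK_PYTHON=" + env["PYSPARK_PYTHON"])
print(shlex.join(cmd))
# On a cluster node with Spark installed, the job could then be launched with:
#   subprocess.run(cmd, env=env, check=True)
```

Keeping the interpreter path in one place makes it easy to switch between parcels when several custom Anaconda environments are installed side by side.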