Securely Connecting Anaconda Enterprise to a Remote Spark Cluster
With the release of Anaconda Enterprise 5, we are introducing a more robust method of connecting JupyterLab, the interactive data science notebook environment, to an Apache Spark cluster. Using Anaconda Enterprise 5, you’ll be able to connect to a remote Spark cluster via Apache Livy (incubating) by using any of the available clients, including Jupyter notebooks (by using sparkmagic) and Apache Zeppelin (by using the Livy interpreter).
Over the past couple of years, we have been excited about the efforts around the Livy project because it provides the easiest way to connect your analytics environment to a remote Spark cluster through a powerful, easy-to-use REST API, while leveraging all of your existing Hadoop data access and security.
When we learned that Livy was being introduced as an Apache project in June 2017, we were excited to integrate it as a component of Anaconda Enterprise 5 to provide our users and customers access to remote Spark clusters.
A simple diagram of the Anaconda Enterprise and Apache Livy architecture looks like this:
This architecture lets users submit jobs from any remote machine or analytics cluster (such as Anaconda Enterprise), even where a Spark client is not available, and it removes the requirement to install Jupyter and Anaconda directly on an edge node in the Spark cluster. At their core, Livy and sparkmagic work as a REST server and client that retain the interactivity and multi-language support of Spark, require no code changes to existing Spark jobs, and maintain all of Spark’s features, such as the sharing of cached RDDs and Spark DataFrames. Finally, Livy provides an easy way of creating a secure connection to a Kerberized Spark cluster.
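To make the REST server and client relationship concrete, here is a minimal sketch of the two requests a client such as sparkmagic issues on your behalf: one to start an interactive session and one to run a statement in it. The host name is a placeholder, the helper names are hypothetical, and the sketch assumes an unsecured Livy endpoint on Livy's default port 8998; a Kerberized cluster would need an authentication layer that is not shown here.

```python
import json
import urllib.request

# Hypothetical address; replace with your own Livy server.
LIVY_URL = "http://livy.example.com:8998"


def build_request(base_url, path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session_request(base_url, kind="pyspark"):
    """POST /sessions asks Livy to start an interactive Spark session."""
    return build_request(base_url, "/sessions", {"kind": kind})


def statement_request(base_url, session_id, code):
    """POST /sessions/{id}/statements runs a code snippet in that session."""
    path = f"/sessions/{session_id}/statements"
    return build_request(base_url, path, {"code": code})


def send(request):
    """Send a prepared request and decode the JSON reply (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a reachable server, `send(new_session_request(LIVY_URL))` would return a JSON description of the new session, and its `id` could then be passed to `statement_request`. In practice sparkmagic handles these calls for you, so the sketch is only meant to show that nothing more than HTTP is required on the client side.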
Connecting Anaconda Enterprise to a Remote Spark Cluster
To use the Apache Livy server in Anaconda Enterprise 5, create a new project editor session and select the template that installs the sparkmagic client for Jupyter in your project editor environment. Then create a simple configuration file that points the sparkmagic client to the Apache Livy server, and you’re done!
When you start a new project editor session, it will connect to the Spark cluster via the Livy server, which gives you full access to the Spark cluster’s data and computational resources from Anaconda Enterprise.
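As an illustration of the configuration step, the sketch below generates a minimal sparkmagic configuration file. The key names (`kernel_python_credentials`, `kernel_scala_credentials`, `url`, `auth`) and the `auth` values ("None", "Basic_Access", "Kerberos") follow sparkmagic's published example configuration as we understand it, and `~/.sparkmagic/config.json` is sparkmagic's usual location; treat all of these as assumptions to check against the sparkmagic version in your environment. The Livy URL and the helper names are hypothetical.

```python
import json
from pathlib import Path


def sparkmagic_config(livy_url, auth="None", username="", password=""):
    """Return a minimal sparkmagic config dict pointing at one Livy server.

    The key names are assumed from sparkmagic's example config file.
    """
    credentials = {
        "username": username,
        "password": password,
        "url": livy_url,
        "auth": auth,  # assumed values: "None", "Basic_Access", "Kerberos"
    }
    return {
        "kernel_python_credentials": dict(credentials),
        "kernel_scala_credentials": dict(credentials),
    }


def write_config(config, path=Path.home() / ".sparkmagic" / "config.json"):
    """Write the config as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path
```

For example, `write_config(sparkmagic_config("http://livy.example.com:8998", auth="Kerberos"))` would write a file that directs the Python and Scala kernels to a Kerberized Livy endpoint at that placeholder address.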
To help people move their data science analyses from their local machine to Anaconda Enterprise, we created conda packages for sparkmagic and its dependencies and have made them available on the community-managed conda-forge repository. This makes it easy to install the sparkmagic client locally, test your Spark jobs, and move them seamlessly into Anaconda Enterprise.
Similar to all of the foundational components in Anaconda Enterprise, the functionality to connect Anaconda Enterprise to a remote Spark cluster leverages open source technologies developed by the community and enterprise organizations. We are committed to contributing to their respective upstream projects and to the ongoing process of applying and testing these tools in our enterprise customer environments.
Using Apache Livy on Your Spark Cluster
To use this new method of connecting to a Spark cluster, Anaconda Enterprise users will need to have an Apache Livy server running on their Hadoop cluster.
We make it easy to use Apache Livy with Anaconda Enterprise by providing custom Anaconda parcels for Cloudera CDH and custom Anaconda management packs for Hortonworks and Apache Ambari. These let you install the Anaconda Distribution along with Livy for remote cluster access using the Spark cluster management tool of your choice. We also provide Livy as a conda package so it can be installed and configured manually on the Spark cluster.
In Anaconda Enterprise 5, users can still leverage existing configuration methods to connect to a Spark cluster, such as installing Anaconda Enterprise on edge nodes of a Spark cluster.