To support your organization’s data analysis operations, Anaconda Enterprise enables platform users to connect to remote Apache Hadoop or Spark clusters. Anaconda Enterprise uses Apache Livy to handle session management and communication with Apache Spark clusters, including different versions of Spark, independent clusters, and even different types of Hadoop distributions.

Livy provides all the authentication layers that Hadoop administrators are used to, including Kerberos. Anaconda Enterprise also authenticates to HDFS with Kerberos, and Kerberos impersonation must be enabled.

When Livy is installed, users can connect to a remote Spark cluster when creating projects by selecting the Spark template. They can either use the Python libraries available on the platform, or package a specific environment to target for the job. For more information, see Hadoop / Spark.

Before you begin: Verify the connection requirements. The following table outlines the supported configurations for connecting to remote Hadoop and Spark clusters with Anaconda Enterprise.

| Software | Version |
|---|---|
| Hadoop and HDFS | 2.6.0+ |
| Spark and Spark API | 1.6+ and 2.X |
| Sparkmagic | 0.12.7 |
| Livy | 0.5 |
| Hive | 1.1.0+ |
| Impala | 2.11+ |
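
As a rough illustration, the versions in the table above could be checked in a small script before attempting a connection. This is only a sketch and not part of Anaconda Enterprise: the function names, the component keys, and the interpretation of each row (for example, treating "1.6+ and 2.X" as any Spark release from 1.6 up to, but not including, 3.0) are assumptions. It also assumes plain dotted numeric version strings, so vendor suffixes such as `-cdh...` would need to be stripped first.

```python
def parse_version(text):
    """Turn '2.6' or '2.6.0' into a 3-part tuple such as (2, 6, 0).

    Assumes plain dotted integers; vendor suffixes are not handled.
    """
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions with zeros
    return tuple(parts)


# One reading of the table above (hypothetical keys); adjust as needed.
MINIMUMS = {"hadoop": (2, 6, 0), "hive": (1, 1, 0), "impala": (2, 11, 0)}
PINNED = {"sparkmagic": (0, 12, 7), "livy": (0, 5)}


def is_supported(component, version_text):
    """Return True if the version matches this sketch's reading of the table."""
    version = parse_version(version_text)
    if component in MINIMUMS:
        return version >= MINIMUMS[component]
    if component in PINNED:
        pinned = PINNED[component]
        # Compare only as many parts as the table pins (e.g. Livy 0.5.x).
        return version[:len(pinned)] == pinned
    if component == "spark":
        # "1.6+ and 2.X" read as 1.6 <= version < 3.0.
        return (1, 6, 0) <= version < (3, 0, 0)
    raise KeyError(f"unknown component: {component}")


if __name__ == "__main__":
    for name, ver in [("hadoop", "2.6.0"), ("spark", "2.4"), ("livy", "0.5.0")]:
        print(name, ver, is_supported(name, ver))
```

Keeping the rules in two small dictionaries makes it easy to update the check if the supported versions change.
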

The Hive metastore may be Postgres or MySQL.

The Livy server must run on an “edge node” or client in the Hadoop/Spark cluster. Verify that the `spark-submit` and/or the Spark REPL commands work on this machine.

This example is specific to a Red Hat-based Linux distribution, with a Hadoop installation based on Cloudera CDH. To use other systems, you’ll need to look up the corresponding commands and locations.

- Locate the directory that contains Anaconda Livy. Typically this will be `anaconda-enterprise-X.X.X-X.X/installer/anaconda-livy-0.5.0`, where `X.X.X-X.X` corresponds to the Anaconda Enterprise version.
- Copy the entire directory that contains Anaconda Livy to an edge node on the Spark/Hadoop cluster.