Overview

Working with your favorite Python packages alongside distributed PySpark jobs on a Hadoop cluster can be difficult: the manual setup and configuration are tedious, and the pain only grows as the number of nodes in your cluster increases.

Anaconda makes it easy to manage packages (for Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, for workloads such as data processing, machine learning, image processing and natural language processing.

Conclusion

In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which let us work interactively with Anaconda and the Hadoop cluster: we could run distributed processing with PySpark, reduce the results to a size that fits on a single machine, and explore them, all in the same notebook environment. The complete notebook for this example with Anaconda, PySpark and NLTK can be viewed on Anaconda Cloud.
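As a rough illustration of the kind of workflow described above, here is a minimal PySpark sketch. The HDFS path and application name are hypothetical, and it assumes that NLTK (and its punkt tokenizer data) is installed in the Anaconda environment on every worker node:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("distributed-nlp-sketch")
sc = SparkContext(conf=conf)

# Read text files from HDFS into an RDD of lines (hypothetical path)
lines = sc.textFile("hdfs:///user/example/corpus/*.txt")

def tokenize(line):
    # Import inside the function so it runs on the workers, where NLTK
    # must be available in the Anaconda environment
    import nltk
    return nltk.word_tokenize(line)

# Tokenize in parallel across the cluster, then reduce to word counts
word_counts = (lines.flatMap(tokenize)
                    .map(lambda word: (word.lower(), 1))
                    .reduceByKey(lambda a, b: a + b))

# Bring back only the top results, small enough to inspect on a single
# machine in the notebook
print(word_counts.takeOrdered(25, key=lambda kv: -kv[1]))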

You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:

$ conda install anaconda-client
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
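Once anaconda-cluster is installed, packages can be pushed to every node of a cluster you manage. For example, a sketch assuming a cluster has already been created with the acluster CLI (the package names here are only illustrative):

# Install conda packages across all nodes of the managed cluster
$ acluster conda install nltk numpy pandas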

If you’d like to test-drive the enterprise features of Anaconda with additional nodes on a bare-metal, on-premises or cloud-based cluster, get in touch with us at [email protected]. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.

If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.


About the Authors

Kristopher Overholt

Product Manager

Kristopher works with DevOps, product engineering and platform infrastructure for the Anaconda Enterprise data science platform. His areas of expertise include distributed systems, data engineering and computational science workflows.

