Using Anaconda and H2O to Supercharge your Machine Learning and Predictive Analytics
Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love with the tools you use, including Amazon Web Services, Docker, and Cloudera CDH. Today we’re excited to announce our new partnership with H2O and the availability of H2O machine learning packages for Anaconda on Windows, Mac and Linux.
H2O is an open source, in-memory, distributed, fast and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data. Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. H2O is used by over 60,000 data scientists and more than 7,000 organizations around the world.
H2O includes a wide range of data science algorithms and estimators for supervised and unsupervised machine learning such as generalized linear modeling, gradient boosting, deep learning, random forest, naive bayes, ensemble learning, generalized low rank models, k-means clustering, principal component analysis, and others. H2O provides interfaces for Python, R, Java and Scala, and can be run in standalone mode or on a Hadoop/Spark cluster via Sparkling Water or sparklyr.
In this blog post, we’ll demonstrate how you can install and use H2O with Python alongside the 720+ packages in Anaconda to perform interactive machine learning workflows with notebooks and visualizations as part of Anaconda’s Open Data Science platform.
Installing and Using H2O with Anaconda
You can install H2O with Anaconda on Windows, Mac or Linux. The following conda command will install the H2O core library and engine, the H2O Python client library and the required Java dependencies (OpenJDK):
$ conda install h2o h2o-py
That’s it! After installing H2O with Anaconda, you’re now ready to get started with a wide range of machine learning algorithms and data science modeling techniques.
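As a quick sanity check after installation, the short sketch below reports whether a module can be found in the active environment without importing it. It uses only the Python standard library; the helper name installed is made up for this illustration, and h2o is the module name used by the import statements later in this post.

```python
import importlib.util


def installed(*names):
    """Return {name: bool} saying whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# "h2o" is the module imported in the examples that follow
for name, ok in installed("h2o").items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

If the module is reported as missing, the notebook is probably running in a different conda environment from the one where the packages were installed.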
In the following sections, we’ll demonstrate how to use H2O with Anaconda based on examples from the H2O documentation, including a k-means clustering example, a deep learning example and a gradient boosting example.
K-means Clustering with Anaconda and H2O
K-means clustering is an unsupervised machine learning technique that groups the values in a data set into k clusters, assigning each value to the cluster whose center is nearest.
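To make the idea concrete, here is a minimal sketch of the basic iteration behind k-means (often called Lloyd's algorithm) in plain Python. It is an illustration only, not H2O's implementation: the function name, the toy points and the fixed iteration count are all assumptions made for this example.

```python
import random


def kmeans(points, k, iterations=20, seed=2):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k random rows as initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Two well-separated toy groups (made-up values)
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three the other; which group gets label 0 depends on the random initialization.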
In this example, we’ll use the k-means clustering algorithm in H2O on the Iris flower data set to classify the measurements into clusters.
First, we’ll start a Jupyter notebook server where we can run the H2O machine learning examples in an interactive notebook environment with access to all of the libraries from Anaconda.
$ jupyter notebook
In the notebook, we can import the H2O client library and initialize an H2O cluster, which will be started on our local machine:
>>> import h2o
>>> h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_102"; OpenJDK Runtime Environment (Zulu 188.8.131.52-macosx) (build 1.8.0_102-b14); OpenJDK 64-Bit Server VM (Zulu 184.108.40.206-macosx) (build 25.102-b14, mixed mode)
  Starting server from /Users/koverholt/anaconda3/h2o_jar/h2o.jar
  Ice root: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims
  JVM stdout: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims/h2o_koverholt_started_from_python.out
  JVM stderr: /var/folders/5b/1vh3qn2x7_s7mj88zc3nms0m0000gp/T/tmpj9mo8ims/h2o_koverholt_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
After we’ve started the H2O cluster, we can download the Iris data set from the H2O repository on GitHub and view a summary of the data:
>>> iris = h2o.import_file(path="https://github.com/h2oai/h2o-3/raw/master/h2o-r/h2o-package/inst/extdata/iris_wheader.csv")
>>> iris.describe()
Now that we’ve loaded the data set, we can import and run the k-means estimator from H2O:
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>> results = [H2OKMeansEstimator(k=clusters, init="Random", seed=2, standardize=True) for clusters in range(2,13)]
>>> for estimator in results:
...     estimator.train(x=iris.col_names[0:-1], training_frame = iris)
kmeans Model Build progress: |████████████████████████████████████████████| 100%
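The standardize=True option above asks H2O to standardize the columns before computing distances, so that a feature measured on a larger scale does not dominate the clustering. The sketch below shows the usual z-score form of that idea in plain Python. The helper name and sample values are illustrative assumptions, and H2O's exact convention (for example, sample versus population standard deviation) may differ.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Rescale each column (a list of floats) to mean 0 and standard deviation 1."""
    scaled = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A constant column has sigma == 0; map it to zeros instead of dividing by zero
        scaled.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return scaled


# Made-up measurements on two different scales
sepal_len = [5.1, 4.9, 6.3, 5.8]
petal_wid = [0.2, 0.2, 1.8, 1.2]
z_len, z_wid = standardize([sepal_len, petal_wid])
print(z_len)
print(z_wid)
```

After rescaling, both columns contribute on a comparable scale to the squared distances that k-means minimizes.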
We can then choose a number of clusters and assign each row of the data set to a cluster. Because the estimators were trained for k = 2 through 12, the model for a given k sits at index k - 2 in the results list:
>>> clusters = 4
>>> predicted = results[clusters-2].predict(iris)
>>> iris["Predicted"] = predicted["predict"].asfactor()
kmeans prediction progress: |█████████████████████████████████████████████| 100%
Once we’ve generated the predictions, we can visualize the classified data and clusters. Because we have access to all of the libraries in Anaconda in the same notebook as H2O, we can use matplotlib and seaborn to visualize the results:
>>> import seaborn as sns
>>> %matplotlib inline
>>> sns.set()
>>> sns.pairplot(iris.as_data_frame(True), vars=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], hue="Predicted");
Deep Learning with Anaconda and H2O
We can also perform deep learning with H2O and Anaconda. Deep learning is a class of machine learning algorithms that incorporate neural networks and can be used to perform regression and classification tasks on a data set.
In this example, we’ll use the supervised deep learning algorithm in H2O on the Prostate Cancer data set stored on Amazon S3.
We’ll use the same H2O cluster that we created using h2o.init() in the previous example. First, we’ll download the Prostate Cancer data set from a publicly available Amazon S3 bucket and view a summary of the data:
>>> prostate = h2o.import_file(path="s3://h2o-public-test-data/smalldata/logreg/prostate.csv")
>>> prostate.describe()
Rows: 380 Cols: 9
We can then import and run the deep learning estimator from H2O on the Prostate Cancer data:
>>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> model = H2ODeepLearningEstimator(activation = "Tanh", hidden = [10, 10, 10], epochs = 10000)
>>> model.train(x = list(set(prostate.columns) - set(["ID","CAPSULE"])), y ="CAPSULE", training_frame = prostate)
>>> model.show()
deeplearning Model Build progress: |██████████████████████████████████████| 100%
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1483417629507_19
Status of Neuron Layers: predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 322 weights/biases, 8.5 KB, 3,800,000 training samples, mini-batch size 1
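The figures in the model summary can be reproduced with a little arithmetic. The sketch below counts the weights and biases of a fully connected network with the layer sizes used above. It assumes each of the 7 predictor columns (the 9 columns minus ID and CAPSULE) feeds a single input unit and that the output layer has one unit per class; the function name is made up for this illustration.

```python
def count_parameters(n_inputs, hidden, n_outputs):
    """Count weights and biases in a fully connected feed-forward network."""
    sizes = [n_inputs, *hidden, n_outputs]
    # Each layer has (inputs x units) weights plus one bias per unit
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# 7 predictors, three hidden layers of 10 units, 2 output classes
params = count_parameters(n_inputs=7, hidden=[10, 10, 10], n_outputs=2)
samples = 380 * 10000  # rows in the data set x epochs
print(params, samples)  # prints: 322 3800000
```

Both numbers match the summary: 322 weights/biases and 3,800,000 training samples.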
After we’ve trained the deep learning model, we can generate predictions and view the results, including the model scoring history and performance metrics:
>>> predictions = model.predict(prostate)
>>> predictions.show()
deeplearning prediction progress: |███████████████████████████████████████| 100%
Gradient Boosting with H2O and Anaconda
We can also perform gradient boosting with H2O and Anaconda. Gradient boosting is an ensemble machine learning technique (commonly used in conjunction with decision trees) that can perform regression and classification tasks on a data set.
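The core loop of gradient boosting can be sketched in a few lines of plain Python. This illustration is not H2O's algorithm: it uses squared-error loss on a single feature with one-split trees (stumps) to keep the code short, whereas the H2O example in this section uses a bernoulli distribution for classification. The function names and toy data are assumptions; the default ntrees and learn_rate values mirror the ones passed to H2O in this section.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    # Candidate thresholds; the largest value is skipped so the right side is never empty
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, ntrees=50, learn_rate=0.1):
    """Boosting for squared-error loss: each new stump fits the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(ntrees):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learn_rate * sum(s(x) for s in stumps)


# Made-up data with a jump between x = 3 and x = 4
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 0.9, 1.1, 3.8, 4.1, 4.0]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each stump is a weak model on its own; the ensemble improves because every round corrects a fraction (learn_rate) of the error left by the previous rounds.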
In this example, we’ll use the supervised gradient boosting algorithm in H2O on a cleaned version of the Prostate Cancer data from the previous deep learning example.
First, we’ll import and run the gradient boosting estimator from H2O on the Prostate Cancer data. The code assumes the cleaned data has already been loaded into an H2OFrame named train, with its CAPSULE column converted to a factor as in the deep learning example; that loading step is not shown here:
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> my_gbm = H2OGradientBoostingEstimator(distribution = "bernoulli", ntrees=50, learn_rate=0.1)
>>> my_gbm.train(x=list(range(1,train.ncol)), y="CAPSULE", training_frame=train, validation_frame=train)
gbm Model Build progress: |███████████████████████████████████████████████| 100%
After we’ve trained the gradient boosting model, we can view the resulting model performance metrics:
>>> my_gbm_metrics = my_gbm.model_performance(train)
>>> my_gbm_metrics.show()
ModelMetricsBinomial: gbm
** Reported on test data. **
MSE: 0.07338612348053128
RMSE: 0.2708987328883826
LogLoss: 0.26757238912319825
Mean Per-Class Error: 0.07431401341740806
AUC: 0.9801618150931445
Gini: 0.960323630186289
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4772353333869793:
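For readers who want to see how these metrics relate to one another, the sketch below computes them from scratch for a handful of made-up labels and predicted probabilities. It uses the textbook definitions (AUC as the fraction of correctly ranked positive/negative pairs), which may differ in detail from how H2O computes them; the function name and sample values are assumptions for this illustration.

```python
import math


def binomial_metrics(labels, probs):
    """Compute MSE, RMSE, LogLoss, AUC and Gini for 0/1 labels and predicted P(label == 1)."""
    n = len(labels)
    mse = sum((y - p) ** 2 for y, p in zip(labels, probs)) / n
    logloss = -sum(math.log(p if y == 1 else 1 - p) for y, p in zip(labels, probs)) / n
    # AUC: fraction of (positive, negative) pairs ranked correctly; ties count half
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    auc = wins / (len(pos) * len(neg))
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "LogLoss": logloss,
        "AUC": auc,
        "Gini": 2 * auc - 1,
    }


# Made-up labels and predicted probabilities
labels = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
m = binomial_metrics(labels, probs)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The same relationships hold in the H2O output above: the reported RMSE is the square root of the reported MSE, and the reported Gini equals 2 × AUC − 1.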
Additional Resources for Machine Learning with Anaconda and H2O
Refer to the H2O documentation for more information about the full set of machine learning algorithms, libraries and examples that are available in H2O, including generalized linear modeling, random forest, naive bayes, ensemble learning, generalized low rank models, principal component analysis and others.
Interested in using Anaconda and H2O in your enterprise organization for machine learning, model deployment workflows and scalable analysis with Hadoop and Spark? Get in touch with us if you’d like to learn more about how Anaconda can empower your enterprise with Open Data Science, including an on-premises package repository, collaborative notebooks, cluster deployments and custom consulting/training solutions.
The complete notebooks for the k-means clustering, deep learning, and gradient boosting examples shown in this blog post can be viewed and downloaded from Anaconda Cloud: