In last week’s post on the 1000 Genomes Project, we used the Amazon Web Services (AWS) Elastic Compute Cloud (EC2), an hourly pay-as-you-go computing service, for our MapReduce computation. EC2 instances come in a variety of sizes with increasing computing power, memory size, ephemeral storage, and cost.

What this means is that you and I can create a cluster of any size with minimal overhead and pay only for the computing we need! It’s great. I’ve launched clusters from my home, my office, a truck stop, and even while flying at 35,000 feet.

Anaconda EC2

Each EC2 instance is launched from an AMI (Amazon Machine Image). AMIs are pre-configured operating system images which often come with packages preinstalled. For instance, if we wanted a LAMP stack running the latest 64-bit Ubuntu, an AMI for that is available. At Continuum, we released an AMI with Anaconda pre-installed. AMIs are labeled using a unique identifier; our custom AMI is labeled ami-39298750. You can easily launch an Anaconda AMI using the following steps:

  • Log in to AWS.
  • Go to EC2 Dashboard.
  • Click Launch Instance.
  • Select Classic Wizard, then click Continue.
  • Select Community AMIs tab.
  • Search for “ami-39298750”, then click Select.
  • Continue with remaining AMI launch steps.
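If you would rather script the console steps above, something along these lines might work. It is only a sketch: boto3 is not mentioned in this post, and the region, the instance type, and the helper names (build_launch_params, launch) are assumptions made for illustration. AMI IDs are region-specific, so the ID only resolves in the region where the AMI was published.

```python
# Hypothetical sketch: launch one instance of the Anaconda AMI with boto3.
AMI_ID = "ami-39298750"  # the Anaconda AMI named in this post


def build_launch_params(instance_type="m1.small", key_name=None):
    """Return keyword arguments for the EC2 RunInstances call."""
    params = {
        "ImageId": AMI_ID,
        "InstanceType": instance_type,  # assumption: choose the size you need
        "MinCount": 1,
        "MaxCount": 1,
    }
    if key_name is not None:
        params["KeyName"] = key_name  # name of an existing EC2 key pair
    return params


def launch(region="us-east-1", **kwargs):
    """Start the instance; needs boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    ec2 = boto3.client("ec2", region_name=region)  # region is an assumption
    response = ec2.run_instances(**build_launch_params(**kwargs))
    return response["Instances"][0]["InstanceId"]
```

Calling launch(key_name="my-key") would start a single instance and return its ID. Remember that the instance is billed until you terminate it.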

StarCluster

Launching the Anaconda AMI by hand is a great way to get acquainted with Anaconda and Disco. When you need to launch a full cluster of Anaconda nodes, however, using the AWS dashboard requires a lot of manual configuration to properly set up the Disco cluster.

StarCluster is an open source cluster management tool that helps you spin up machines easily and can configure the Disco MapReduce framework for you.

Collect your AWS information and download Continuum’s Anaconda EC2 Config and Plugin files.

Note: Do not put quotes around the credentials when you enter them in the config file.

SSH Key Generation

SSH for EC2 instances uses SSH keys. Put simply, two keys are generated: one public and one private. The public key is delivered to EC2 and the private key is stored on your machine. No password is required when using SSH because authentication is based on the key.

Luckily, StarCluster makes generating and moving keys trivial. The Anaconda StarCluster config is specifically set up for a key named anacondakey. You are welcome to change the name, but if you are new to StarCluster and SSH keys, please issue the following command exactly:

starcluster createkey anacondakey -o ~/.ssh/anacondakey.rsa

The above command creates a private key, anacondakey.rsa, and the public key is automatically sent to EC2.
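As a quick sanity check after generating the key, a small script like the one below could confirm that the private key file exists and is not readable by other users, since SSH clients typically refuse keys with loose permissions. This is a sketch, not part of StarCluster; check_private_key is a hypothetical helper name and the default path is simply the one used in the command above.

```python
import os
import stat


def check_private_key(path="~/.ssh/anacondakey.rsa"):
    """Return a list of problems found with the private key file."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return ["missing: " + path]
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Flag any permission bit granted to group or others.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("permissions too open: " + oct(mode))
    return problems


if __name__ == "__main__":
    for problem in check_private_key():
        print(problem)
```

An empty result means the file is in place with owner-only permissions. If the permissions are reported as too open, chmod 600 on the key file is the usual fix.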

File Placement

We also need to move two files which we downloaded from GitHub. Move config.anaconda to ~/.starcluster/ and rename it config. Next, move the anaconda_plugin.py file to StarCluster’s plugin directory: ~/.starcluster/plugins/
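The file placement described above can also be scripted. The sketch below copies the files rather than moving them, and backs up any existing config first; both choices are mine, not something the post prescribes. The function name place_files and its arguments are hypothetical.

```python
import shutil
from pathlib import Path


def place_files(download_dir, home=None):
    """Copy the downloaded config and plugin into ~/.starcluster."""
    home = Path(home) if home is not None else Path.home()
    download_dir = Path(download_dir)
    starcluster_dir = home / ".starcluster"
    plugin_dir = starcluster_dir / "plugins"
    plugin_dir.mkdir(parents=True, exist_ok=True)

    config_target = starcluster_dir / "config"
    if config_target.exists():
        # Keep any existing config rather than silently overwriting it.
        shutil.copy2(config_target, starcluster_dir / "config.bak")
    # config.anaconda is renamed to config as part of the copy.
    shutil.copy2(download_dir / "config.anaconda", config_target)
    plugin_target = plugin_dir / "anaconda_plugin.py"
    shutil.copy2(download_dir / "anaconda_plugin.py", plugin_target)
    return config_target, plugin_target
```

For example, place_files("~/Downloads") would not expand the tilde on its own, so pass an absolute path or wrap it in Path(...).expanduser().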

AWS Info

Lastly, edit the recently renamed config file and insert your AWS Credentials:

  • AWS_ACCESS_KEY_ID = XXXXXXXXXXXXXX
  • AWS_SECRET_ACCESS_KEY = XXXXXXXXXXXXXXX
  • AWS_USER_ID = my-id
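Because stray quotes around a credential are an easy mistake to make, a short check like the following might help before you launch anything. It is a sketch under two assumptions: that the three settings live in a section named [aws info], which is where StarCluster configs normally keep them, and that Python's configparser reads the rest of your config without complaint. check_aws_info is a hypothetical helper name.

```python
import configparser

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_USER_ID")


def check_aws_info(config_text, section="aws info"):
    """Return a list of problems with the credentials in a config string."""
    # interpolation=None so a '%' inside a secret key is read literally.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    if not parser.has_section(section):
        return ["missing section [%s]" % section]
    problems = []
    for name in REQUIRED:
        value = parser.get(section, name, fallback="").strip()
        if not value:
            problems.append(name + " is not set")
        elif value[0] in "\"'" or value[-1] in "\"'":
            problems.append(name + " should not be quoted")
    return problems
```

To check your real file, read it with open(os.path.expanduser("~/.starcluster/config")).read() and pass the text in. An empty list means all three values are present and unquoted.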

StartUp Command

After renaming the config file and inserting your AWS credentials, you are ready to launch Anaconda clusters:

starcluster start -s S anaconda-cluster

where “S” is the number of nodes in your cluster.
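If you launch clusters from scripts, the start command can be wrapped in a few lines of Python. This is a minimal sketch that simply shells out to the starcluster command shown above; the function names start_command and start_cluster are hypothetical, and the starcluster executable is assumed to be on your PATH.

```python
import subprocess


def start_command(num_nodes, cluster_tag="anaconda-cluster"):
    """Build the `starcluster start` argument list for num_nodes nodes."""
    if int(num_nodes) < 1:
        raise ValueError("a cluster needs at least one node")
    return ["starcluster", "start", "-s", str(int(num_nodes)), cluster_tag]


def start_cluster(num_nodes, cluster_tag="anaconda-cluster"):
    """Run the command; raises CalledProcessError if starcluster fails."""
    subprocess.run(start_command(num_nodes, cluster_tag), check=True)
```

Keeping the command builder separate from the call that runs it makes it easy to print or log the exact command before any instances are started.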

Now that you’ve launched your cluster, you can SSH into the master or any of the nodes:

starcluster sshmaster anaconda-cluster
starcluster sshnode anaconda-cluster node001

Launching Anaconda with our AMI automatically configures your Disco cluster as well. You can see the current Disco status by navigating to your EC2 URL (provided during startup) on port 8989.

To start 4 nodes we would use:

starcluster start -s 4 anaconda-cluster

and then navigate to the Disco status page on port 8989.
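Instead of opening a browser, you could poll the status page from a script. The sketch below only checks that something answers on port 8989; it assumes the page is served at the root path and that your security group allows access to that port. The helper names and the host value you pass in are placeholders.

```python
from urllib.request import urlopen

DISCO_PORT = 8989  # the port mentioned in this post


def disco_status_url(master_host, port=DISCO_PORT):
    """Build the URL of the Disco status page for the given master host."""
    return "http://%s:%d/" % (master_host, port)


def disco_is_up(master_host, timeout=5):
    """Return True if the status page answers with HTTP 200."""
    try:
        with urlopen(disco_status_url(master_host), timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False
```

Pass in the EC2 hostname that StarCluster prints during startup, for example disco_is_up("your-ec2-hostname").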

Conclusion

Spinning up an Anaconda cluster is almost trivial with the help of StarCluster. We are going to rely on EC2 for a number of the coming tutorials, so get your Anaconda StarCluster set up now. Download our examples to try out your new cluster!

There are many EC2 configuration options not outlined in this tutorial and I encourage you to look at the StarCluster Manual.

Anaconda StarCluster Plugin and Config


About the Author

Ben Zaitlen

Data Scientist

Ben Zaitlen has been with the Anaconda Global Inc. team for over 5 years.
