tl;dr: We discuss how data scientists working with Python, R, or both can benefit from using conda in their workflow.

Conda is a package and environment manager that can help data scientists manage their project dependencies and easily share environments with their peers. Conda works on Linux, OS X, and Windows, and it is language agnostic, which allows us to use it with any programming language, or even with multi-language projects.

This post explores how to use conda in a multi-language data science project. We’ll use a project named topik, which combines Python and R libraries, as an example.

Reproducibility, Data Science, and Programming Languages

Data scientists need to easily reproduce their analyses and share their findings with others. The field is evolving rapidly, and new programming languages, tools, and frameworks keep becoming available to solve its challenges. Although Python and R lead the rankings of data science programming languages, others like Julia, Lua, Scala, or Clojure are also entering the field. One of the struggles many data scientists face is managing reproducibility in this complex multi-language scenario.

Conda simplifies the management of projects by allowing data scientists to:

  • Manage project dependencies, including programming languages and libraries
  • Isolate development and production environments with channels
  • Share environments with a minimal footprint
  • Support multiple versions of languages, storage systems, and packages
  • Provide a common interface for building, installing, and sharing packages

Introduction to conda

Conda is a cross-platform tool for managing packages and environments. Packages are libraries, tools, scripts, and configuration files that provide functionality: anything that you need to share, from R to pandas, MongoDB, or Spark, to different versions of Python. Environments are collections of those packages that interact with each other to provide a context in which to execute your code. Both packages and environments can be packaged up and shared across all three major operating systems with conda.
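
As a minimal sketch of these two concepts in practice (the environment name and package choices here are illustrative):

# Create an isolated environment named "analysis" with its own Python
conda create -n analysis python=2.7 pandas

# Activate it (Linux, OS X; on Windows: activate analysis)
source activate analysis

# Install an additional package into the active environment
conda install bokeh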

Conda vs Anaconda vs Miniconda

Anaconda is a free, enterprise-ready Python distribution that includes 150 of the most popular Python packages for science, math, engineering, and data analysis, already installed. Anaconda comes with conda to manage libraries and environments, and an additional 250 packages can be installed with the conda install command.

Miniconda is a lighter alternative to Anaconda that comes with just Python and conda. It is an ideal option for users who want to start with a minimal installation and manage their own set of packages.

Conda vs pip

Python programmers are probably familiar with pip, which downloads packages from PyPI and manages their requirements. Although both conda and pip are package managers, they are very different:

  • Pip is specific to Python packages, while conda is language agnostic, which means we can use conda to manage packages from any language
  • Pip typically compiles packages from source, while conda installs prebuilt binaries, removing the burden of compilation
  • Conda creates language-agnostic environments natively, whereas pip relies on virtualenv to manage Python-only environments (see the sketch after this list)
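
As an illustration of that last point, here is a minimal sketch (the environment names are illustrative, and the second half assumes virtualenv is installed):

# With conda: a single environment holding both Python and R
conda create -n mixed python=2.7 r

# With pip/virtualenv: Python-only environments
virtualenv pyonly
source pyonly/bin/activate
pip install pandas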

Though it is recommended to always use conda packages, conda also includes pip, so you don’t have to choose between the two. For example, to install a Python package that is not available as a conda package but is available through pip, just run:

conda install pip
pip install gensim


Installing packages

As we mentioned earlier, conda allows us to install any binary package, from a specific version of Python to a data store, a Python library, or even an R library. For example:

conda install python=2.7 mongodb gensim bokeh=0.8 r r-data.table

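To double-check what actually ended up in the environment (the package name below just reuses the example above):

# List every package in the active environment
conda list

# Or filter the listing to a single package
conda list bokeh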

To configure your conda installation to download from specific channels, like the R channel, you’ll need to add them to your configuration:

conda config --add channels r
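
You can verify the resulting configuration, which conda stores in the ~/.condarc file:

conda config --get channels
cat ~/.condarc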


A reproducible multi-language data science project

Topik is a topic modeling project that uses gensim, a Python library for topic modeling, and ldavis, an R package for visualizing the results.

Goals and summary steps

In this post, we’ll learn how to use conda env in our workflow so that we can share our code and all of its dependencies with our team members, who just need to run:

git clone https://github.com/ContinuumIO/topik.git
cd topik
conda env create
source activate topik


Note: Windows users should replace the last command with:

activate topik


The three steps that make this possible are:

  • Build and share conda packages of our dependencies
  • Upload those packages to a repository
  • Create an environment

Building conda Python packages with conda skeleton

To build a conda package we need a conda recipe, which consists of a meta.yaml file, a build.sh (Linux, OS X), and a bld.bat (Windows). If the package is already available on PyPI, we can use conda skeleton to help us create them. For example, for the required elasticsearch Python package, run:

conda skeleton pypi elasticsearch


This generates the meta.yaml, build.sh, and bld.bat files under a folder named elasticsearch.
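
The recipe is mostly declarative metadata. Here is a trimmed sketch of what the generated meta.yaml typically looks like (the URL, checksum, and dependency details are illustrative, not the exact generated values):

package:
  name: elasticsearch
  version: "1.5.0"

source:
  fn: elasticsearch-1.5.0.tar.gz
  url: https://pypi.python.org/packages/source/e/elasticsearch/elasticsearch-1.5.0.tar.gz
  md5: # checksum of the tarball, filled in by conda skeleton

requirements:
  build:
    - python
    - setuptools
  run:
    - python
    - urllib3

about:
  home: https://github.com/elastic/elasticsearch-py
  license: Apache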

Once those files exist, we can build the conda package by running:

conda build elasticsearch/


We then have a conda package for the PyPI library elasticsearch for our platform, for example, on osx-64: ~/anaconda/conda-bld/osx-64/elasticsearch-1.5.0-py27_0.tar.bz2.

We can either share the file and install it locally:

conda install elasticsearch --use-local


or upload the file to Binstar (now Anaconda.org), a repository for conda packages, as described in the section “Sharing packages”.

We can also easily create packages for all other platforms by running:

conda convert -p all ~/anaconda/conda-bld/osx-64/elasticsearch-1.5.0-py27_0.tar.bz2 -o outputdir/

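If we also want to publish the converted packages, a simple shell loop over the output directory works (this assumes the binstar client introduced in the “Sharing packages” section below):

# Upload the package built for every platform
for pkg in outputdir/*/elasticsearch-1.5.0-py27_0.tar.bz2; do
    binstar upload "$pkg"
done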

For more information on building conda packages, visit the conda build reference guide and tutorials. There are also many examples available in the conda-recipes repository.

Building a conda R package

We can also use conda skeleton to build conda packages for R packages that are already available on CRAN. In our topik project example, we can create a conda R package for the ldavis CRAN package with:

conda skeleton cran ldavis


This again generates the three recipe files. Note that even though the R package is called ldavis, the conda package is called r-ldavis, to avoid name conflicts between packages from different languages with similar names.

We then run:

conda build r-ldavis/


and get the package: ~/anaconda/conda-bld/osx-64/r-ldavis-0.2-0.tar.bz2.

Sharing packages

Once we have those conda packages, we need to make them available to users. Some package management systems, like PyPI or CRAN, have a single global namespace in which projects share libraries. Anaconda.org instead offers a user-centric platform for sharing binary files: each user has their own namespace in which to upload and share packages, which allows private repositories and channel management to isolate development and production environments.

To upload the previous two conda packages to my Anaconda.org account, I can run:

binstar upload ~/anaconda/conda-bld/osx-64/elasticsearch-1.5.0-py27_0.tar.bz2

binstar upload ~/anaconda/conda-bld/osx-64/r-ldavis-0.2-r2.10_0.tar.bz2


Those packages are now available in my Anaconda.org channel, https://anaconda.org/chdoig, and everyone can easily download them by running:

conda install -c chdoig elasticsearch r-ldavis

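As an alternative to passing -c on every install, users can add the channel to their configuration once, the same way we added the r channel earlier:

conda config --add channels chdoig
conda install elasticsearch r-ldavis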


Managing environments

In a data science project, it is crucial to manage library dependencies and versions to ensure reproducibility, and conda env allows us to do so easily. Here’s an example environment.yml taken from topik:

name: topik
channels:
  - r
  - chdoig
dependencies:
  - blaze
  - bokeh
  - scipy
  - numpy
  - pandas
  - r
  - r-ldavis
  - r-matrix
  - r-data.table
  - pip
  - pip:
    - pattern
    - gensim
    - textblob
    - ijson
    - click
    - solrpy
    - elasticsearch


To create an environment from an environment.yml file, run the following in the folder where that file exists:

conda env create


This fetches all of the listed dependencies and any of their dependencies (even those not explicitly included). Conda handles all of this additional dependency resolution for us.
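
Once the command finishes, we can sanity-check the result:

# List all environments; topik should now appear
conda env list

# Inspect the packages installed into it
conda list -n topik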

Once the conda environment is fully created, we can activate it, which makes everything in it available and ready for our code to use. We do that by running:

source activate topik


Note: Windows users, remember to replace the last command with activate topik.

If we want to save a complete listing (or frozen snapshot) of all packages in the current conda environment, we can run:

conda env export > freeze.yml

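The exported file pins exact versions and build strings. Here is a truncated illustration of what it might contain (the version and build numbers are examples, not real output):

name: topik
channels:
- r
- chdoig
dependencies:
- bokeh=0.8.2=np19py27_1
- numpy=1.9.2=py27_0
- pandas=0.16.0=np19py27_0
...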

To reproduce that exact environment on another machine, copy the file over there and run:

conda env create -f freeze.yml
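
If the environment.yml file in a project later changes, collaborators don’t need to recreate the environment from scratch; assuming a recent enough version of conda env, they can update it in place from the project folder:

conda env update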


Conclusion

To summarize, data scientists can use the presented workflow to share their environments and make sure that users and developers get all of the required dependencies with the right versions. They just need to add the environment.yml file to their project and share the following instructions with users:

git clone https://github.com/ContinuumIO/topik.git
cd topik
conda env create
source activate topik


Note: Windows users, remember to replace the last command with activate topik.

Learn More

To learn more about conda for data science, check out my PyData talk on “Reproducible Multi-language Data Science with Conda” and the accompanying slides.

See a video explanation of conda vs. Anaconda below:

[Embedded video: conda vs. Anaconda]

About the Author

Christine Doig

Sr. Data Scientist, Product Manager

Christine is a Senior Data Scientist at Anaconda. She has over five years’ experience in analytics, operations research and machine learning in a variety of industries, including energy, manufacturing and banking. At Anaconda, she worked …

