tl;dr: We discuss how data scientists working with Python, R, or both can benefit from using conda in their workflow.
Conda is a package and environment manager that can help data scientists manage their project dependencies and easily share environments with their peers. Conda works with Linux, OSX, and Windows, and is language agnostic, which allows us to use it with any programming language or even multi-language projects.
This post explores how to use conda in a multi-language data science project. We’ll use a project named topik, which combines Python and R libraries, as an example.
Reproducibility, Data Science, and Programming Languages
Data scientists need to easily reproduce their analysis and share their findings with others. The field is evolving rapidly and new programming languages, tools, and frameworks are becoming available to solve challenges. Although Python and R lead the data science programming languages ranking, others like Julia, Lua, Scala, or Clojure are also entering the field. One of the struggles many data scientists face is managing reproducibility in this complex multi-language scenario.
Conda simplifies the management of projects by allowing data scientists to:
- Manage project dependencies, including programming languages and libraries
- Isolate development and production environments with channels
- Share environments with a minimal footprint
- Support multiple versions of languages, storage systems, and packages
- Provide a common interface for building, installing, and sharing packages
Introduction to conda
Conda is a cross-platform tool for managing packages and environments. Packages are libraries, tools, scripts, and configuration files that provide functionality: anything you need to share, from R to pandas, and from MongoDB or Spark to different versions of Python. Environments are collections of those packages that interact with each other to provide a context in which to execute your code. Both packages and environments can be packaged and shared across all three major operating systems with conda.
Conda vs Anaconda vs Miniconda
Anaconda is a free, enterprise-ready Python distribution that comes with 150 of the most popular Python packages for science, math, engineering, and data analysis already installed. Anaconda includes conda to manage libraries and environments. An additional 250 packages can be installed with the conda install command.
Miniconda is a lighter alternative to Anaconda. Miniconda comes with just Python and conda. It is an ideal option for users who want to start with a minimal installation and manage their own set of packages.
Conda vs pip
Python programmers are probably familiar with pip to download packages from PyPI and manage their requirements. Although both conda and pip are package managers, they are very different:
- Pip is specific for Python packages and conda is language-agnostic, which means we can use conda to manage packages from any language
- Pip compiles from source and conda installs binaries, removing the burden of compilation
- Conda creates language-agnostic environments natively whereas pip relies on virtualenv to manage only Python environments
Though it is recommended to always use conda packages when available, conda also includes pip, so you don’t have to choose between the two. For example, to install a Python package that does not have a conda package but is available through pip, just run:
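The package name below is just a placeholder; any PyPI-only package works the same way:

```shell
pip install some-pypi-only-package
```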
As we have mentioned earlier, conda allows us to install any binary package from a specific version of Python, to a data store, a Python library, or even an R library. For example:
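A few illustrative commands (the package names and version numbers are examples, not requirements of the project):

```shell
conda install python=3.4      # a specific version of Python
conda install mongodb         # a data store
conda install pandas          # a Python library
conda install -c r r-ldavis   # an R library from the r channel
```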
To configure your conda installation to download from specific channels, like the R channel, you’ll need to add it to your configuration:
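For example, to add the r channel:

```shell
conda config --add channels r
```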
A reproducible multi-language data science project
Goals and summary steps
In this post, we’ll learn how to use conda env in our workflow to be able to share our code and all of its dependencies with our team members, by just doing:
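The repository URL below is a placeholder for wherever the project is hosted, and the environment name is assumed to match the project:

```shell
git clone https://github.com/<username>/topik.git
cd topik
conda env create
source activate topik
```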
Note: For Windows users, change the last command for:
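```shell
activate topik
```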
The three steps to follow to make it possible are:
- Build and share conda packages of our dependencies
- Upload those packages to a repository
- Create an environment
Building conda Python packages with conda skeleton
To build a conda package we need a conda recipe, which consists of a meta.yaml file, a build.sh (Linux, OSX), and a bld.bat (Windows). If the package is already available on PyPI, we can use conda skeleton to help us create them. For example, for the required elasticsearch Python package, run:
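```shell
conda skeleton pypi elasticsearch
```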
This generates the meta.yaml, build.sh, and bld.bat files under a folder named elasticsearch.
Once those files exist, we can build the conda package by running:
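```shell
conda build elasticsearch
```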
We then have a conda package for the PyPI library elasticsearch for our platform, for example on osx-64:
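The exact file name depends on the package version and build string; it will look something along these lines:

```
~/anaconda/conda-bld/osx-64/elasticsearch-<version>-py27_0.tar.bz2
```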
We can either share the file and install it locally:
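```shell
conda install --use-local elasticsearch
```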
or upload the file to Anaconda.org (formerly Binstar), a repository for conda packages, as described in the section “Sharing packages”.
We can also easily create packages for all other platforms by running:
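The output directory name here is arbitrary:

```shell
conda convert --platform all elasticsearch-*.tar.bz2 -o outputdir/
```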
Building a conda R package
We can also use conda skeleton to build conda R packages for packages that are already available on CRAN. In our topik project example, we can create a conda R package for the ldavis CRAN package with:
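```shell
conda skeleton cran ldavis
```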
This again generates the three recipe files. Note that even though the R package is called ldavis, the conda package is called r-ldavis, to avoid name conflicts between packages from multiple languages with similar names.
We then run:
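```shell
conda build r-ldavis
```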
and get the package:
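Again, the exact file name depends on the version and build:

```
~/anaconda/conda-bld/osx-64/r-ldavis-<version>-<build>.tar.bz2
```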
Sharing packages
Once we have those conda packages, we will need to make them available to users. Some package management systems, like PyPI or CRAN, have a single globally unique project namespace for sharing libraries. Anaconda.org offers a user-centric platform to share binary files. Instead of a single global repository namespace, users have their own namespace to upload their packages and share them, allowing private repositories and channel management to isolate development and production environments.
To upload the previous two conda packages to my Anaconda.org account, I can run:
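The anaconda client provides the upload command (it can be installed with conda install anaconda-client):

```shell
anaconda upload elasticsearch-*.tar.bz2
anaconda upload r-ldavis-*.tar.bz2
```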
Now those packages are available in my Anaconda.org channel https://anaconda.org/chdoig and everyone can easily download them by running:
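```shell
conda install -c chdoig elasticsearch r-ldavis
```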
Learn more about Anaconda here.
To create an environment from an environment.yml file, run the following in the folder where that file exists:
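A minimal environment.yml for a Python and R project might look like the sketch in the comments below (the exact names and channels depend on your project):

```shell
# A hypothetical environment.yml for a Python + R project:
#
#   name: topik
#   channels:
#     - chdoig
#   dependencies:
#     - python
#     - elasticsearch
#     - r-ldavis
#
conda env create
```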
This fetches all of the listed dependencies and any of their dependencies (even those not explicitly included). Conda handles all of this additional dependency resolution for us.
With a fully created conda environment, we can now activate it. This will make everything available and ready to use by our code. We can do that by running:
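Assuming the environment is named topik, as in our example:

```shell
source activate topik
```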
Note: Windows users, remember to change the last command for
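```shell
activate topik
```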
If we want to save a complete listing (or frozen snapshot) of all packages in the current conda environment, we can run:
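```shell
conda env export > environment.yml
```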
To reproduce that exact environment on another machine, copy the file over there and run:
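```shell
conda env create -f environment.yml
```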
To summarize, data scientists can use the presented workflow to share their environments and make sure that users and developers get all the required dependencies with the right versions. Data scientists just need to add the environment.yml file to their project and share the following instructions with users:
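The environment name here is whatever the name field in the environment.yml specifies, topik in our example:

```shell
conda env create
source activate topik
```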
Note: Windows users, remember to change the last command for
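```shell
activate topik
```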
To learn more about conda for data science, check out my PyData talk on “Reproducible Multi-language Data Science with Conda” and slides.