Productionizing and Deploying Data Science Projects
Anaconda Team
An end-to-end data science workflow includes stages for data preparation, exploratory analysis, predictive modeling, and sharing/dissemination of the results. At the final stages of the workflow, or even during intermediate stages, data scientists within an organization need to be able to deploy and share the results of their work for other users (both internal analysts and external customers) to consume.
When developing a data science project or analysis, the types of outputs and assets generated during the workflow can include:
- Reports for sharing and disseminating information through a combination of computational results, visualizations, and data narratives
- Interactive visualizations and dashboards for self-service exploration within an organization
- Endpoints for other developers or end-users to be able to query and consume data and/or model predictions
- Functionality or libraries that feed into other tasks in the data science workflow
In previous webinars, blog posts, and tutorials, we’ve discussed how to make use of Open Data Science tools to build rich, interactive dashboards and applications:
- Supercharge Your Data Science Team: 3 Interactive Dashboards You Can Build Before The New Year
- Predict. Share. Deploy. With Open Data Science: Use Anaconda to Build, Publish, and Operationalize Predictive Models
- Hassle Free Data Science Apps: Build Richly Interactive Visualizations on Streaming & Big Data with Open Source
- Bokeh Rich Dashboards
- Building Python Data Apps with Blaze and Bokeh
In addition to the above resources, there is no shortage of information in books, blog posts, tutorials, webinars, etc. related to developing data science applications that implement machine learning models, natural language processing techniques, engineering and scientific workflows, image processing pipelines, or other types of analyses.
In this blog post, we’ll focus on the stage of the data science workflow that comes after developing an application: productionizing and deploying data science projects and applications. We’ll discuss best practices, recommended tools, and common workflows based on our experience working with our customers on the Anaconda Enterprise platform, as well as custom consulting and training solutions.
Productionizing Data Science Projects
First, we’ll consider what it means to productionize data science code. Data scientists, business analysts, and developers often work on their own laptop or desktop machines during the initial stages of the data science workflow. At some intermediate or later stages in their workflow, they want to encapsulate and deploy a portion or all of their analysis in the form of libraries, applications, dashboards, or API endpoints that other members of the data science team can leverage to further extend, disseminate, or collaborate on their results.
The process of productionizing data science assets can mean different workflows for different roles and organizations, depending on the asset being productionized. Getting code ready for production usually involves code cleanup, profiling, optimization, testing, refactoring, and reorganizing the code into modular scripts or libraries that can be reused in other notebooks, models, or applications.
Once the code is ready for production, a number of additional factors must be considered to ensure that the deployed projects and applications are robust, performant, reliably accessible, secure, and scalable.
Let’s consider an example use case in the deployment stage of the data science workflow. A data scientist has created an end-to-end analysis in an interactive notebook environment that imports and cleans data, then classifies the data using various clustering algorithms. Now, they want to transform this analysis into a report for business analysts that gets updated daily, as well as an interactive web application for all of the users within their organization to interact with.
Another example use case involves a data engineer who wants to transform a notebook used for exploratory analysis into a library that other data scientists and developers within the organization can reuse, as well as a web application and REST API that let data scientists export and consume the post-processed, cleaned data in their own analyses. In the following section, we’ll explore the details of these data science deployment use cases, describe different types of data science assets, and explain how to deploy and scale data science projects beyond the scope of the original developer’s machine.
Deploying Data Science Projects
When starting an analysis from a Python script or Jupyter notebook, there are many different approaches that can be used to transform this code into an asset that can be leveraged and consumed by many different users and roles within a data science team. Depending on the desired output, different types of data science assets can include:
- Reports that can be deployed as hosted, static notebooks
- Code that can be encapsulated inside of a package and shared for reusability
- Dashboards and applications that can be deployed and used across an organization
- Machine learning models that can be embedded in web applications or queried via REST APIs
Hosted, Static Notebooks
Jupyter Notebooks allow data scientists and developers to work in a rich, interactive environment that combines code, narrative, and visualizations with access to all of the Open Data Science functionality available in Anaconda.
Anaconda Enterprise Notebooks builds on top of the Jupyter ecosystem and provides enterprise authentication, project management, and secure collaboration to your enterprise data science workflows.
Once you’ve created a notebook and run an analysis, you can upload a static version of the rendered notebook to services such as Anaconda Cloud, Jupyter nbviewer, or Github. You can also publish notebooks and projects within your organization using an on-premise installation of Anaconda Repository.
These components and functionality allow end-users to view the resulting output and visualizations (both static and interactive) without requiring access to the compute resources or data sources that were used in the original analysis.
You can also typically version your notebooks within these services or a version control system, such as Git or Subversion, so that you can track revisions and history as your data science analysis evolves. Jupyter Notebooks are stored in JSON/text format and can also be exported as Python, HTML, Markdown, reStructuredText, or PDF and shared in the desired format.
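As a minimal sketch of the export step (using nbconvert's Python API, with a hypothetical notebook filename), you can convert a rendered notebook to a standalone HTML page that can then be uploaded to one of the hosting services mentioned above:

```python
# A minimal sketch of exporting a notebook to HTML with nbconvert's Python API;
# the filename "analysis.ipynb" is illustrative.
# Equivalent CLI: jupyter nbconvert --to html analysis.ipynb
import nbformat
from nbconvert import HTMLExporter

# Read the executed notebook from disk
notebook = nbformat.read("analysis.ipynb", as_version=4)

# Convert the notebook (code, narrative, and rendered output) to standalone HTML
exporter = HTMLExporter()
body, resources = exporter.from_notebook_node(notebook)

with open("analysis.html", "w") as f:
    f.write(body)
```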
The hosted, static notebook deployment scenario only requires basic considerations around productionization, since the above hosted components handle most of the availability, scalability, and security issues around deployment.
One limitation of this approach is that the hosted, rendered notebooks are not executable or interactive (aside from the interactive visualization of static data) and do not have notebook kernels or Python processes attached to them, so they need to be updated and re-uploaded manually as needed.
Reusable Libraries and Functions
Data scientists typically want to take analysis code that’s been developed in a notebook during exploratory stages and move it to production to be inserted or reused in other components within a data science project.
Several approaches can be used to productionize existing code, including packaging the code in a library or wrapping it with a REST API endpoint. The latter approach is typically used when embedding the functionality in web applications and has the benefit of working across many different programming languages and web frameworks.
In the first approach, a developer can create a Python, R, or other library, build the library as a conda package, then upload the library to an on-premise instance of Anaconda Repository and share it within their organization with access control and revision history. The library can then be installed and reused by other Anaconda users and developers within an organization. The library can continue to be iteratively developed and updated while the changes and version history are tracked in Anaconda Repository.
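As an illustration of the first approach, the sketch below shows a small, hypothetical data-cleaning module refactored out of a notebook; the package name, column names, and function are placeholders. Once code is organized this way and given a conda recipe, it can be built and shared within your organization.

```python
# datacleaning/core.py -- a hypothetical reusable module extracted from a notebook.
# After adding a conda recipe (meta.yaml), the package can be built and shared with,
# for example: `conda build ./recipe` followed by `anaconda upload <path-to-built-package>`.
import pandas as pd


def load_and_clean(path):
    """Load a CSV file and apply the cleaning steps developed in the notebook."""
    df = pd.read_csv(path)
    df = df.dropna(subset=["customer_id"])            # drop rows missing the key column
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    return df
```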
The second approach is to create a Python script that can be deployed with an API endpoint that serves cleaned data, model predictions, or other post-processed output or results. A more detailed example of this approach is discussed later in the “Machine Learning Models with REST APIs” section.
Dashboards
Dashboards have become a popular way for data scientists to deploy and share the results of their exploratory analysis in a way that can be consumed by a larger group of end-users within their organization. Dashboards are useful to better understand trends in the data during exploratory analyses, to visually summarize information when presenting project conclusions, or for reporting and monitoring business metrics on an ongoing basis.
There are a number of different ways to develop and deploy dashboards. One of the most popular development environments that our customers and users employ in their own data science workflows is Jupyter Notebooks. There is ongoing work in the Jupyter ecosystem to extend the functionality of notebooks for use cases related to dashboard construction and deployment.
Jupyter Dashboards is an incubating project within the Jupyter ecosystem that supports interactive development workflows and the easy deployment of dashboards. Jupyter Dashboards include functionality for interactively designing grid-like or report-like layouts, creating bundled notebooks and assets, and serving dashboards as web applications, all from the familiar Jupyter Notebook interface.
The components involved in Jupyter Dashboards include a drag-and-drop dashboard layout extension, a notebook-to-dashboard converter/bundler extension, and a dashboard server to host the bundled dashboards and assets.
To install, configure, and enable the components required to deploy a Jupyter Dashboard, the following Jupyter Notebook extensions and dashboard server should be installed both on the machine running the Jupyter Notebook and on a deployment server that will host the deployed dashboards:
- https://github.com/jupyter-incubator/dashboards
- https://github.com/jupyter-incubator/dashboards_bundlers
- https://github.com/jupyter-incubator/dashboards_server
- https://github.com/jupyter/kernel_gateway
These components can be installed manually using the documentation linked above, or they can be installed using the Jupyter Notebook functionality included with Anaconda Scale, which is available as part of Anaconda Enterprise subscriptions.
Once you’ve installed the necessary components and created the layout for a dashboard in a Jupyter Notebook, you can deploy the dashboard by selecting File > Deploy As > Dashboard on Jupyter Dashboards Server from the Jupyter menu. The Jupyter Dashboards documentation provides more detail on installing, configuring, and deploying dashboards.
Interactive Applications
When considering how to productionize and deploy data science projects and assets, interactive applications are a powerful and flexible method to create and share custom functionality. These applications can be used by other members of your data science team to view, consume, and interact with the results of data science analyses. Interactive web applications typically extend beyond the scope of dashboards and allow end-users to explore data in more detail, select and modify modeling algorithms, or export data for further analysis.
Bokeh is a powerful interactive visualization library with bindings for Python, R, Scala, and Lua that can be used to develop and deploy interactive, real-time applications without needing complex web or design frameworks. The resources listed in the introduction of this blog post provide more details about how to build interactive applications with Bokeh.
In contrast to hosted, static notebooks or an interactive Bokeh plot that displays static data, an interactive Bokeh application includes a server-side component and active processes that allow users to interact with the application, models, and data via widgets such as sliders, drop-down menus, text fields, data selectors, and others.
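A minimal sketch of such an application (the data and widget here are placeholders) defines its plots and widget callbacks in an ordinary Python script and attaches them to the current Bokeh document:

```python
# app.py -- a minimal Bokeh server application; run with: bokeh serve app.py
import numpy as np
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Slider
from bokeh.plotting import figure

x = np.linspace(0, 10, 200)
source = ColumnDataSource(data=dict(x=x, y=np.sin(x)))

plot = figure(title="Interactive sine wave")
plot.line("x", "y", source=source)

freq = Slider(start=0.5, end=5, value=1, step=0.1, title="Frequency")

def update(attr, old, new):
    # Recompute the curve on the server whenever the slider value changes
    source.data = dict(x=x, y=np.sin(freq.value * x))

freq.on_change("value", update)

curdoc().add_root(column(freq, plot))
```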
You can view more examples of interactive Bokeh applications, including applications to interactively explore weather statistics, stock market data, linked histograms/distributions, and more.
Once you’ve developed an interactive application with Bokeh, you can deploy and make it available to other users within your organization. The Bokeh documentation provides more detail on multiple deployment scenarios and configurations, including using a standalone Bokeh server, SSH tunneling, reverse proxy server, load balancing, and supervised processes.
When deploying applications with a small number of users and minimal computational overhead, the standalone Bokeh server is typically adequate. When deploying more complex and resource-intensive applications, Bokeh is commonly used with NGINX for reverse proxying and load balancing functionality. We also commonly use the Supervisord process control system to manage the necessary services for an application.
Machine Learning Models with REST APIs
After developing exploratory analyses in scripts and notebooks, data scientists often want to serve output from scripts and applications in the form of REST APIs. Other developers or data scientists can then build additional layers of visualizations, dashboards, or web applications that consume data from the API endpoints (e.g., cleaned data or trained machine learning models) and drive progressive stages of a data science pipeline.
Depending on the desired output and the stage in the data science workflow, a REST API can be implemented and configured within a Python application to serve different types of data, including numeric data for analytics and metrics, text data from database records or various data sources, or numerical predictions from machine learning models.
Python connects to a rich set of machine learning and deep learning libraries such as scikit-learn, Theano, Keras, H2O, and TensorFlow, and provides a wide range of web frameworks such as Django (with the Django REST framework) and Flask (with Flask-RESTful). This combination of machine learning libraries, web frameworks, and other functionality makes Python a great language for deploying data science projects, because you can handle different aspects of application development and deployment in a single language.
An example use case of deploying a data science application that involves machine learning and API endpoints is described in a blog post on deploying a scikit-learn classifier to production. In that example, a trained version of a machine learning classifier that was implemented using scikit-learn is deployed to a remote server with a REST API that can be queried for model predictions.
The tools used in that particular workflow included Joblib (for serializing the model) and Flask (for wrapping the classifier in an API). The trained classifier model and API was then deployed to a remote, cloud-hosted server using Anaconda (for environment and dependency management) and Fabric (for configuration management).
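A minimal sketch of that pattern looks like the following; the model, route, and input format are illustrative and not the exact code from the referenced post:

```python
# serve_model.py -- a minimal sketch of serving a scikit-learn classifier behind a REST API;
# the model, route, and input format are illustrative only
import joblib
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train and serialize the classifier (normally done in a separate training step)
X, y = load_iris(return_X_y=True)
joblib.dump(RandomForestClassifier(n_estimators=50).fit(X, y), "classifier.pkl")

app = Flask(__name__)
model = joblib.load("classifier.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```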
Considerations When Deploying Data Science Projects
The above sections provided some implementation-specific information and tools that can be used to deploy data science assets such as notebooks, dashboards, or interactive applications in different scenarios, including hosted notebook and dashboard servers, RESTful API frameworks, and process management. There are additional aspects that should be considered when deploying any type of data science asset or project within your organization.
When defining a process for data scientists and users to productionize and deploy their own custom applications, notebooks, or dashboards within your organization, you’ll also want to consider the general production details described in the following sections to ensure that the deployed applications and compute infrastructure are robust, stable, secure, and scalable, among other considerations.
Provisioning Compute Resources
Before you deploy your data science project, you’ll need to reserve and allocate compute resources from a cloud-based provider (Amazon AWS, Google Cloud Platform, or Microsoft Azure) or one or more bare-metal machines within your organization that will act as deployment servers.
Depending on factors such as the computational resources (CPU, memory, disk) required for your applications, number of concurrent users, and expected number of deployed applications, your use case might require one or more deployment servers in a cluster.
Once you’ve identified compute resources for your use case, you’ll need to install and configure the production environments on the machines, including system-wide configuration, user management, network and security settings, and the system libraries required for your applications. Some enterprise-grade configuration management tools that we commonly use to manage deployment servers and machines within cluster environments include Fabric, Salt, Ansible, Chef, and Puppet.
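As a small example of what this can look like with one of those tools (here Fabric 2.x; the hostname, installer URL, and environment contents are illustrative only), a provisioning script might run remote commands like these:

```python
# provision.py -- a minimal sketch using Fabric 2.x; the hostname, installer URL,
# and environment contents are illustrative only
from fabric import Connection

def provision(host):
    conn = Connection(host)
    # Install Miniconda on the deployment server and create the application environment
    conn.run(
        "wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh "
        "-O miniconda.sh && bash miniconda.sh -b -p $HOME/miniconda"
    )
    conn.run("$HOME/miniconda/bin/conda create -y -n app python=3.10 flask scikit-learn bokeh")

if __name__ == "__main__":
    provision("deploy-server-01.example.internal")
```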
Managing Dependencies and Environments
When you deploy a data science application, you’ll want to ensure that it’s running in an environment with the appropriate versions of Python or R and the libraries that your application depends on, including numerical, visualization, machine learning, and other data science packages and their corresponding C/C++/Fortran/etc. libraries.
This problem is easy to solve using Anaconda with multiple conda environments, which are separate from the system/framework installation of Python on your deployment servers and can be used to manage as many Python and R environments and package versions as your applications need. Conda Kapsel is a newer project in the conda ecosystem that allows you to define, manage, and run reproducible, executable projects and their dependencies, including conda environments, data files, and services.
Containerization and virtualization management layers such as Docker or Vagrant can also optionally be used with Anaconda to improve the portability of your application in various environments. Anaconda makes this aspect of data science deployment easy by integrating with various cloud providers, containerization, and virtualization technologies.
Ensuring Availability, Uptime, and Monitoring Status
Once you’ve deployed your data science application, you’ll need to ensure that your application’s runtime environment and processes are robust and reliably available for end-users. You might also want to set up log aggregation, uptime monitoring, or logging alert systems such as Sentry or Elasticsearch/Logstash.
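At a minimum, deployed applications should emit logs that an aggregation system can collect; a basic sketch (the log path and format are illustrative) looks like this:

```python
import logging

# Write application logs to a location that a log shipper (e.g., a Logstash or
# similar agent) can collect; the path and format below are illustrative
logging.basicConfig(
    filename="/var/log/myapp/app.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

logger = logging.getLogger("myapp")
logger.info("application started")
```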
Some operational details to consider: How will you troubleshoot and inspect long-running applications? How will you monitor the load and demand on the deployment servers? How will you ensure that compute resources are shared reasonably among users and applications? How will you handle application failures? And how will you ensure that an application you deploy today will still be available in a few months?
Engineering for Scalability
Before you deploy your data science project, you’ll need to estimate the scalability limits of the computational load and overhead of the applications. Do you expect 10, 100, or 1000 users to access your application concurrently? Is your application optimized for performance and able to leverage accelerated functions of your hardware, including GPUs?
To allow demanding applications or large numbers of concurrent users to scale, you might need to configure load balancing and reverse proxy functionality for your web application servers, for example with NGINX and Gunicorn, so that your application remains responsive and scalable under heavy load and peak usage conditions.
Once your data science applications are deployed, they typically undergo a continuous iterative development/devops cycle to add functionality or improve the performance of queries or data access as the application matures. In a collaborative environment with one or more data science or development teams, you’ll need to track changes and versions of the deployed applications and their dependencies to avoid application errors or dependency conflicts.
Sharing Compute Resources
When multiple users in your organization are running exploratory analyses, sharing and collaborating within notebooks, and deploying various data science applications, you’ll need to ensure that the compute resources (CPU, RAM, disk) on your cluster can be reasonably shared between users and applications.
This can be accomplished using resource managers (e.g., Apache YARN or Apache Mesos, which are popular in Hadoop environments) or job schedulers (e.g., SGE/OGE, SLURM, or Torque, which are popular in HPC environments). These distributed service and management frameworks are typically installed and configured on a cluster by system administrators or IT and can be configured with job queues for different types of applications and needs.
Securing Data and Network Connectivity
The data science applications deployed within your organization will likely be accessing data stored in files, a database, or a distributed/online file system such as Amazon S3, NFS, HDFS, or GlusterFS in different formats such as CSV, JSON, or Parquet. You’ll need to ensure that your deployment server(s) have the appropriate network/security configuration and credentials for your applications to securely access the data and file servers without exposing your data or compute resources to risky situations.
It’s also good practice to configure your application with the ability to securely access data without exposing database or account credentials to end-users in notebooks, plain text files, or scripts under version control.
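One simple pattern for this is to read credentials from environment variables (or a secrets store) configured on the deployment server; a minimal sketch, with placeholder variable names and a placeholder connection string:

```python
import os
from sqlalchemy import create_engine

# Credentials are injected via environment variables set on the deployment server,
# so they never appear in notebooks or scripts under version control.
# The variable names and connection string below are placeholders.
db_user = os.environ["APP_DB_USER"]
db_password = os.environ["APP_DB_PASSWORD"]
db_host = os.environ.get("APP_DB_HOST", "localhost")

engine = create_engine(
    "postgresql://{user}:{password}@{host}:5432/analytics".format(
        user=db_user, password=db_password, host=db_host
    )
)
```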
Securing Network Communications and SSL
When you deploy an application, dashboard, or notebook, you’ll likely want to use end-to-end encryption for your network and API communication via HTTPS to ensure that traffic between your users and the application is secure. This might involve configuring your web application servers to use SSL certificates, certificate authorities, and secure proxies as needed, with the appropriate hooks and layers for your end-user applications.
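For example, a Flask-based service can be pointed directly at a certificate and key pair (the file names here are placeholders); in production, SSL termination is more commonly handled by a reverse proxy in front of the application:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "secure hello"

if __name__ == "__main__":
    # Serve over HTTPS using a certificate/key pair; in production, SSL termination
    # is more commonly handled by a reverse proxy in front of the application
    app.run(host="0.0.0.0", port=8443, ssl_context=("cert.pem", "key.pem"))
```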
Managing Authentication and Access Control
If you’re deploying a data science application to your own infrastructure and want to restrict access to a subset of users within your organization, you’ll need to implement layers of authentication and access control that can be used on a per-application or per-project basis.
This can be implemented via various HTTP authentication methods, a web framework such as Django or Flask (with various authentication backends), or a third-party authentication service/API. You might also need to integrate your deployed data science applications with your organization’s enterprise authentication mechanisms and identity management systems (e.g., LDAP, AD, or SAML).
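A minimal sketch of per-application HTTP Basic authentication in Flask looks like the following; the credential check is a placeholder, and in practice it would call out to LDAP, AD, or another enterprise identity backend:

```python
from functools import wraps
from flask import Flask, Response, request

app = Flask(__name__)

def check_credentials(username, password):
    # Placeholder check; a real deployment would validate against LDAP, AD,
    # or another enterprise identity backend
    return username == "analyst" and password == "not-a-real-password"

def requires_auth(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        auth = request.authorization
        if not auth or not check_credentials(auth.username, auth.password):
            return Response(
                "Authentication required", 401,
                {"WWW-Authenticate": 'Basic realm="Data Science Apps"'},
            )
        return view(*args, **kwargs)
    return wrapped

@app.route("/dashboard")
@requires_auth
def dashboard():
    return "Restricted dashboard content"
```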
Scheduling Regular Execution of Jobs
After you deploy your data science project, you might want to incorporate scheduled execution intervals for a notebook, dashboard, or model so that the end-users will always be viewing the most up-to-date information that incorporates the latest data and code changes.
This can be accomplished via a Cron job scheduler or via workflow/pipeline managers such as Apache Airflow (incubating) or Luigi. Using these tools, you can configure your data science application to run at regular intervals every few minutes, hours, or days to perform tasks such as data ingestion, data cleaning, updated model runs, data visualization, and saving output data.
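For example, a daily pipeline expressed as an Airflow DAG might look like the sketch below; this assumes Airflow 2.x, and the task commands, IDs, and schedule are illustrative:

```python
# dags/refresh_daily_report.py -- a minimal sketch, assuming Airflow 2.x;
# task commands, IDs, and the schedule are illustrative
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="refresh_daily_report",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest_data", bash_command="python ingest.py")
    retrain = BashOperator(task_id="retrain_model", bash_command="python train.py")
    publish = BashOperator(
        task_id="publish_report",
        bash_command="jupyter nbconvert --to html --execute report.ipynb",
    )

    ingest >> retrain >> publish
```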
Additional Resources for Productionizing and Deploying Data Science Projects
The productionization and deployment of data science assets such as notebooks, dashboards, interactive applications, or models involves tools that are usually specific to a particular type of asset. Relatively simple deployments (a single deployment server, a small number of users, minimal computational requirements, and minimal or no security requirements) are best described as self-service data science workflows and can be achieved using out-of-the-box hosted notebook or dashboard server components.
The additional considerations related to more complex, secure, and scalable data science deployments outlined in this blog post are not particularly straightforward and require coordinated design, configuration, and ongoing maintenance of complex frameworks and infrastructure.