Deriving Business Value from Data Science Deployments
Nov 21, 2018 By Anaconda Team
One of the biggest challenges facing organizations trying to derive value from data science and machine learning is deployment. In this post, we’ll take a look at three common approaches to deploying data science projects, and how Anaconda Enterprise simplifies deployment and allows data scientists to focus on building better models that generate business value rather than wasting time on infrastructure.
By deployment, we mean making data science assets available for use by other people or tools. Unlike your data scientist colleagues, these people don’t want to write or even look at code. Show them your Python script, R IDE, or Jupyter Notebook and they will stare back at you as though you handed them a Sanskrit text. They don’t want to see code! Instead, they just want to read your report, interact with your dashboard, or consume your REST API. Deployment is the process of making these things—dashboards, web apps, APIs—available to end users.
While deployment is conceptually simple, actually deploying things like runnable Jupyter Notebooks, dashboards and web applications, and REST APIs can be difficult. And while infrastructure tools like Docker and Kubernetes help tremendously, they come with a significant learning curve for data scientists who want to use them.
Deploying A Model As A REST API
Let’s assume we want to deploy a model as a REST API. Our model takes in data from our marketing database and returns a prediction as to whether a given lead is good or not. Our marketing team wants to use this model on our website to make targeted offers to prospects the model thinks are good leads. So, we decide to build a web service (REST API) that the web developers will call to determine if a lead is good or bad.
We use a web framework like Flask (for Python) or Plumber (for R) to serve our model. Now, what do we do with it? How do we actually deploy this model as a REST API that a team of developers can use?
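As a rough illustration of what such a service does, the sketch below exposes a single prediction endpoint. To stay dependency-free it uses a plain WSGI callable from Python's standard library instead of Flask, and `score_lead`, its hand-written rule, the field names, and the `/predict` path are all hypothetical stand-ins for a real trained model and schema.

```python
import json


def score_lead(lead):
    """Hypothetical stand-in for a trained model: returns a score between 0 and 1."""
    score = 0.2
    if lead.get("visited_pricing_page"):
        score += 0.5
    if lead.get("company_size", 0) >= 100:
        score += 0.2
    return round(score, 2)


def _respond(start_response, status, payload):
    """Serialize payload as JSON and send it with the given HTTP status."""
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """WSGI application: POST a JSON lead to /predict, get a JSON prediction back."""
    if environ.get("PATH_INFO") != "/predict" or environ.get("REQUEST_METHOD") != "POST":
        return _respond(start_response, "404 Not Found", {"error": "not found"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        lead = json.loads(environ["wsgi.input"].read(length) or b"{}")
    except (ValueError, KeyError):
        return _respond(start_response, "400 Bad Request", {"error": "invalid JSON"})
    if not isinstance(lead, dict):
        return _respond(start_response, "400 Bad Request", {"error": "expected an object"})
    score = score_lead(lead)
    return _respond(start_response, "200 OK", {"score": score, "good_lead": score >= 0.5})
```

Because `app` follows the WSGI calling convention, any WSGI server can host it unchanged, which is what makes the deployment options below interchangeable from the code's point of view.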
Let’s discuss how we might deploy this API using:
- A traditional approach (provisioning infrastructure, then moving code and dependencies)
- A Docker/Kubernetes approach
- Anaconda Enterprise (hint: it’s going to be really easy and you are guaranteed to love it)
Traditional Approach To Deployment
A traditional approach to model deployment would be to provision infrastructure and manually move our source code and dependencies to the target environment.
For example, if we are using a cloud provider, we might spin up a VM on AWS, Google Cloud, Azure, Digital Ocean, etc. This VM would serve as our infrastructure.
Then we might pull down our model’s source code from a Git server and manually install any dependencies (e.g., pandas, scikit-learn) that our model needs to run. Once we can run our model on the VM we provisioned, we might put a web server and a WSGI server in front of it to route incoming traffic to our model. In the Python world, Nginx (as the reverse proxy) and Gunicorn (as the WSGI server) are a popular combination for this.
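Gunicorn's configuration is itself a small Python file, so that part of the setup can be sketched directly. The values below are illustrative assumptions rather than recommendations for any particular workload: the bind address presumes Nginx forwards requests to local port 8000, and the module path `app:app` in the usage note is a placeholder.

```python
# gunicorn.conf.py -- a minimal sketch; every value here is an assumption to tune.
import multiprocessing

# Listen only on the loopback interface; a reverse proxy such as Nginx
# is assumed to receive public traffic and forward it here.
bind = "127.0.0.1:8000"

# Gunicorn's documentation suggests (2 x cores) + 1 workers as a starting point.
workers = multiprocessing.cpu_count() * 2 + 1

# Kill and restart a worker that has been silent for more than 30 seconds.
timeout = 30

# "-" sends the access log to stdout and the error log to stderr.
accesslog = "-"
errorlog = "-"
```

With this file in place, the server would be started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` names the module and the WSGI callable inside it.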
Once our model is deployed, we’ll then map the IP address of our host machine to a DNS record. This way our model’s end users can call the model via a URL such as <model>.company.com rather than the IP address of the host machine. And of course, we want to generate TLS certs for our model’s domain as well.
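From the web developers' side, calling the deployed model is then an ordinary HTTPS request. Here is a sketch using only the standard library, where the host name `leads.company.com`, the `/predict` path, and the payload fields are hypothetical:

```python
import json
import urllib.request


def build_prediction_request(base_url, lead):
    """Build (but do not send) a JSON POST request for one lead."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=json.dumps(lead).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, lead, timeout=5):
    """Send the request and decode the JSON response body."""
    request = build_prediction_request(base_url, lead)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (requires a live service at this hypothetical address):
# predict("https://leads.company.com", {"visited_pricing_page": True})
```

Because the caller only knows the DNS name, the model can later move to a different host without any change to the website's code.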
While this approach works for deploying models, there are a number of drawbacks. As you can see, the process is quite manual. The data scientist must copy source code and dependencies between the development and deployment environments and then make sure everything works. This is not only slow but error-prone, as there is no guarantee that the code that works on the data scientist’s workstation will run on a production server. Further, the data scientist (or IT via a support ticket) must install web servers, configure DNS entries, and generate certificates, thereby increasing the amount of time required to deploy a model and making frequent model updates (and hence subsequent deployments) time-consuming.
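One way to catch this kind of environment drift early is to compare the packages installed on the server against the versions used in development. Below is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the pinned versions shown are made-up examples, not the requirements of any real project.

```python
from importlib.metadata import PackageNotFoundError, version


def find_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Hypothetical pins recorded on the data scientist's workstation.
DEV_PINS = {"pandas": "0.23.4", "scikit-learn": "0.20.0"}

for package, (expected, installed) in find_mismatches(DEV_PINS).items():
    print(f"{package}: expected {expected}, found {installed or 'not installed'}")
```

A check like this only reports the problem; someone still has to fix the server by hand, which is the gap the container-based approaches aim to close.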
Docker & Kubernetes For Deployment
Deploying our model as a REST API using Docker and Kubernetes eliminates many of the challenges around reproducibility and speed.
The first step is to put our model inside a Docker container. To borrow from the old Postal Service adage that “if it fits, it ships,” you can think of a Docker container as a portable box. Put your working code inside of a box and then move that box to other computers and it will work.
This makes things easier for us, as we now only have to ensure our target deployment environment has Docker installed rather than making sure it has every dependency our project needs. So, step one is to build a Docker image that packages our model and everything it needs to run. We’ll then push this image to a Docker image registry (think of this as GitHub for Docker) so that any external system can pull it and start a container from it.
While Docker helps with reproducibility, Kubernetes makes deploying our Docker containers easy. Kubernetes is a container orchestrator. You can think of it as one of the cranes that move containers on and off ships at a port. Kubernetes manages lots of containers, moving them between servers, restarting them if they fail, and load-balancing traffic between them (among other things).
So in our case, we install Kubernetes on a cloud provider’s infrastructure. To make things even easier, we might use a managed Kubernetes service like Google Kubernetes Engine (all the cloud providers offer something similar). Once we have installed Kubernetes, we then create a few configuration files that do things like tell Kubernetes where to find the Docker container we want Kubernetes to deploy and what port we want to serve our model on.
We then run a Kubernetes command (using the Kubernetes kubectl command-line tool) to deploy our model on Kubernetes. And then we conclude our work by configuring DNS entries and certs just as we did before.
While Docker and Kubernetes make deployment faster and easier than the manual effort shown before, there is still a decent amount of work a data scientist must do to get their models deployed. And of course, one must also have Docker and Kubernetes knowledge to do so. Given the myriad tools data scientists must already be familiar with to do their jobs, it’s not inconceivable that they can learn new tools like Docker and Kubernetes, but it’s not something that everyone wants to do.
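To give a sense of what those configuration files contain, the sketch below builds a Kubernetes Deployment manifest as a Python dictionary and prints it as JSON, a format kubectl accepts alongside YAML. The deployment name, image reference, port, and replica count are all hypothetical, and a real setup would typically also need a Service to route traffic to the containers.

```python
import json


def build_deployment(name, image, port, replicas=2):
    """Return a Kubernetes apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # how many copies of the container to run
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,  # where to find the Docker image
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical registry path and port, for illustration only.
manifest = build_deployment("lead-model", "registry.example.com/lead-model:1.0", 8000)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file such as `deployment.json` and running `kubectl apply -f deployment.json` would then ask the cluster to create the deployment.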
Ideally, data scientists could focus on what they do best: analyze data in the context of pressing business concerns and build models that generate value. Rather than concern themselves with WSGI servers or kubectl commands, data scientists could focus on building better models and diving deeper into core business needs. Anaconda Enterprise is designed to help data scientists do exactly that.
Anaconda Enterprise Deployment
Contrast the approaches above with how easy Anaconda Enterprise makes deploying data science projects. With Anaconda Enterprise, data scientists can deploy anything they write in code with a single click. While most users deploy Python and R projects like live, runnable Jupyter Notebooks, interactive dashboards and web applications, and machine learning models as REST APIs, Anaconda Enterprise will deploy anything that the user serves up on localhost and port 8086.
Here’s how it looks from the data scientist’s perspective. The data scientist writes model-serving code (in our example, a Flask app, but you are limited only by your imagination). Using an Anaconda Enterprise GUI or command-line tool, or by editing a single configuration file directly, the user creates a deployment command that tells Anaconda Enterprise what the user wants to deploy. Sticking with our Flask app example, that deployment command might be something like `python app.py`.
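For that command to work, `app.py` has to start a server on the expected port. Below is a bare-bones sketch that uses the standard library's `wsgiref` server in place of Flask. The placeholder app and the choice to bind on all interfaces are assumptions, so check the platform documentation for the exact host it expects.

```python
import json
from wsgiref.simple_server import make_server

PORT = 8086  # the port named above for deployed projects


def app(environ, start_response):
    """Placeholder WSGI app; a real project would return model predictions."""
    body = json.dumps({"status": "ok"}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def create_server(host="0.0.0.0", port=PORT):
    """Bind the app to host:port without starting the request loop."""
    return make_server(host, port, app)
```

Ending `app.py` with `create_server().serve_forever()` would start handling requests. The `wsgiref` server is meant for development and demonstration, so a Flask or Gunicorn-based entry point would be the more realistic choice.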
Now the user needs only to press a big green “deploy” button and let Anaconda Enterprise take care of the rest. Anaconda Enterprise will automatically pull down the project’s source code, install the appropriate dependencies, start the project, and create a dynamic URL with TLS certs. All of the boring infrastructure work is done for you.
Even better, users still benefit from the reproducibility and scalability of Docker and Kubernetes. Anaconda Enterprise uses both as core services of the platform (Anaconda Enterprise itself runs on Kubernetes). When the user clicks deploy, Anaconda Enterprise spawns a Docker container and deploys it with Kubernetes. Users don’t have to interact with Docker and Kubernetes directly but they still receive the benefits of both.
Users can create as many deployments as they want and they have full reproducibility. If you want to roll back and deploy a model from six months ago, it is as easy as selecting that revision and clicking Deploy.
As a user, I don’t have to worry about keeping multiple environments in sync, which lets me work faster and quickly deploy the latest versions of models and make them available to end users. Similarly, as an admin, I no longer have to worry about provisioning infrastructure, load-balancing, or even creating DNS entries because Anaconda Enterprise does that for me.
If you’re interested in deploying data science projects but don’t want to get bogged down in the infrastructure, reach out to Anaconda to learn more about how Anaconda Enterprise can help you.