Marc Andreessen famously opened an August 2011 blog article with this provocative sentence: “Software is eating the world.” His prediction was that software development would disrupt traditional industries. Indeed, companies like Airbnb, Netflix, and Uber emerged as just a few of many winners in the “on-demand” economy that disrupted industries like travel, entertainment, and shopping in significant and lasting ways.
About a year later, in October 2012, Harvard Business Review reported that data scientist was the “sexiest job of the 21st century,” promising professionals who could “coax treasure out of unstructured data.” And the race to structured data began, with organizations taking a closer look at their messy data and finding ways to make it more consumable by machines.
Fast-forward seven years to October 2019, and McKinsey Global Institute offered exciting and cautionary words about the “coming of AI spring.” Their research showed hundreds of business cases that, combined, had the potential to create between $3.5 trillion and $5.8 trillion in value annually. As organizations applied artificial intelligence, they found it could yield outsized business value. Data science capabilities emerged as a prerequisite for high-performing AI, so organizations increased their investments in technologies, data science teams, and techniques like machine learning and deep learning.
In August 2022, Stable Diffusion, a text-to-image model built with Python and deep learning, rocked the visual arts world by generating detailed images from text prompts. It stoked the world’s fascination with AI and unleashed its next wave: generative AI.
Three months later, in November 2022, OpenAI released another generative AI product: ChatGPT, a chatbot built on OpenAI’s GPT-3.5 (and later GPT-4) large language models (LLMs) that generates text based on prompts from the user. In a short time, LLMs have taken many industries by storm, with new products and capabilities that make it possible for programmers to write and debug code alongside a machine and for writers to work with AI to produce content, to name just two of a growing number of use cases.
Finding an Enterprise Platform
With all of this progress, it would seem that every organization should somehow incorporate these new technologies into their research, products, and operations. However, applying these techniques to their fullest potential requires a set of fully featured tools, clean and structured data, expert teams, and the power of open-source software, backed by an engaged community of makers and maintainers.
Leveraging the power of open-source software across an enterprise organization requires capabilities for building and deploying secure Python solutions. A burgeoning number of options is available for stitching together tools that enable teams to collaborate and build powerful applications with data science and machine learning. But bolting together tools to deploy predictive models into production is not the best way to create a platform that your organization can rely on to deliver excellent outcomes. And building your own can be expensive and complex, because you’ll need to maintain the platform you create.
Finding an enterprise platform that can provide the open-source packages you need, the managed environments that allow you to reproduce and scale models in production, and the security tools to protect your organization from bad code and bad actors can be a tough challenge. That’s what this guide is all about—exploring what to consider when you are selecting an enterprise platform to use with Python and open-source software to achieve your organization’s development goals.
What makes a platform?
One popular description of a platform comes from Microsoft co-founder Bill Gates, as paraphrased by Chamath Palihapitiya: “A platform is when the economic value of everybody that uses it, exceeds the value of the company that creates it.” As you evaluate platforms, consider these basic characteristics that will help you leverage the innovation of the community as your teams develop and deploy applications using open source with Python:
- Number of individual users: The more users, the more opportunities there are to discover new techniques shared by others, identify and address security risks faster, and benefit from a rich community of software makers and maintainers.
- Number of enterprise users: The more enterprise users, the more the platform has been tested at scale. Number of users may be expressed as a percentage of a total group of organizations or businesses, such as the Fortune 500.
- Years of experience: The longer an organization has been working to develop their platform, the more expertise their team is likely to have across tools, techniques, and use cases.
- Cross-industry customers: The more industries in which a platform has been applied, the more integrations, use cases, and data types the platform and supporting team have likely encountered.
Python, Open Source, Data Science, and the Enterprise
Data science has revolutionized the way businesses operate. Today, it seems that everyone is working with data in some capacity, whether it’s analyzing customer behavior, building predictive models, or creating generative models. As the demand for data-driven insights continues to grow, Python has emerged as the go-to language for data science work.
In fact, Python has long been the gold standard for data science work, thanks in large part to its simplicity and versatility. Unlike other languages, Python allows users to easily manipulate and analyze data, making it an ideal choice for everything from data visualization to machine learning. Additionally, the availability of numerous open-source libraries and frameworks ensures that Python remains a popular choice for data scientists.
Open-source software provides developers access to a global network of contributors who are constantly updating and improving code, making it possible for companies to create applications much faster and more efficiently than ever before. The vast majority (96%) of codebases contain open-source software, according to the Synopsys 2023 Open Source Security and Risk Analysis (OSSRA) report.
The widespread application of OSS makes sense; open source not only helps companies save on licensing costs, but also allows them to leverage the collective knowledge of the open-source community to create customized solutions to meet their specific business needs. As a result, open source has become an essential strategic tool for organizations looking to stay ahead in the fast-paced world of technology.
However, managing Python development in enterprise organizations has become more complex and difficult over the past few years. This is due in part to the rapid pace of development within the Python community, which has led to the release of new tools and technologies on a regular basis. Some of these tools are proprietary, and some are open source. While this is ultimately a positive development for teams that work with data, it can make it challenging to keep up with the newest techniques and best practices.
Despite these challenges, Python remains one of the most powerful and versatile tools available for data science work. As the industry continues to evolve, Python will remain a critical component of any successful data science team’s toolbox.
Enterprise Python Challenges
At Anaconda, we speak with organizations around the world who are working with Python. We find that most of these teams are experiencing similar challenges, and they are attempting to solve them in similar ways.
1. Package Management and Build Environments
For busy enterprise teams, managing packages and build environments is a significant challenge. Many teams manage packages manually, which gives them control over each package and over how environments are customized. However, this approach is time-consuming and error-prone. It can also lead to inconsistent environments and a lack of oversight for data protection and governance of resources.
Other teams use proprietary third-party package management tools, which can streamline package management and provide off-the-shelf functionality. However, these tools are not suited to Python workflows. They offer limited customization and force you to rely on vendors to build out the tool to meet your business needs.
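One source of the inconsistent environments described above is drift between the versions a team has agreed on and what is actually installed on a given machine. The following is a minimal sketch of a drift check that uses only Python's standard library. The `find_drift` helper and the example pins are hypothetical names chosen for illustration; they are not part of any tool named in this guide.

```python
from importlib import metadata


def find_drift(pins):
    """Return {package: (expected, installed)} for every pin that does not match.

    `pins` maps a distribution name to the exact version the team agreed on.
    A package that is not installed is reported with an installed value of None.
    """
    drift = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            drift[name] = (expected, installed)
    return drift


if __name__ == "__main__":
    # Hypothetical pins; replace with the versions your team has standardized on.
    example_pins = {"pip": "0.0.0", "a-package-that-is-not-installed": "1.0"}
    for name, (expected, installed) in find_drift(example_pins).items():
        print(f"{name}: expected {expected}, found {installed}")
```

A check like this only reports mismatches. It does not resolve dependencies or repair the environment, which is the part that manual package management makes time-consuming.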
2. Collaboration and Deployment
Project collaboration is an important part of building and scaling great models, but reproducing a colleague’s work is a formidable challenge, especially for large teams. Most teams collaborate in a fractured way, with models living on individual machines, which leads to the phrase often heard among data scientists and data engineers: “It works on my machine.”
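One lightweight way to tell whether two machines are running comparable environments is to compare a fingerprint of each one. The sketch below hashes the Python version together with the installed package names and versions. The function name `environment_fingerprint` is hypothetical, and the approach is an illustration under simple assumptions: it ignores operating system libraries, build variants, and hardware, all of which can also cause "works on my machine" failures.

```python
import hashlib
import platform
from importlib import metadata


def environment_fingerprint(packages=None):
    """Return a SHA-256 hex digest summarizing the Python version and packages.

    `packages` is an iterable of (name, version) pairs. When omitted, the
    distributions installed in the current environment are used.
    """
    if packages is None:
        packages = [
            (dist.metadata["Name"], dist.version)
            for dist in metadata.distributions()
        ]
    # Normalize names and sort so that ordering and casing do not matter.
    lines = sorted(f"{str(name).lower()}=={version}" for name, version in packages)
    payload = "\n".join([platform.python_version(), *lines])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(environment_fingerprint())
```

If two teammates run this and get different values, at least one package version or the Python version differs between their environments.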
When it comes to deployment, manual processes give you more control over your pipeline but, like manual package management, are time-consuming and prone to errors and scalability issues. Building your own infrastructure for deployment allows you to customize and also gives you more control, but you may see lower return on investment due to high development and maintenance costs.
There are easy-to-use machine learning platforms with off-the-shelf functionality and some support, but these can be highly restrictive compared to open-source software, with limited customization options. They also can be quite expensive.
3. Governance and Securing the Open-Source Pipeline
A trusted source for your open-source packages has never been more important. The March 2023 National Cybersecurity Strategy and frameworks from the National Institute of Standards and Technology (NIST) show that the burden of security is shifting to organizations and individuals who develop software.
Manual security audits can help you meet minimum regulatory requirements and identify some security risks. However, they, too, are time-consuming and resource-intensive, and they put your organization in a reactive position. In-house security training can increase awareness and promote good practices, but it is not sufficient on its own.
Third-party scanning tools are often easy to use and, like some machine learning platforms, offer off-the-shelf functionality and some support. However, these tools are not suited to Python workflows, throw a high rate of false positives, and can mishandle compiled packages.
The Top Features to Look for in an Enterprise Python Platform: A Buyer’s Checklist
An enterprise platform should be flexible enough to meet your needs today and powerful enough to withstand the demands of your future workloads and projects. You can use this checklist as you evaluate enterprise platforms for Python and open-source software.
1. Data Integration
|Integration||Integration is possible||Integration is not possible|
|Code repositories (Git, Bitbucket)||✅|
|Data lake support||✅|
|Hadoop (Cloudera, Hortonworks, EMR)||✅|
|Monitoring solutions (log shipping)||✅|
|Proprietary databases (SAS, Teradata)||✅|
|Web data integration||✅|
2. Infrastructure and Hardware
|Infrastructure||Supported, and air gapped is an option||Supported but not air gapped||Not supported|
|Domino Data Lab MLOps||✅|
|Oracle Cloud Infrastructure (OCI)||✅|
|Snowpark for Python||✅|
|On premises (vSphere)||✅|
|On premises (bare metal)||✅|
|GPU and CPU support||✅|
3. Machine Learning Capabilities
|Classification & regression||✅|
|Generative adversarial networks (GANs)||✅|
|Pre-trained large language models (LLMs)||✅|
|Support vector machines (SVMs)||✅|
|Testing strategies (A/B, multi-armed bandit, sensitivity analysis)||✅|
|Text & image analytics and processing||✅|
4. Collaboration and Deployment
|Centralized project hub||✅|
|Deploy with one click||✅|
|Deploy REST API||✅|
|Governance controls for collaboration and deployment||✅|
|Job scheduler / automation||✅|
|Visualizations and dashboards||✅|
5. Support
|Dedicated support contacts||✅|
|Guaranteed uptime SLA||✅|
|Advanced troubleshooting support||✅|
|Assistance with Anaconda package management||✅|
|Custom conda package builds||✅|
|Custom installer builds||✅|
|Environment management issues||✅|
|Learning: Live and on-demand||✅|
|Repository access during high demand||✅|
|Severity response: Level 1||12 hours, standard; 1 hour, premium|
|Severity response: Level 2||24 hours, standard; 12 hours, premium|
6. Security and Governance
|Feature||Included||Can integrate||Not possible|
|Administrative monitoring (track users, projects, deployments)||✅|
|Cloud-native security controls||✅|
|Package signature verification||✅|
|Role-based user access controls||✅|
|Scanning for common vulnerabilities and exposures (CVEs)||✅|
|Secure package repository||✅|
|Software bill of materials (SBOM)||✅|
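As a small illustration of what sits behind checklist items such as a secure package repository, the sketch below checks a downloaded package file against a SHA-256 digest published by a trusted source. This is an integrity check only, not cryptographic signature verification, which additionally requires the publisher's key. The helper names `sha256_of` and `matches_expected` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True when the file's digest equals the published digest.

    Surrounding whitespace and letter case in `expected_hex` are ignored.
    """
    return sha256_of(path) == expected_hex.strip().lower()
```

In practice, the expected digest would come from the repository's metadata or a lock file rather than being typed in by hand, and a mismatch should stop the installation.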
COLLABORATION AND TOOLS
1. Notebooks and Integrated Development Environments (IDEs)
|Jupyter Notebook||Creating and sharing computational documents||✅|
|JupyterLab||Web-based interface for Jupyter||✅|
|PyCharm||IDE for programming in Python||✅|
|RStudio||IDE tools for Python and R||✅|
|Spyder||Development environment for scientific programming in Python||✅|
|Visual Studio Code (VS Code)||Source-code editor for debugging, snippets, code refactoring, and more||✅|
2. Data Visualization Capabilities
|Allows users to choose their favorite plotting library (e.g., Bokeh, hvPlot, Matplotlib, Plotly)||✅|
|Supports fully interactive visualizations||✅|
|Supports visualizing very large (e.g., petabyte-scale) datasets||✅|
|Supports visualization in Jupyter or as stand-alone applications||✅|
3. Data Science and Machine Learning Libraries
Anaconda gives you access to thousands of libraries. We name just a few of the most common libraries below to help you compare your options.
|Dask||Parallel and distributed computing||✅|
|Django||Python web framework for building web applications||✅|
|Keras||Deep-learning framework (API for TensorFlow)||✅|
|Kubeflow||ML workflows on Kubernetes||✅|
|NumPy||Mathematical operations on arrays||✅|
|Pandas||Work with data sets—analyzing, cleaning, exploring, and manipulating data||✅|
|Prophet||Time-series forecasting in Python||✅|
|PyTorch||Develop and train deep learning models||✅|
|SciPy||Scientific and technical computing (built on NumPy)||✅|
|Scikit-learn||ML library for classification, regression, and clustering algorithms||✅|
|TensorFlow||Develop and train ML models||✅|
|Theano||Mathematical expressions involving multi-dimensional arrays (built on NumPy)||✅|
|XGBoost||Distributed gradient boosting library||✅|
4. Model Deployment and Management
|Deployment from QA||✅|
|Deployment to production||✅|
|One-click deployment to pre-provisioned resources||✅|
|Refine models in production||✅|
|Reproducibility—rollback to older models||✅|
|Centralized administration of deployed apps||✅|
Anaconda’s Platform Makes Innovation Possible
For more than a decade, industry leaders have been using Anaconda’s platform to build some of the world’s most innovative predictions, products, and experiences. Data science and machine learning teams count on our trusted packages and capabilities to centralize open-source software access and empower consistent, reproducible workflows.
Enterprise practitioners use our platform to collaborate across users and teams, centralize workflows for better reproducibility and scalability, and deploy models into production with just one click.
IT administrators and security teams choose Anaconda because it is the only platform in the Python ecosystem with access to thousands of packages that—unlike those from community package providers—are privately hosted, built from source, and free from malicious packages.
Finally, you can deploy Anaconda in the cloud or on premises—with private cloud, managed hosting, and air-gapped options—making Anaconda the platform of choice for those working in highly regulated industries and/or with sensitive or protected data. With Anaconda, peace of mind evolves from fantasy to reality.
Ready to learn more about how Anaconda can help your teams build and deploy secure Python solutions, faster? Book time with one of our experts to discuss your organization’s requirements.