Open Source

Tools and libraries for data science, machine learning, and AI

Support OSS

You can help support ongoing innovation on projects in the open-source community. Your donation goes directly to NumFOCUS and supports their work.

Get Started

Many of the most commonly used open-source data science and machine learning packages are installed automatically when you download Anaconda Distribution, and thousands of others are available to install.

 

The Fundamentals

Jupyter

Jupyter is an open-source project created to support interactive data science and scientific computing across programming languages. Jupyter offers a web-based environment for working with notebooks containing code, data, and text. Jupyter notebooks are the standard workspace for most Python data scientists.

Pandas

A library for tabular data structures, data analysis, and data modeling tools, including built-in plotting using Matplotlib. pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis with Python.
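As a rough illustration of the tabular analysis described above, here is a minimal sketch. The column names and values are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston"],
    "sales": [10, 20, 5, 15],
})

# Group rows by city and compute the total and average sales per city.
summary = df.groupby("city")["sales"].agg(["sum", "mean"])
print(summary)
```

The result is itself a DataFrame indexed by city, so it can be filtered, joined, or plotted like any other table.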

SciPy

The SciPy library consists of a specific set of fundamental scientific and numerical tools for Python that data scientists use to build their own tools and programs. It provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.
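Two of the routines mentioned above, numerical integration and optimization, might be used as in this small sketch. The integrand and the objective function are arbitrary choices made for the example.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize (x - 3)^2, whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

print(area, result.x)
```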

NumPy

A core package for scientific computing with Python. NumPy provides an n-dimensional array type and basic operations on arrays. It is used for indexing and sorting, and also for linear algebra and other numerical operations. Many other data science libraries for Python are built on NumPy internally, including pandas and SciPy.
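The operations named above (array creation, indexing, sorting, and linear algebra) can be sketched in a few lines. The numbers are arbitrary and chosen only so the results are easy to check by hand.

```python
import numpy as np

# Build a small 2-D array (values chosen only for illustration).
a = np.array([[3.0, 1.0], [2.0, 4.0]])

# Indexing: the first row, and the element in row 1, column 0.
first_row = a[0]
element = a[1, 0]

# Sorting: np.sort sorts along the last axis by default, so each row is sorted.
sorted_rows = np.sort(a)

# Linear algebra: solve the system a @ x = b for x.
b = np.array([5.0, 10.0])
x = np.linalg.solve(a, b)

print(sorted_rows)
print(x)
```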

Machine Learning

Keras

Keras is a high-level neural networks API, written in Python. Current releases can run on top of TensorFlow, JAX, or PyTorch; earlier releases also supported CNTK and Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

TensorFlow

TensorFlow is an end-to-end open-source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.

PyTorch

An open-source deep learning framework using GPUs and CPUs that consists of fundamental tools and libraries for Python AI and machine learning development.

scikit-learn

A powerful and versatile library for machine learning basics like classification, regression, and clustering. It includes both supervised and unsupervised ML algorithms, along with important functions like cross-validation and feature extraction. scikit-learn is one of the most frequently downloaded machine learning libraries.
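Classification combined with cross-validation could look like the following sketch. The choice of the bundled iris dataset, logistic regression, and five folds is an assumption made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# A simple supervised classifier; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print(scores, mean_accuracy)
```

Because every estimator exposes the same fit and predict interface, another classifier can be swapped in without changing the cross-validation call.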

Data Visualization

Matplotlib

Matplotlib is the most well-established Python data visualization tool, focusing primarily on two-dimensional plots (line charts, bar charts, scatter plots, histograms, and many others). It works with many GUI toolkits and file formats, but has relatively limited interactive support in web browsers.
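A minimal sketch of a two-dimensional line plot written to a file format without any GUI is shown below. The data points are invented, and the non-interactive "Agg" backend is selected here only so the example runs on machines without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no GUI required
import matplotlib.pyplot as plt

# Illustrative data: the squares of a few integers.
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

Passing a file name such as "plot.pdf" to savefig instead of the buffer would write a different format, chosen from the file extension.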

Bokeh

Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets. Bokeh can help anyone who would like to quickly and easily make interactive plots, dashboards, and data applications.

Plotly

Plotly’s Python graphing library makes interactive, publication-quality graphs. It is a popular and powerful browser-based visualization library that lets you create interactive, JavaScript-based plots with Python.

HoloViz

HoloViz is an Anaconda project to simplify and improve Python-based visualization by adding high-performance server-side rendering (Datashader), simple plug-in replacement for static visualizations with interactive Bokeh-based plots (hvPlot), and declarative high-level interfaces for building large and complex systems (HoloViews and Param).

Dashboarding

Panel

Panel is an open-source Python library that lets you create custom interactive web apps and dashboards by connecting user-defined widgets to plots, images, tables, or text.

Dash

Dash is a productive Python framework for building web applications. Through a couple of simple patterns, Dash abstracts away all of the technologies and protocols that are required to build an interactive web-based application.

Voilà

Voilà turns Jupyter notebooks into standalone web applications. Unlike the usual HTML-converted notebooks, each user connecting to the Voilà Tornado application gets a dedicated Jupyter kernel, which can execute callbacks in response to changes in Jupyter interactive widgets.

Streamlit

Streamlit is an open-source Python library that makes it easy to build beautiful custom web-apps for machine learning and data science. It runs on a simple and powerful app model that lets you build rich UIs incredibly quickly.

Image Processing

Pillow

Pillow (a “friendly fork” of the older PIL library) is a Python imaging library and a general image processing tool with support for opening, manipulating, and saving images in many different file formats.
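The open, manipulate, and save cycle described above might look like this sketch. To stay self-contained it creates a solid-color image in memory rather than opening a file, and it saves to an in-memory buffer; the sizes and color are arbitrary.

```python
import io

from PIL import Image

# Create a small solid red RGB image (stand-in for an image opened from disk).
img = Image.new("RGB", (64, 32), color=(255, 0, 0))

# Manipulate: convert to 8-bit grayscale, then shrink to half size.
gray = img.convert("L")
thumb = gray.resize((32, 16))

# Save as PNG into an in-memory buffer, then reopen it to confirm the round trip.
buffer = io.BytesIO()
thumb.save(buffer, format="PNG")
buffer.seek(0)
reopened = Image.open(buffer)

print(reopened.format, reopened.size, reopened.mode)
```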

scikit-image

scikit-image is an open-source Python package containing a collection of image-processing algorithms, including segmentation, geometric transformations, color space manipulation, and feature detection. It uses NumPy arrays as image objects.

OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library with C++, Java, Python, and MATLAB interfaces. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.

Scalable Computing

Numba

Numba is a high-performance just-in-time compiler for Python. It speeds up numerical Python code, particularly code that operates on NumPy arrays, and can approach the speed of C or Fortran without an additional compilation step.

Dask

Dask is a Python package that scales NumPy and pandas workflows with parallel processing, letting users store and process multi-dimensional data larger than their computer’s RAM. Dask can scale out to clusters or scale down to a single computer. Dask mimics the pandas and NumPy APIs, making it more intuitive for Python data scientists.

RAPIDS

The RAPIDS data science framework is a collection of libraries for running end-to-end data science pipelines completely on the GPU. The interaction is designed to have a familiar look and feel to working in Python, but utilizes optimized NVIDIA® CUDA® primitives and high-bandwidth GPU memory under the hood.

Apache Spark

A fault-tolerant cluster computing framework, originally launched at UC Berkeley, that provides an interface for programming entire clusters. It was developed for the Java/Hadoop ecosystem but also supports Python; PySpark is the Python API for Spark.

Data Pipelines/ETL

Apache Airflow

An open-source workflow automation tool by Apache for creating data workflows, scheduling tasks, and monitoring results. It integrates with multiple cloud providers, including AWS, Azure, and Google Cloud.

Intake

A data ingest/loading library for a wide variety of file formats and data services, with hierarchical cataloguing, searching, and interactivity with remote storage platforms under a single interface.

Natural Language Processing

NLTK

An open-source Python natural language toolkit for symbolic and statistical NLP. It includes a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning in multiple languages.

Gensim

A Python library for topic modeling, document indexing, and similarity retrieval for large bodies of text with efficient multicore implementations of NLP algorithms.

spaCy

spaCy is an open-source Python library for NLP and one of the fastest syntactic parsers, if not the fastest. spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython.

Looking Ahead: AI Frontiers We’re Watching

ONNX

An open neural network exchange making machine learning models portable between frameworks and platforms. Microsoft and Facebook started this community in 2017 to create an open ecosystem for interchangeable models.

Fairlearn

A burgeoning project by open-source developers at Microsoft. Fairlearn is a Python package for assessing fairness and mitigating unfairness in ML models and AI systems.

AI Fairness 360 (AIF360)

A comprehensive open-source Python toolkit of metrics that check for and measure bias in datasets and ML models. It also includes algorithms to mitigate bias. This toolkit was developed by IBM’s open-source team.

InterpretML

An open-source Python package that makes it easy to compare algorithms for interpretability. It provides a “scikit-learn style uniform API” and includes an interactive visualization platform and dashboard so data scientists can compare algorithms with ease.

LIME

LIME is a model-agnostic interpretability tool, available as a package on PyPI. LIME explains individual predictions for text classifiers and for classifiers that act on tables or images. Support for scikit-learn classifiers is built into the tool.