Anaconda’s Q4 2022 Open-Source Roundup
Dec 13, 2022 | By Martin Durant
Substantial and impactful open-source innovation is at the heart of Anaconda’s efforts to provide tooling for developing and deploying secure Python solutions, faster. With the goal of capturing and communicating our teams’ many ongoing contributions to a wide variety of open-source projects, we are now providing regular roundups of related news items on our blog.
As usual, Anaconda’s open-source software (OSS) teams have been very active over the last few months! In this first edition of our new quarterly OSS roundup, I’ll highlight some of our biggest open-source contributions plus a couple of smaller but still very interesting efforts. I’ll also touch on what’s coming in the next few months.
Note: Please see this recent PyScript post for some updates on that particular project, as they will not be covered again here.
Highlights by Dev Group
Anaconda has many different teams working on open source, and each performs a wide variety of tasks. Below I will cover some of our core efforts and recent milestones. Please note that the split into bullets is merely for readability; in practice, many of us work across these divisions.
Dask and Data Access
The Awkward Array project provides vectorized (fast!) data processing for the data that just doesn’t fit into normal arrays or tables—nested and variable-length, “JSON-like” data—all with familiar NumPy syntax. Since this is squarely aimed at big data, it makes sense to want to do this processing in parallel and distributed on a cluster. Dask-awkward does exactly this, and will be released roughly concurrently with both this blog post and V2 of Awkward itself. The library brings the full Awkward API to big distributed data and is ready for general use.
Out of the same effort, we created awkward-pandas for cases where you have a mix of nested, variable-length data and ordinary flat columns. We bring the power and speed of Awkward into the pandas API with our own new extension type and convenient methods for converting to and from Python/pandas native types. This includes integration into the JSON and Parquet load/write mechanisms. This was released as alpha at PyData Global, and work on dask-awkward-pandas is ongoing.
The Intake library for data access and cataloging is mature and stable. Recently we’ve invested new effort into rejuvenating and pushing the project forward. Of particular note, the Intake graphical user interface (GUI) will soon offer new functionality—integration with hvPlot’s explorer and more interactivity when it comes to editing and building your data sources, plots, and catalog files dynamically.
fastparquet may be largely in maintenance mode, but we do still improve it. Version 2022.11.0 brings speed improvements for nullable types, schema evolution, and in-place metadata updates.
Following extensive involvement with python-graphblas, which brings optimized graph processing to Python, we’ve concentrated on rounding off the library and providing extensive documentation to the community.
While some users have not yet transitioned to JupyterLab (or the upcoming Notebook V7), Anaconda has stepped in to revive maintenance of the “classic” Jupyter Notebook codebase. This has enabled several new security and bug fix releases of Jupyter Notebook 6.x.
Please see this blog post on the release of Jupyter Notebook 6.5, which serves as a transitional point on the way to Notebook V7 and includes many updates and stability fixes.
In order to enable longer-term support of the classic Notebook, the frontend code has been moved into the nbclassic package, which can coexist in environments along with JupyterLab and the future Notebook V7.
The team has been engaging the wider Jupyter community to understand their needs, and will be looking in 2023 at documentation and features to ease the transition of extensions and extension authors to the new JupyterLab-based system.
Finally, as an interesting point for technical folks, we converted all of the Selenium-based tests in nbclassic to Playwright, which was a major development to pull off, but has increased the reliability of the test suite significantly.
Bokeh 3 was a major release of the low-level interactive graphics library on which the whole HoloViz stack relies. Of particular note, the layout system was rewritten to reuse more modern browser primitives rather than handling sizing and placing internally, which should result in better interoperability with other graphical components in a page/app.
The Bokeh changes allowed for (and required) work across the whole related stack, which resulted in the following releases over the last few weeks:
I’d particularly like to highlight Panel 0.14, which fully integrates with PyScript so you can run interactive Python data visualization applications without a server. It also provides the explorer (the same one used in conjunction with the aforementioned Intake GUI) for interactively building views of dataframe data.
Conda has become much more open and community driven this year. Note, for example, the enhancement proposal process, and continued conversation and collaboration with mamba.
Conda moved to calendar versioning and a regular release cycle starting with version 22.x.
Plugins are now supported by conda’s architecture, so developers can create and offer new functionality without adding code to the main repo.
The main artifact used by conda has moved to V2 “.conda” files, with greatly improved download and unpacking speed.
BeeWare is a collection of tools for writing Python applications that can run with native look and feel on mobile, desktop, and web platforms. The BeeWare project maintains its own blog with monthly updates, roadmaps, and other news. A very quick glance makes it clear there’s been plenty of recent activity, particularly around Briefcase, the build/deploy system.
Binary packages have arrived for Android and iOS, so now you can include popular libraries like NumPy and Matplotlib in your mobile Python app.
Build systems for Python 3.11 were ready as soon as it was available.
A complete rewrite of the testing infrastructure is now in progress. Briefcase now has the ability to run a test suite inside the app simulator environment, and this capability is being used to implement a comprehensive, cross-platform test suite for the Toga GUI toolkit.
Numba is a just-in-time (JIT) compiler for Python code optimized for running numerical algorithms on CPU and GPU backends. Much work was done this quarter to support Python 3.11 and upgrade to LLVM 14. These tasks are ongoing, but should be landing in the development branch soon.
In preparation for a big push in 2023 to modularize Numba for easier reuse in other projects that need compiler components, we’ve been moving forward on several proof-of-concept efforts. We will see these components folded into Numba or potentially new projects over the coming year.
The Numba team has been doing a major rewrite of the bytecode analysis frontend to better handle the rapidly evolving bytecode changes that come with each minor release of Python. This work should help us to roll out Numba updates for new Python releases faster, and also enable other compiler enhancements in the future. Look for this to land in Numba sometime in Q1 2023.
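To get a feel for why bytecode changes matter to a compiler frontend, here is a small sketch using only the standard library's dis module. It is not Numba code; it simply lists the instructions CPython generates for a trivial function (the function name add is just an illustration). Running it under different Python versions gives different instruction names for the same source.

```python
import dis
import sys


def add(a, b):
    return a + b


# Collect the opcode names CPython generated for this function.
# The exact names depend on the interpreter version: for example, the
# addition is BINARY_ADD on Python 3.10 and earlier, and BINARY_OP from
# Python 3.11 onward.
opnames = [ins.opname for ins in dis.get_instructions(add)]

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name in opnames:
    print("  ", name)
```

Any tool that reads bytecode directly has to account for this kind of difference in every new Python release.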
We have also been hard at work continuing to improve internal usage of Numba’s extension APIs, which has enabled improvements to the compute unified device architecture (CUDA) target that both increase functionality and reduce the size of the code. This work should also allow for more consistent math behavior in the future.
spatialpandas to Awkward
spatialpandas is a library for working with geometric objects as one column of a pandas dataframe, with other normal columns and a view for aggregations and visualization. Following our work on awkward-array for Dask and for pandas, we realized that spatialpandas could make use of these tools. In particular, polygons and lines can be represented as variable-length arrays of points, with each point made of two or more numbers. This is exactly the kind of data structure that Awkward deals with. Preliminary experiments show that we can swap out a lot of complex ad hoc legacy code from spatialpandas in favor of calling well-tested modern code in Awkward (via awkward-pandas), and get a decent speed boost from the change too. This is a very nice example of using our own up-and-coming tools for a variety of use cases, and we will be developing this functionality further in the coming months. Watch this space!
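As a rough illustration of the "variable-length arrays of points" idea, the sketch below stores several line geometries in flat coordinate buffers plus an offsets list, which is similar in spirit to how list-type data can be laid out in memory. This is plain Python, not the Awkward or spatialpandas API, and the helper names to_ragged and path_lengths are made up for this example.

```python
import math

# Three line geometries with 2, 3, and 1 points respectively.
lines = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(1.0, 1.0), (1.0, 2.0), (2.0, 2.0)],
    [(5.0, 5.0)],
]


def to_ragged(geoms):
    """Flatten geometries into x/y buffers plus offsets marking each geometry."""
    xs, ys, offsets = [], [], [0]
    for geom in geoms:
        for x, y in geom:
            xs.append(x)
            ys.append(y)
        # Each offset records where the next geometry starts in the flat buffers.
        offsets.append(len(xs))
    return xs, ys, offsets


def path_lengths(xs, ys, offsets):
    """Compute the length of each line from the flat buffers."""
    lengths = []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        total = 0.0
        for i in range(start, stop - 1):
            total += math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
        lengths.append(total)
    return lengths


xs, ys, offsets = to_ragged(lines)
lengths = path_lengths(xs, ys, offsets)
print(offsets)
print(lengths)
```

The Python loops here are only for clarity. The point of a library like Awkward is that operations over such buffers run as compiled, vectorized code rather than per-element Python.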
Kerchunk is a library for making virtual datasets out of many other datasets of several possible formats, and providing the benefits of cloud-native data access without copying or reformatting the original files. It’s been around for a little while, but this quarter it received renewed attention and effort, so we were able to surface additional features such as:
Consolidating many nearby reads within a target file to reduce the number of calls
Scanning files with a tree-reduction scheme using Dask
Coordinate creation utility for geotiff
Automatic extraction of smaller chunks when the target is not compressed
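The first item in that list, consolidating nearby reads, can be sketched in a few lines. The idea is that when several requested byte ranges in a file sit close together, it is often cheaper to fetch one slightly larger range than to make many small calls. The function name coalesce_ranges and the max_gap parameter are assumptions for this illustration and are not taken from the Kerchunk codebase.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (start, stop) byte ranges whose gaps are at most max_gap bytes."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate read.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(r) for r in merged]


# Four requested reads, three of which are close together.
requests = [(0, 100), (120, 200), (5000, 5100), (150, 260)]
plan = coalesce_ranges(requests, max_gap=64)
print(plan)
```

In this sketch, four requests become two reads. The trade-off is that a larger max_gap means fewer calls but more unneeded bytes transferred.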
With this new push, expect a lot more news regarding this project over the next six months.
See you next quarter!
About the Author
Martin Durant is a former astrophysicist with several years of scientific research experience. He has also worked in medical imaging, building AI/ML pipelines and a research platform. After a brief stint as a data scientist in ad-tech, Martin moved to Anaconda to work on PyData education. He now leads a number of open-source PyData projects, focussing on data access, formats, and parallel processing.