Data science and related fields have been born in and pushed forward by open-source projects. Open-source communities allow people to work together to solve larger problems. As stewards of the data science community, we believe it is important to go behind the lines of code to shine a light on those doing the work in open source. In a series of blog posts, we’ll highlight several Anaconda employees, the open-source projects they work on, and how their work is making an impact on the larger field.
Q: What is Numba?
Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code. Its primary purpose is to speed up Python code written by data scientists. Python is an interpreted language, so it is not as efficient as a native, compiled language. Programmers who are familiar with languages like C++ can write a fast program. Data scientists, however, may not be familiar with these native, compiled languages, and they are more focused on the science of their projects than on the programming. Python is a good language for data scientists because it makes their ideas easier to express.
With Numba, data scientists can easily write a program, and basically all they need is one API: the @jit decorator. For a subset of Python, Numba can turn the code into fast machine code, as if someone had written it in C++.
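As a rough illustration of the @jit decorator described above, here is a minimal sketch. The function name `sum_of_squares` and the sample data are made up for this example and do not come from the interview. The try/except fallback is also an assumption, added so the sketch still runs as ordinary, uncompiled Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: without Numba, fall back to a no-op
    # decorator so the code still runs as plain (slower) Python.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True)  # ask Numba to compile this function to machine code
def sum_of_squares(values):
    # An explicit loop like this is slow in the interpreter but is the
    # kind of NumPy-plus-loops code Numba is designed to compile.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(100_000, dtype=np.float64)
result = sum_of_squares(data)  # compiled on the first call when Numba is present
```

The function body is unchanged Python; only the decorator is added. With Numba installed, the first call includes compilation time, and later calls with the same argument types reuse the compiled code.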
Q: Are data scientists the primary users of Numba?
Initially, that was the thought. But it’s now evolving into something that other Python library writers build into their libraries so that their code can be sped up.
Users can write user-defined functions, and a library can use Numba to compile that user code together with its own internal code into one fast program. Numba is probably the only CPython project that allows a library to just-in-time compile user-defined functions with its internal code to produce a fast program without changing the interpreter. It’s also the only Python compiler project that supports graphics processing units (GPUs).
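The library pattern described above might look something like the following sketch. Everything named here (`make_elementwise`, `kernel`, `user_transform`) is hypothetical and stands in for a real library's internals. The fallback decorator is again an assumption so the sketch runs without Numba installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: no-op decorator when Numba is missing.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


def make_elementwise(user_func):
    """Hypothetical library entry point: compile a user function
    together with the library's own loop."""
    compiled_user_func = jit(nopython=True)(user_func)

    @jit(nopython=True)
    def kernel(values):
        # Library-internal loop; the call to the user's function is
        # compiled as part of this kernel rather than interpreted.
        out = np.empty_like(values)
        for i in range(values.shape[0]):
            out[i] = compiled_user_func(values[i])
        return out

    return kernel


# User side: an ordinary Python function, with no Numba imports needed.
def user_transform(x):
    return x * x + 1.0


fast_map = make_elementwise(user_transform)
result = fast_map(np.array([0.0, 1.0, 2.0]))
```

The user never touches the compiler directly. The library decides where to apply Numba, so the user's function and the library's loop can be optimized together.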
Q: What is your role within Numba?
I’ve been working on Numba for eight years at Anaconda. Before that, I was working on a similar project that got absorbed into Numba. I’m currently the team lead, so I work on the project and manage a team responsible for it. At Anaconda, we have a team of three working on Numba.
I am also the leader on the community side. In the open-source community, hardware vendors have assigned developers to work with us. There are also other companies, many of them finance-related, that are involved in the development of Numba. They contribute, and they help answer community questions.
This year, I’ve been working on a governance model that will let the community grow. Our project has grown a lot, especially in the last two years, almost to a size that the three of us at Anaconda are unable to keep up with. We are looking at involving more hardware vendors so they can contribute more and help with the daily maintenance tasks.
Q: What contribution to this project are you most proud of?
A compiler is connected to everything. So, I’m proud of being able to keep up with the complexity of the compiler and of the amount of testing we put into this project to ensure it is stable for users. Numba is used in critical infrastructure like risk analysis for financial institutions. A bug in Numba means an incorrect result, and that can ruin some very important and expensive projects.
Q: What use cases for your project do you find the most interesting or surprising? What software does Numba enable?
In 2020, Numba has averaged 466k Conda package downloads per month.
As I mentioned, we work with hardware developers from companies like Intel and NVIDIA. Pandas and Datashader both use Numba, so every user of these libraries is indirectly using Numba. Another important user of Numba from the science community is the Awkward Array project, which is sponsored by the National Science Foundation and DIANA/HEP. DIANA/HEP develops software for processing petabyte-sized data related to CERN’s Large Hadron Collider (LHC).
Bodo was founded by someone who is part of the core community team of Numba, and Numba is underneath their products. RAPIDS from NVIDIA is using Numba to do some Pandas-like operations on the GPU. We helped them prototype in the initial phase of the project.
Q: What are you working on now that you are most excited to release?
With the current release, we are focused on enhancing performance. The whole program is about speeding things up, so a lot of what we do is about making that better. We are tackling some really complicated problems so that users can write less code to achieve the same speed. We are constantly adopting the latest research to improve speed. In particular, we are working on reducing the cost of automatic memory management (reference counting). We are seeing a speedup of 2x to 10x in some programs with the new version.
Q: What do you envision for this project a year from now?
We want to grow the open-source community. We’d like the hardware companies we work with to help us maintain the community and test the program against their hardware. We want to transfer knowledge to the public, so they can help us maintain and grow the project.
Q: In your mind, what is the value of open-source projects?
Open source allows a project to have a lifetime that is not bound by the lifetime or interest of a company. When a company doesn’t release a product as open source and its interest or focus moves away, the project is abandoned. Software is never complete. Software nowadays is so complex, and users’ requirements are constantly changing. Open-source projects allow for user feedback, help, and involvement to continue moving a project forward, which is something a proprietary project cannot do in the same way.
At Anaconda, we’re proud to support our employees’ involvement in open-source initiatives. To learn more about Numba and how we contribute to other open-source projects, visit our Open Source page.