Behind the Code of Metagraph: Q&A with Core Architect Jim Kitchen
Data science and related fields were born in open-source projects and continue to be pushed forward by them. Open-source communities allow people to work together to solve larger problems. As stewards of the data science community, we believe it is important to go behind the lines of code to shine a light on those doing the work in open source. In a series of blog posts, we’ll highlight several Anaconda employees, the open-source projects they work on, and how their work is making an impact on the larger field.
In this post, I interview Jim Kitchen, a Senior Software Engineer at Anaconda and a core architect on Metagraph, a new Python library that provides a common entry point to graph algorithms through an orchestration layer on top of existing graph libraries.
Q: What is Metagraph?
The graph landscape in Python is vast, which is understandable. Graphs cover a lot of different needs and have many different approaches. The whole ecosystem is very fragmented. Metagraph is a new project that just went open source in September. It is our attempt to provide a consistent entry point into graph analysis in Python. You can write your graph workflow using a standard API and then have that dispatch to lots of different graph libraries that plug in to Metagraph.
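As a rough illustration of the idea of one entry point that dispatches to pluggable backends, here is a minimal Python sketch. It is not Metagraph's actual API: the names `register`, `run`, and the two toy "backends" are hypothetical and exist only to show the pattern.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry: algorithm name -> list of (backend name, implementation).
_REGISTRY: Dict[str, List[Tuple[str, Callable]]] = {}


def register(algorithm: str, backend: str):
    """Decorator a plugin could use to offer an implementation (assumed design)."""
    def decorator(func: Callable) -> Callable:
        _REGISTRY.setdefault(algorithm, []).append((backend, func))
        return func
    return decorator


def run(algorithm: str, *args, backend: Optional[str] = None, **kwargs):
    """Single entry point: find a registered implementation and call it."""
    for name, func in _REGISTRY.get(algorithm, []):
        if backend is None or name == backend:
            return func(*args, **kwargs)
    raise KeyError(f"no implementation of {algorithm!r} for backend {backend!r}")


@register("degree", backend="loop")
def degree_loop(edges):
    """Toy backend 1: count undirected degrees with a plain loop."""
    degrees: Dict[str, int] = {}
    for u, v in edges:
        degrees[u] = degrees.get(u, 0) + 1
        degrees[v] = degrees.get(v, 0) + 1
    return degrees


@register("degree", backend="counter")
def degree_counter(edges):
    """Toy backend 2: same result, computed with collections.Counter."""
    return dict(Counter(node for edge in edges for node in edge))


edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
default_result = run("degree", edges)                     # first registered backend
counter_result = run("degree", edges, backend="counter")  # explicitly chosen backend
```

The caller writes `run("degree", edges)` once, and which library does the work becomes a detail that can be swapped without changing the workflow.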
It’s an ambitious goal — we won’t get everything right. But we think it’s worth going after because it is a space that needs more uniformity and accessibility.
Q: What is your role within Metagraph?
I am one of the core architects on the project. We work to understand the graph landscape and then design the user experience for Metagraph. Again, we’re trying to design something knowing that there will be interesting corner cases that we haven’t thought about. We’re trying to make it flexible enough to handle those without being too abstract because we don’t want to over-engineer this. We’d rather get it into the hands of users, start getting their feedback, and then iterate until it becomes something people enjoy using.
Q: Now that the project is open source, what are you hoping other people will contribute?
Usually you have a library, and then you have users who use that library. Metagraph is an interesting project because it is trying to sit on top of and be an umbrella for lots of other libraries. There are those who use Metagraph and give us feedback, and then there are graph library owners who write plugins for Metagraph. They’re also contributing to Metagraph, but in a different way than a typical user: they’re plugging into the ecosystem. We’re looking for both of these audiences and perspectives.
Q: This project is funded as part of a larger effort with the Defense Advanced Research Projects Agency (DARPA) — can you provide more background on that?
It’s part of the DARPA HIVE (Hierarchical Identify Verify Exploit) program, an ongoing multi-year program that provides funding for corporations and research labs that push the broader graph landscape forward. We’re just one small piece of that but an important one in Python.
With us, they want to ensure that we’re not pigeonholing ourselves into one way of thinking because whatever we deliver needs to live beyond this project. Ideally, it will grow beyond even what we imagine it could be. We have quarterly deliverables for them to keep us focused, but they give us vast latitude because they trust our expertise.
Q: What are you working on now that you are most excited to release?
Since last quarter and through December, we’ve been working on the integration with Dask, another open-source project that came out of Anaconda. Dask has grown far beyond Anaconda; it now has a stable community of people who are invested in it. We’re tapping into that. Integrating Dask into Metagraph lets us do distributed computation.
Q: What has been the biggest challenge while working on this project?
The graph landscape is so vast. Trying to find a set of commonalities between all of these different libraries that we can integrate into our interface is challenging.
I can draw a parallel with something like SQL: you have one language that can run on lots of different databases. It’s a similar goal that we’re going after, and yet every relational database out there, whether it’s Oracle or SQL Server, wants to add its own extra things. If you want real performance, you have to learn those extra features. We’re going to run into the same thing.
It’s not a solved problem. It’s always going to be with us. It provides an interesting challenge. Standardization vs. flexibility — what is the right balance there?
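One way to picture the standardization vs. flexibility trade-off is an entry point with a small set of portable arguments plus optional backend-specific extras. The Python sketch below is only an illustration under that assumption; `bfs`, `BACKENDS`, and the `max_depth` option are hypothetical and are not taken from Metagraph.

```python
import inspect
from collections import deque


def bfs_basic(graph, source):
    """Backend A: supports only the standard arguments."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def bfs_limited(graph, source, max_depth=None):
    """Backend B: adds a non-standard 'max_depth' option."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Do not expand nodes that already sit at the depth limit.
        if max_depth is not None and dist[node] >= max_depth:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


BACKENDS = {"basic": bfs_basic, "limited": bfs_limited}


def bfs(graph, source, backend="basic", **extras):
    """Standard call; extras are passed through only if the backend accepts them."""
    func = BACKENDS[backend]
    accepted = inspect.signature(func).parameters
    unsupported = [key for key in extras if key not in accepted]
    if unsupported:
        raise TypeError(f"backend {backend!r} does not support: {unsupported}")
    return func(graph, source, **extras)


graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
portable = bfs(graph, "a")                                # works on any backend
tuned = bfs(graph, "a", backend="limited", max_depth=2)   # backend-specific extra
```

The portable call stays the same everywhere, while code that uses `max_depth` is tied to the one backend that understands it, which is the same tension as vendor-specific SQL features.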
Q: What do you envision for this project a year from now?
I would hope that we have users who are aware of Metagraph and giving us feedback. What I would really like to see is that when PhD students write their papers and present at conferences, they take their work and add a Metagraph wrapper around it. Then, in their paper or wherever they’re publishing their results, they can tell readers who want to try it out to install Metagraph and to install their Metagraph plugin. Their readers can immediately run their code — it’s as easy as that.
That’s one of the big things we want to do: the person who is an expert in their code writes a plugin. Now, anyone can use that code without having to become an expert in how they wrote their code.
Q: In your mind, what is the value of open-source projects?
There’s huge value in open source, especially for people who don’t know what they need yet. If you’re an engineer at a company and you have a chunk of data that you need to process, you can get stuck. You don’t really want to experiment with a commercial solution that you aren’t sure will meet your needs, and you often need a novel or even cutting-edge solution to the problem. Instead, you can download a few open-source projects, read some blogs, and try them out.
There is a ton of innovation in open source that enterprises have come to rely upon. In fact, I’ve seen a lot of grassroots data science in organizations. They’re building amazing stuff. Open-source projects usually end up matching or exceeding the quality of commercial products.
Q: Why should companies provide opportunities for employees to be involved with open-source projects?
Very few people have the luxury of working full-time on open-source projects without being paid to do so. I have seen people who are supporting key components of an open-source project on their nights and weekends without compensation, and that’s not a good situation. Anaconda recognizes the value and lets us work on this — that’s huge.
In tech, it’s common to have roles dedicated to open-source projects. In other industries, it’s less common, but that’s starting to change. Before, companies thought that if they had something valuable, they should keep it as a trade secret. But now, there have been some interesting examples where companies let something become open source, and it had lots of value.
First, it signals to the outside world that this is a great place to work. You can come here and work on open-source projects. You can do cool work. For a company that is trying to attract top talent, that’s a reputation that you can’t buy any other way. Second, if several companies are building competing projects and one decides to open-source the work, that product will often become the de facto standard that everyone else eventually has to adopt.
There are tangible benefits for companies that are more open-source friendly. To make open-source projects more sustainable, you need company backing. Employees will learn to find the balance between a company’s needs and the needs of the broader community.
At Anaconda, we’re proud to support our employees’ involvement in open-source initiatives. To learn more about Metagraph and how we contribute to other open-source projects, visit our Open Source page.