Why Data Scientists Should Be Excited About Python in Excel
I know what you’re thinking. If you’re an experienced data scientist, your work with Excel may be grudging or fleeting. But step outside the rarefied realm of data science and engineering and you will discover a very different reality. People in positions of responsibility have used spreadsheets to make critical decisions for several decades now. In the last chapter of Fundamentals of Data Engineering, Joe Reis and I had this to say about spreadsheets:
“What’s the most widely used data platform? It’s the humble spreadsheet. Depending on the estimates you read, the user base of spreadsheets is between 700 million and 2 billion people. Spreadsheets are the dark matter of the data world. A good deal of data analytics runs in spreadsheets and never makes its way into the sophisticated data systems that we describe in this book. In many organizations, spreadsheets handle financial reporting, supply-chain analytics, and even CRM.”
Who uses spreadsheets? CFOs use them to report official quarterly earnings for publicly traded companies. CMOs use Excel to track spending on hundred-million-dollar advertising campaigns. Epidemiologists, economists, supply chain managers, and procurement officers use spreadsheets for pandemic contact tracing, inflation analysis, inventory tracking, and RFPs.
Failure to Communicate
Meanwhile, data scientists are accused of blowing through piles of money to execute extravagant projects that deliver little concrete value. Poor communication and collaboration are often to blame. In our excitement over the latest tools and techniques, we fail to understand the larger goals of the organizations that employ us, or to communicate the potential impact of data science projects. Thomas C. Redman wrote in Harvard Business Review about the lessons he learned as a newly graduated PhD working at Bell Labs:
“According to LinkedIn, the top 10 skills for a data scientist include machine learning, R, Python, data mining, data analysis, data science, SQL, MatLab, big data, and statistical modeling. The focus is on skills, and many data scientists are perfectly content to apply those skills while sitting at their computers and plowing through ever-increasing amounts of data in the hopes of finding something interesting. But it is not enough to put data scientists in the right spots and let them work. You need to instruct them to fully engage in your business, show them how things really work, and help them connect with others in the organization.”
Okay, I’m talking about communication, but this post is supposed to be about Excel. How is Excel going to help us improve our communication skills and become better data scientists?
An Improved Collaboration Tool
Data scientists and business users quite often exchange spreadsheets as a form of collaboration. Right now, this is an awkward process; while there are a variety of libraries that allow Python to read from and write to spreadsheets, these tools are complicated to use, especially on the business stakeholder side of the exchange. In practice, data scientists have to do a bunch of manual work to update spreadsheets and make collaboration possible.
While no tool can magically resolve human communication problems, Python in Excel will create a common working platform for data scientists and spreadsheet users, dramatically streamlining the collaboration process. When tools like Slack, Git, or Asana are used well, they create a sense of seamless coworking on a common problem. Python in Excel has the potential to deliver a similar experience, finally allowing us to move beyond the throw-it-over-the-wall mentality that is common right now.
Data scientists still need to stop by the desks of their business colleagues to understand the problems they struggle with and the goals they’re trying to achieve. But once they do, they will be able to provide neater, cleaner deliverables with less manual work and friction. It will be possible to bake analysis and modeling code right into a spreadsheet. Users can execute Python code within their sheets just like they would conventional Excel formulas, without complex local installation of code libraries.
For example, a product-buying team could load inventory and sales data into a spreadsheet and, with a few clicks, set parameters based on next week’s sales goals, update a time series model to forecast future sales, and export data to place more orders. A data scientist builds this spreadsheet with the Python-based tools they code in every day, but the buying team updates the data and executes the code independently. Rather than constantly pinging the data scientist to ask why the refreshed model is late, the two teams can focus their discussions on model performance and potential future improvements.
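To make the buying-team scenario a little more concrete, here is a minimal sketch of the kind of logic a data scientist might embed in such a sheet. It is an illustration under stated assumptions, not a description of any particular workbook: simple exponential smoothing stands in for the unspecified "time series model", and the names ReorderPlan, plan_reorder, goal_uplift, and on_hand are all hypothetical. In a real workbook the sales history would come from a range in the sheet rather than a literal list.

```python
import math
from dataclasses import dataclass


@dataclass
class ReorderPlan:
    """Hypothetical container for the numbers the buying team would read."""
    forecast_units: float  # smoothed estimate of next week's demand
    target_units: float    # forecast scaled up by the sales goal
    order_units: int       # units to order after subtracting stock on hand


def exponential_smoothing_forecast(weekly_sales, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.

    alpha close to 1 weights recent weeks heavily; alpha close to 0
    produces a slowly changing estimate.
    """
    if not weekly_sales:
        raise ValueError("weekly_sales must contain at least one value")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    level = float(weekly_sales[0])
    for units in weekly_sales[1:]:
        level = alpha * units + (1 - alpha) * level
    return level


def plan_reorder(weekly_sales, on_hand, goal_uplift=0.0, alpha=0.5):
    """Turn a sales history and a goal into a suggested order quantity.

    goal_uplift is the parameter the buying team would adjust, for
    example 0.25 to aim 25% above the baseline forecast.
    """
    forecast = exponential_smoothing_forecast(weekly_sales, alpha)
    target = forecast * (1 + goal_uplift)
    order = max(0, math.ceil(target - on_hand))
    return ReorderPlan(forecast, target, order)


# Illustrative numbers only; they are not taken from any real data set.
weekly_sales = [100, 120, 110, 130]
plan = plan_reorder(weekly_sales, on_hand=60, goal_uplift=0.25)
print(plan)
```

The point of the design is the division of labor: the data scientist owns the functions, while the buying team only touches the inputs (the sales history, the stock on hand, and the goal) and reads the resulting plan.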
Toward a Better Excel
When I started working with the early Python in Excel beta a few weeks ago, I was expecting a slight extension of the Excel formula system through the addition of Python-based functions; the tool I discovered is actually a fusion of the Excel visual paradigm familiar to business users with the notebook approach that data scientists prize in tools like Jupyter. Valerio Maggio’s excellent post on running a machine learning experiment with Python in Excel demonstrates the kinds of things that are possible.
In practice, data can still be embedded in the sheet itself, subject to Excel’s limit of 1,048,576 rows per worksheet, but users now also have access to the full machinery and scale of pandas DataFrames, up to the limits of available memory. Python in Excel also supports named global Python variables, which are critical for tracing algorithms and understanding code. Thus, Python in Excel begins to solve some of the problems that data scientists contend with on a daily basis.
A Few Caveats
As of this writing, Python in Excel is in beta. For the moment, you can run a variety of data-science-oriented tools, but the functionality is highly sandboxed and has limited connectivity to data sources. In addition, it is not yet recommended for production use. Even so, you will be able to get a pretty good sense of the power of this tool so you can plan for future use.
Give It a Try
If you are a data scientist reading this piece, I assume that you are extremely skeptical. I certainly was. But if you collaborate with business users who work in Excel, set your skepticism aside and give the tool a try. It is much more powerful and easier to use than I had imagined, and it has the potential to transform aspects of the way we work together.
Disclaimer: The Python integration in Microsoft Excel is in beta testing as of the publication of this article. Features and functions are likely to change. Don’t hesitate to reach out if you notice an error on this page.
Matt Housley holds a Ph.D. in mathematics and is co-author of the bestselling O’Reilly book Fundamentals of Data Engineering. He began learning about computers on 8-bit machines, such as the Commodore 64 and Apple IIc, and eventually learned Python through teaching mathematics before working as a data scientist and engineer. He currently writes, trains, consults, and podcasts on data engineering, data strategy, and data policy.