Data represents one of the most important competitive advantages in modern business, and the increasing reliance on data throughout decision-making processes has elevated the role of data scientists and the IT teams that support them. But like all departments, data science teams still encounter their share of headaches, frustrations, and, well, horror stories.
Light a candle and, if you dare, read up on these spine-tingling data science tales from beyond the desk!
When You Have to Find the Needle in a Haystack
Sean Law, Principal Data Scientist at Charles Schwab, ran into this “needle-in-a-haystack” problem recently in a 23-million-row JSON file that kept coming back as invalid. Combing through that much data to find the few small errors breaking the system sounds like a nightmare, but it is just another day for data scientists.
In today’s edition of data science hell: Finding the multiple different needles in a 23M row *invalid* JSON file. 😭
— Sean Law 🇨🇦 (@seanmylaw) October 26, 2022
This sort of issue is familiar to most, if not all, data scientists, including Anaconda Senior Data Scientist Vicky Kwan, who encountered a similar problem when looking into an infrastructure bug. After searching for a missing “}” in 300 lines of CloudFormation, an infrastructure-as-code (IaC) service for modeling, provisioning, and managing AWS and third-party resources, Vicky eventually fixed the bug and kept Anaconda’s servers running.
Format errors and data conversion nightmares aren’t unique to data scientists; technologists in all departments, including IT, frequently wrestle with such challenges. Parsing through data is laborious and exhausting, but addressing these kinds of problems is an important part of turning data into insights and moving businesses forward.
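For the needle-in-a-haystack case, it can help to let the parser point at the needles instead of reading the file by eye. The sketch below is only an illustration: it assumes the file is newline-delimited JSON (one record per line), which the stories above do not confirm, and `find_invalid_lines` is a hypothetical helper name. It uses only Python's standard `json` module.

```python
import json
from typing import Iterator, Tuple


def find_invalid_lines(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, error_message) for every line that is not valid JSON.

    Assumption: the file holds one JSON record per line. Blank lines are skipped.
    """
    with open(path, encoding="utf-8") as handle:
        # Reading line by line keeps memory use flat, even for millions of rows.
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                # err.msg describes the problem; err.colno is where parsing failed.
                yield line_number, f"{err.msg} (column {err.colno})"
```

Calling `list(find_invalid_lines("records.jsonl"))` (the file name is made up) would return every bad line in one pass. If the file is instead a single large JSON document, `json.load` raises a `json.JSONDecodeError` whose `lineno` and `colno` attributes locate the first error only, so the fix-and-rerun loop is harder to avoid.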
We asked around internally for some tips and tricks for dealing with data conversions and formatting. Here are a few of our team’s suggestions:
IT teams frequently help convert data to different structures, like JSON to CSV. Check out the pandas library in Python to simplify this process.
Use existing parsers rather than writing your own. And while you’re at it, use a parser instead of a regular expression. The road of “I’ll just use a regex to extract an email from this HTML doc” holds nothing but tears.
Never forget to hand-scan a sample of your data after any conversion. There are lots of obvious issues that you’ll catch just by scrolling through some of the records yourself, and it’s easy to assume “no errors raised” means “everything converted as I expected.”
If it’s a conversion you need to run over and over, an automated “quick consistency check” for basic things like tracking records, times of events, or values is amazingly helpful.
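As a rough illustration of the “quick consistency check” idea, here is a minimal sketch for a JSON-to-CSV conversion. It does not care which tool did the conversion (pandas or anything else); it only compares the two files. The function name `check_conversion` and the `id_field` and `value_field` parameters are hypothetical, and the sketch assumes the JSON file is an array of flat records in which every record has both fields and the value field is numeric.

```python
import csv
import json
import math


def check_conversion(json_path: str, csv_path: str, id_field: str, value_field: str) -> list:
    """Compare a JSON array of flat records against its CSV conversion.

    Returns a list of human-readable problems; an empty list means the checks passed.
    """
    with open(json_path, encoding="utf-8") as handle:
        source = json.load(handle)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        converted = list(csv.DictReader(handle))

    problems = []

    # 1. Record count: nothing was dropped or duplicated.
    if len(source) != len(converted):
        problems.append(f"row count: {len(source)} source vs {len(converted)} converted")

    # 2. Identifiers: the same records exist on both sides (CSV values are strings).
    source_ids = {str(record[id_field]) for record in source}
    converted_ids = {row[id_field] for row in converted}
    if source_ids != converted_ids:
        missing = sorted(source_ids - converted_ids)
        extra = sorted(converted_ids - source_ids)
        problems.append(f"ids differ: missing {missing}, unexpected {extra}")

    # 3. Values: a numeric total survives the conversion.
    source_total = sum(float(record[value_field]) for record in source)
    converted_total = sum(float(row[value_field]) for row in converted)
    if not math.isclose(source_total, converted_total):
        problems.append(f"{value_field} total: {source_total} vs {converted_total}")

    return problems
```

Run after every conversion, a check like this turns “no errors raised” into something closer to “everything converted as I expected.” It complements the hand-scan of a sample rather than replacing it.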
When the Model Is a Little Too Accurate
If your model reports nearly 100% accuracy, your first instinct should be to look for where you went wrong, because chances are a mistake was made somewhere. And if that model flies under the radar and makes it to production, something is going to break. A failure like that can have serious consequences for the business, so models need to be validated before release and monitored afterward.
According to a recent Reddit thread, multiple users have encountered this scenario, and they shared their spine-chilling tales:
I was reviewing a Data Scientist’s predictive model. They had used a neural net, probably because that’s what they knew. Their model was extremely accurate. Next to perfect. Which for me was a red flag. So I dug deeper. They were using time series data that had a lot of missing data as time went on. Because they needed a matrix, they had to do something with the missing data points. So they imputed all missing data with a value of 0. A significant portion of the data was now just 0. After I pointed out the problem, the model turned out to be useless. – Reddit user
A coworker confidently declared his model had a 98% accuracy. In the code review I found he wasn’t properly splitting his train and test sets. He was training on nearly his entire test set….I wasn’t supposed to be reviewing the code and it nearly made it into production. – Reddit user
I just commented this on another post: a phd of economics pushed me away for asking if he validated his model. I noticed he didn’t split his dataset at all and, I quote, made “economical assessments”….His model was deployed and was blatantly incorrect: it would predict a housing quality score, and estimated residential neighborhoods with high income as “poor quality”. – Reddit user
Code is code, whether it’s meant for pure software development, managing infrastructure (IaC), or your data science models. The application of tried-and-true software development techniques to data science might be novel, but you can and should apply those hard-won lessons from other areas of technology to data science. For example, pair programming during model development can help find and eliminate mistakes and increase your velocity.
No one makes a model fail on purpose, and getting a second—or third—set of eyes on your work is the best way to ensure improvement. Requiring code reviews prior to publishing can also strengthen your team’s output.
Finally, documentation must be considered at every step of the process. Not only will this keep your code transparent and allow others to spot things you might have missed, but proper documentation is also a critical part of IT governance and will make your auditors happy.
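Reviews catch a lot, and a couple of automated sanity checks can back them up. The sketch below targets two of the failure modes from the Reddit stories: test records leaking into the training set, and a feature that is mostly an imputed filler value. Both function names are hypothetical, and what counts as “too much” overlap or filler is a judgment call for your own data. This is an illustration, not a standard validation procedure.

```python
from typing import Hashable, Iterable, Sequence


def overlap_fraction(train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]) -> float:
    """Fraction of test records whose identifier also appears in the training set."""
    train = set(train_ids)
    test = list(test_ids)
    if not test:
        return 0.0
    return sum(1 for record_id in test if record_id in train) / len(test)


def filler_fraction(values: Sequence[float], filler: float = 0.0) -> float:
    """Fraction of values equal to the filler used for imputation (0 in the story above)."""
    if not values:
        return 0.0
    return sum(1 for value in values if value == filler) / len(values)
```

For example, `overlap_fraction([1, 2, 3, 4], [3, 4, 5, 6])` returns `0.5`, meaning half of the test set was seen during training. A near-perfect accuracy score next to a number like that is a reason to stop and investigate.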
When You Realize Your Mistake Is on Full Display to Stakeholders
Fixing bugs and reviewing models can be enormously time-consuming when you don’t know exactly where you went wrong. But sometimes you spot the mistake right away, only it happens in front of coworkers or stakeholders.
Dan Killam, Environmental Scientist at San Francisco Estuary Institute, recently shared a scary data science story on Twitter about a truly hair-raising experience, presenting to a critical audience with mislabeled objects:
In data science hell, you’re forced to present to a critical audience about all the objects you made labeled “test”. #rstats
— Dan Killam, PhD (@DantheClamMan) April 8, 2022
And in a Medium post, Vincent Vanhoucke, Distinguished Scientist at Google, talks about one of his first projects as an intern and how one failure led to the lesson of a lifetime:
Picture this other horror story: You’re an intern, and you’re asked to build a “yes” versus “no” speech classifier. You have audio files: yes1.wav, no1.wav, yes2.wav, no2.wav, yes3.wav, and so on. You build your classifier and obtain great results. The moment you are about to present your work, you discover that the only thing your model is actually doing is reading the words “yes” or “no” in the filenames of your audio files to determine the answer, and not listening to the audio samples at all. So you cower in shame, cry a lot, and find the nearest exit. –My Data Science Horror Story, Vincent Vanhoucke
We all make mistakes, which is good news for those of us who are just beginning our careers. Because those who came before us, those we look up to as luminaries in our fields—they make mistakes too. It’s expected. Simply ask your co-workers, and everyone will have an example of a technical demo that went sideways, a presentation to leadership that didn’t hit the mark, or a production change that caused a global outage.
Some would argue that if you aren’t making mistakes, you likely aren’t pushing yourself to achieve your peak performance. So accept your stumbles with self-compassion, have a sense of humor about them in the moment, and, most importantly, learn from them. Sometimes mistakes are the best teachers.
Don’t Let Scary Stories Hold You Back
There’s a lot of overlap between the challenges of data science teams and IT teams, particularly when it comes to ensuring organizations are prepared to meet future goals. As the lifeblood of the proper provisioning and configuration of IT resources, digital asset stability, and security, data can feel like a business area that’s high stakes and, sometimes, downright spooky.
There are moments over the course of any career that are scary, embarrassing, or stressful, but such moments are necessary for learning and growth. With some effort we can come out the other side and look back at past “horror stories” as moments that challenged us to become better. With resilience, a supportive community, and great tools, we’ll surely find some king-size candy bars along the way.
Happy Halloween from Anaconda!