Ask a data scientist what they enjoy the most about their job, and you’re bound to hear several different responses. They may say they enjoy tuning model parameters, generating predictions, or keeping up with the field’s latest advances. The list goes on, but rarely included is the process of data preparation, which, according to the 2020 State of Data Science report, makes up approximately 45% of data scientists’ responsibilities. Data preparation includes loading and cleansing data before feeding it into a model and is notorious for being the most time-consuming step of the data science lifecycle.
Data quality is at the core of good data science. Before models are even built, data scientists must verify that the data is clean, relevant, and complete for the experiment at hand. Despite being a cornerstone of the process, data cleaning and preparation are generally considered “drudgery.” This may be because the tasks that make up data prep can be tedious, such as carefully standardizing thousands of rows of data or fixing errors from integrating multiple data sets. Moreover, data prep work seems less glamorous than deploying models or making visualizations of results.
It’s tempting to think that you could automate this work with a press of a button, but the reality is you can’t fully take data preparation out of the data scientist workflow. While certain parts of the process will be better suited to be completed by a machine, there will always be some aspects of data preparation that data scientists must see through on their own. Data preparation helps data scientists better understand the data and its limitations, and consequently, build better models.
Benefits of knowing your data inside and out
Model outputs are only as valuable as a data scientist’s trust in their quality. That trust starts with an intimate familiarity with the datasets used to train the model and to run it in production. The process of wrangling, cleaning, and integrating data builds this familiarity, which is why it’s so important. Even when data scientists have high-quality datasets, they should remain skeptical, evaluate the data’s provenance, and consider the resulting implications. The careful examination that data preparation requires is what enables data scientists to trust their data.
We can liken data scientists to sushi chefs—both are roles where the quality of the inputs (whether raw fish or data) is key to the final product. The best sushi chefs will frequent fish markets to survey available fish, smell the freshness, and select what will work best for their needs. Even though it may seem like more tedious and basic work than the artistic aspects of preparing a roll, the reality is that taking this kind of hands-on approach gives the chef confidence in knowing they have picked the best ingredients from what’s available. The chef doesn’t necessarily go out and catch the fish themselves—just like data scientists don’t need to complete the entire data preparation process by hand—but they are involved enough to know that low-quality inputs won’t diminish the final result.
Data preparation is also a form of exploration. Getting into the weeds of a dataset equips data scientists to connect the dots better, identify patterns, and unearth potential features for modeling and prediction stages. Data scientists often discover new insights that change how they were planning to approach the model design. Without the close familiarity gained from the steps of data wrangling and cleansing, data scientists would miss out on these opportunities for a better understanding of the factors that may shape the final output.
Not all automation is bad
This isn’t to say that all attempts at automating data preparation are bad; some aspects of data cleansing and loading are well suited to automation. The $2 billion data quality tooling market will bring more tools to accelerate the data prep process. Data anonymization, for example, is crucial when working with data that involves personally identifiable information or other sensitive attributes. Replacing real personal data with similar but fake information is a task that automation handles far better than a human could in the same amount of time. Some software libraries, such as pandas, can also detect missing data and outliers much faster than a human looking through a dataset.
But automation and tooling can’t apply a critical lens to data findings. If a tool automatically identifies an outlier in a dataset, human intuition and expertise are still required to make the final call on whether the inconsistency is an error or an anomaly that needs further examination in the modeling process. This is where a data scientist’s judgment is critical in the data prep process.
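To make that division of labor concrete, here is a minimal sketch using pandas. It only flags missing values and statistical outliers for a person to review; it never drops or changes a row. The `flag_for_review` helper, the `orders` table, its column names, and the 1.5 × IQR threshold are all assumptions chosen for illustration, not a recommended standard for any particular dataset.

```python
import pandas as pd


def flag_for_review(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with two boolean flag columns added.

    Hypothetical helper: rows are flagged, never removed or modified,
    so the final decision stays with the data scientist.
    """
    out = df.copy()

    # Missing values are detected mechanically.
    out["is_missing"] = out[column].isna()

    # Interquartile-range rule: values far outside the middle 50% are flagged.
    # The multiplier k = 1.5 is a common convention, assumed here.
    q1 = out[column].quantile(0.25)
    q3 = out[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    out["is_outlier"] = (out[column] < lower) | (out[column] > upper)

    return out


# Made-up sample data for illustration only.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "amount": [20.0, 22.5, None, 19.0, 21.0, 950.0],
    }
)

flagged = flag_for_review(orders, "amount")

# The tool surfaces candidates; a person decides what each one means.
to_review = flagged[flagged["is_missing"] | flagged["is_outlier"]]
print(to_review)
```

In this sample, the tool can surface the unusually large order, but it cannot say whether that order is a data-entry error or a genuine bulk purchase worth modeling. That call still needs someone who knows the data.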
As mundane and time-consuming as data preparation may seem, it’s important to remember that the time it takes pays dividends in the form of better models and more accurate results. The best data scientists embrace elements of this process as the stepping stones to building quality models and delivering the best insights. While it’s exciting to see new solutions to ease the process, we must accept that doing some data prep elements by hand will continue to be integral in delivering the most value possible from data science work.