Why Understanding CVEs Is Critical for Data Scientists
Mar 30, 2020 | By Nick Malkiewicz
CVEs (Common Vulnerabilities and Exposures) are publicly disclosed security flaws in software components. Because modern software is complex, with many layers, interdependencies, data inputs, and libraries, vulnerabilities tend to emerge over time. Ignoring a vulnerability with a high CVE score can result in security breaches and unstable applications.
Because data scientists work with vast stores of data, they need to take responsibility for the software components they use to minimize risk and protect customer data. A golden rule in security is, wherever valuable data can be found, hackers will go.
Software developers refer to CVE databases and scores on a regular basis to minimize the risk of using vulnerable components (packages and binaries) in their applications or web pages. They also monitor for vulnerabilities in components they currently use. To reduce the risk of a security breach from open-source packages, data science teams need to take this page from the software developer’s playbook and apply it to their data science and machine learning pipeline.
But isn’t open-source software the safest software?
Generally, yes. Open-source software has more eyes on it, and it is more transparent than proprietary software. However, like all software, it still has vulnerabilities. Some of the most infamous data breaches have occurred due to vulnerabilities in open-source software, such as Apache Struts and OpenSSL. Hacking open-source software also has a bigger payoff because many more people use it. Just like all software components, Python packages can also contain vulnerabilities. If an organization is not actively monitoring for vulnerabilities, it is very likely they will creep into their models and applications over time.
What should I look out for?
Enterprise data scientists should check every package they plan to use in company projects against a CVE database to ensure it is low-risk. When someone finds a vulnerability, they report it to a CVE Numbering Authority (CNA). CNAs assign identification numbers to CVEs, which are then listed in publicly accessible databases. Many IT and software development teams refer to the National Vulnerability Database (NVD), maintained by the National Institute of Standards and Technology (NIST), for updates. Thousands of new vulnerabilities are reported each year. Each vulnerability listed in a CVE database has a score from 0.1 to 10, with 10 being the highest risk level. These scores are based on exploitability, impact, remediation level, report confidence, and other qualities. To better understand how a CVE score is derived, read this documentation from FIRST that describes the scoring system in detail.
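To make the scoring scale concrete, here is a minimal sketch that maps a numeric score to a severity label. It assumes the scores are CVSS v3.x base scores and uses the qualitative rating bands FIRST publishes for that version; the function name cvss_severity is our own, not part of any library.

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating.

    Assumption: the bands follow the CVSS v3.x qualitative scale
    (None 0.0, Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0).
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS scores range from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"
```

A score of 0.0 carries the rating "None", which is why listed vulnerabilities effectively start at 0.1.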
Your DevOps team may have already determined what range of scores is acceptable for your company. Talk to your CISO or DevOps manager to see if a threshold has already been set. If not, determine your own risk threshold, and avoid downloading any packages with CVE scores outside it. CVE scores will also help you decide how to manage threats (remediation) and how to prioritize releases.
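Applying a threshold like this could be sketched as follows. Everything here is hypothetical: the Finding record, the package names, and the placeholder CVE identifiers are illustrative, and in practice the findings would come from whichever CVE database or scanner your team uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One reported vulnerability for a package (hypothetical record shape)."""
    package: str
    cve_id: str
    score: float


def worst_scores(findings):
    """Return the highest score seen for each package."""
    worst = {}
    for finding in findings:
        worst[finding.package] = max(finding.score, worst.get(finding.package, 0.0))
    return worst


def split_by_threshold(findings, max_score):
    """Split packages into (allowed, blocked) by their worst score."""
    worst = worst_scores(findings)
    allowed = sorted(pkg for pkg, score in worst.items() if score <= max_score)
    blocked = sorted(pkg for pkg, score in worst.items() if score > max_score)
    return allowed, blocked


# Placeholder data for illustration only; these are not real CVEs.
findings = [
    Finding("example-pkg-a", "CVE-0000-0001", 3.1),
    Finding("example-pkg-a", "CVE-0000-0002", 7.5),
    Finding("example-pkg-b", "CVE-0000-0003", 5.0),
]
allowed, blocked = split_by_threshold(findings, max_score=6.9)
```

Judging each package by its worst score is one design choice among several. A team might instead weigh how many findings a package has or whether a patched version exists.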
In addition to checking CVE scores, your team should be evaluating the reputation of each piece of software you’re interested in using. Some guidelines for doing this include checking how many contributors are working on the project and the cadence of release history. Is it fairly consistent? How active is the code base? This is important because a vulnerability score alone doesn’t mean you should or shouldn’t use a package. It also matters how fast a vulnerability can be patched. You should also ensure you use the most recent version.
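One rough way to put numbers on release cadence is sketched below. It assumes you have already collected a project's release dates yourself (for example, from its changelog); the dates shown are made up, and the function names are our own.

```python
from datetime import date
from statistics import median


def release_gaps_in_days(release_dates):
    """Return the number of days between each pair of consecutive releases."""
    ordered = sorted(release_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def cadence_summary(release_dates, today):
    """Summarize how regularly, and how recently, a project has released."""
    gaps = release_gaps_in_days(release_dates)
    if not gaps:
        raise ValueError("need at least two release dates")
    return {
        "median_gap_days": median(gaps),
        "longest_gap_days": max(gaps),
        "days_since_last_release": (today - max(release_dates)).days,
    }


# Hypothetical release history for illustration only.
releases = [date(2019, 1, 15), date(2019, 4, 15), date(2019, 7, 15), date(2019, 10, 15)]
summary = cadence_summary(releases, today=date(2020, 3, 30))
```

A steady median gap with a much larger days_since_last_release can hint that a project has gone quiet, which matters if you are counting on a fast patch.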
This all sounds very time-consuming…
Yes, it can be. We have heard from data science teams that do manual checks against a CVE database, spending hundreds of hours each year that could be better spent building, training, and deploying models. Even though it's time-consuming, companies in highly regulated industries, such as finance and healthcare, take the time because the risk is too high for them. The good news is that there are tools that automate the CVE monitoring process.
DevOps teams use vulnerability scanners and managed repositories to automate governance of packages and artifacts. These tools let them know how risky a package is and enable them to blocklist it if necessary. Data science teams need similar tools. That’s why we created Anaconda Team Edition, a mirrored repository for data science teams to scan for the latest Conda and other package vulnerabilities and block or safelist packages according to enterprise security standards. Our mirrored repository has helped companies in the most highly regulated industries to confidently harness open-source innovation in an efficient way.
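To illustrate the general idea of blocklists and safelists layered on top of a score threshold, here is a small sketch. It is not how Anaconda Team Edition or any particular scanner works; the function, its parameters, and the rule that explicit lists override the score are all assumptions made for the example.

```python
def review_package(name, worst_score, max_score,
                   safelist=frozenset(), blocklist=frozenset()):
    """Return "allow" or "block" for a package under a hypothetical policy.

    Assumed precedence: blocklist first, then safelist, then the score threshold.
    """
    if name in blocklist:
        return "block"
    if name in safelist:
        return "allow"
    return "allow" if worst_score <= max_score else "block"
```

Checking the blocklist first means a package a security team has explicitly rejected stays blocked even if someone later adds it to the safelist by mistake.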