“If you want to understand function, study structure.”

Sage advice from Francis Crick, who revolutionized genetics with his Nobel Prize-winning co-discovery of the structure of DNA, launching more than six decades of fruitful research.

Crick was referring to biology, but today’s companies competing in the Big Data space should heed his advice. With change at a pace this intense, understanding and optimizing one’s data science infrastructure — and therefore functionality — makes all the difference.

But what’s the best way to do that?

Fortunately, there’s an ideal solution for evolving in a rapidly changing context while generating competitive insights from today’s deluge of data.

That solution is an emerging movement called Open Data Science, which uses open source software to drive cutting-edge analytics that go far beyond what traditional proprietary data software can provide.

Shoring up Your Infrastructure

Open Data Science draws its power from four fundamental principles: accessibility, innovation, interoperability and transparency. Together, these ensure that source code is accessible to the whole team, free from licensing restrictions or vendor release schedules, and works seamlessly with other tools.

Because open source libraries are free, the barrier to entry is very low. Teams can dive in and experiment without a massive up-front financial commitment, which encourages innovation.

Although transitioning to a new analytics infrastructure is never trivial, the community spirit of open source software and Open Data Science’s commitment to interoperability make it quite manageable.

Anaconda, for example, provides over 720 well-tested Python libraries for the demands of today’s data science, all available from a single install. Business analysts can be brought on board with Anaconda Fusion, providing access to data analysis functions in Python within the familiar Excel interface.

With connectors to other languages, integration of legacy code, HPC and parallel computing, as well as visualizations easily deployed to the web, there’s no limit to what can be achieved with Open Data Science. 

Navigating Potential Pitfalls

With traditional solutions, unforeseen limits can bring the train to a screeching halt.

I know of a large government project that convened many experts to creatively solve problems using data. The agency had invested in a many-node compute cluster with attached GPUs. But when the experts arrived, the installed software supported so few of their tools that fewer than a third of them could actually use it.

Organizations cannot simply buy the latest monolithic tech from vendors and expect data science to just happen. The software must enable data scientists and play to their strengths, not only serve the needs of IT operations.

Unlike proprietary offerings, Open Data Science has evolved along with the Big Data revolution and, to a significant extent, driven it. Its toolset is designed with compatibilities that drive progress.

Setting up Your Scaffolding

Making the shift to an Open Data Science infrastructure is more than just choosing software and databases. It must also include people.

Companies should provision the time and resources necessary to set up new organizational structures and provide budgets to enable these groups to work effectively. A pilot data-exploration team, a center of excellence or an emerging technology team are all examples of models that enable organizations to begin to uncover the opportunity in their data. As the organization grows, individual roles may change or new ones may emerge.

Details of which toolsets to use will need to be hammered out. Many developers are already familiar with common Open Data Science applications, such as data notebooks like Jupyter, while other tools may involve a steeper learning curve.

Choices such as programming languages will vary by developers’ preferences and particular needs. Python is commonly used, and for good reason. It is, by far, the dominant language for scientific computing, and it integrates beautifully with Open Data Science.

Finally, well-managed migration is critical to success. Open Data Science allows for a number of options, from “co-existence” of Open Data Science with current infrastructure to piecemeal or even full migration, depending on a company’s tolerance for risk and willingness to commit. Legacy code can also be retained and integrated with Open Data Science wrappers, allowing old but debugged and stable codebases to serve new duty in a modern analytics environment.
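To make the wrapper idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: legacy_average stands in for an old, stable routine with a text-based interface, and average is the thin wrapper a modern analytics team might put around it. A real legacy system would more likely be compiled code reached through a language binding, but the pattern is the same: leave the tested logic alone and translate only at the boundary.

```python
from typing import Sequence


def legacy_average(record: str) -> str:
    """Hypothetical stand-in for a stable legacy routine.

    It accepts semicolon-separated numbers as text and returns their
    mean as text with four decimal places, mimicking an old
    record-oriented interface.
    """
    values = [float(field) for field in record.split(";")]
    return f"{sum(values) / len(values):.4f}"


def average(values: Sequence[float]) -> float:
    """Modern wrapper: native Python types in and out.

    The legacy logic is untouched; only the data format is translated.
    """
    if not values:
        raise ValueError("values must not be empty")
    record = ";".join(repr(float(v)) for v in values)
    return float(legacy_average(record))
```

Callers work with ordinary Python sequences and never see the legacy record format, so the old routine can later be replaced without changing any of the code that depends on the wrapper.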

Taking Data Science to a New Level

When genetics boomed as a science in the 1950s, new insights were always on the way. But, to get the ball rolling, biologists needed to understand DNA’s structure — and exploit that understanding. Francis Crick and others began the process, and society continues to benefit.

Data Science is similarly poised on the cusp of an astounding future. Those organizations that understand their analytics infrastructure will excel in that new world, with Open Data Science as the instrument for success.