October 7, 2019 in Analytics
Unchaining data scientists to make DataOps work
https://doi.org/10.1287/LYTX.2019.06.03
DataOps (data operations) – an automated, process-oriented methodology for optimizing the rapid collection, integration, analysis, security and integrity of data – needs to be standard practice at any insight-driven organization. DataOps improves the quality and reduces the cycle time of data analytics, while upholding security, compliance and governance policies. When implemented correctly, the practice increases the business value of data science by supporting cross-functional collaboration and accelerating time to market.
Unfortunately, the majority of organizations attempting to implement a DataOps culture have been unsuccessful. In fact, according to Forrester, only 22% of enterprise companies are currently seeing a significant ROI from any data science expenditures. There are a number of reasons for this, which we explore in this article.
Why We Can’t Get DataOps Right for Data Science
Data science is only as good as the data it’s working with. Therefore, data scientists and their analytics, business intelligence (BI) and machine learning tools must rely on robust and thorough data preparation. Although automated machine learning has safeguards to prevent common mistakes and can work with imperfect data, the results won't be optimal. Data scientists can't glean the most accurate information and insights from amalgamated data without carefully curated data sources. This means great attention and care must be given to preparing the best possible datasets.
Meanwhile, many companies have hastily adopted a mishmash of BI tools without a clear strategy for how they will fit into their broader analytics environment. Most enterprise-level companies use several BI tools, each with its own query language, so the same question can return different results depending on the tool. This undermines the accuracy and reliability of data analyses.
On top of this, data is scattered across organizations in disparate formats and systems, making it difficult to locate, access and integrate for analysis. Some of it may be in the cloud and some on on-premises servers, and it is often governed by different policies and security practices.
The proliferation of tools, limited access to data, difficulty aligning and integrating siloed data from disparate sources, slow query performance, and data security and governance challenges make it difficult for data scientists to do their jobs – and prevent DataOps from swiftly generating trustworthy and valuable data.
So, how is it possible to untangle so many deeply ingrained organizational structures and previous investments without a long, complicated, tedious and very expensive overhaul and/or migration of all of your current systems, tools and processes? Would going through such an ordeal even be worth it to optimize your DataOps? Can’t data scientists make do with what they have?
Making DataOps Work Is Easier Than You Think
Integrating data science activities into a holistic, collaborative process of business decision-making, and empowering your data scientists to do what they are meant to do, doesn’t have to break the bank. Nor does it have to mean investing in new systems or BI tools, expanding your investment in data engineering, or lifting and shifting any of your data. Instead, many organizations are turning to an adaptive analytics fabric.
Adaptive analytics is a new, source-agnostic approach to accessing data in all formats, in legacy systems or in the cloud, without having to move it or transform it in any way. This makes all of an organization’s data available for collection and analysis. Adaptive analytics creates a platform- and BI tool-agnostic environment that removes burdensome technical requirements and lets DataOps concentrate on mining value from data. It accomplishes this in a number of ways, including:
- Presenting all data in one view
- Accelerating queries
- BI/OA tool-agnostic access
- Ensuring data security at rest and in flight
Presenting all data in one view: When finding and accessing the data you need is cumbersome and time-consuming, DataOps runs the risk of producing insights that are inaccurate or incomplete. It’s easy to fall prey to the temptation to gather data in the form of local extracts that will go stale, get interpreted inconsistently, or even get leaked or stolen. When this happens, the insights derived and the conclusions drawn will likely lead to suboptimal decisions, possibly putting the organization at a disadvantage.
With an adaptive analytics fabric, all of your data, no matter what format it’s in or where it resides, is readable and accessible by all departments. The data stays where it is, but it is translated into a common business language and is accessible via a single source of truth, presenting all of your data in a single view.
With a single, reliable location that is easily accessible, data scientists don't have to create their own local extracts and private data stores. All data sets from all data sources can be seamlessly integrated, facilitating a comprehensive view of company data. Data scientists can draw on this shared view of the data and make their recommendations knowing that they are in alignment with their colleagues.
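To make the idea of a single view concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the two sources, their field names (`cust_id`, `customerId`, and so on) and the function names are all hypothetical. Each source is read where it lives and translated on the fly into one common vocabulary, so no local extract is created.

```python
import csv
import io
import json

# Hypothetical sources in different formats (stand-ins for real systems).
CSV_SOURCE = "cust_id,amt\n1,10.5\n2,4.0\n"           # e.g. an on-premises table
JSON_SOURCE = '[{"customerId": 3, "total": 7.25}]'   # e.g. a cloud service


def read_csv_source():
    """Yield CSV rows translated into the common field names."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield {"customer_id": int(row["cust_id"]), "amount": float(row["amt"])}


def read_json_source():
    """Yield JSON records translated into the common field names."""
    for rec in json.loads(JSON_SOURCE):
        yield {"customer_id": int(rec["customerId"]), "amount": float(rec["total"])}


def unified_view():
    """Present both sources as one stream of records; nothing is copied."""
    yield from read_csv_source()
    yield from read_json_source()


if __name__ == "__main__":
    for record in unified_view():
        print(record)
```

Because `unified_view` is a generator, records are translated as they are read rather than staged in a private data store, which is the property the single-view approach depends on.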
Accelerating queries: An organization's swiftness and agility in decision-making depend, in part, on the amount of time it takes to complete data queries. But queries on databases with billions of records can take hours or even days to return. When real-world business conditions can change faster than the time it takes for your queries to return, how will you be able to keep up with your industry peers that have more solid, efficient DataOps?
With an adaptive analytics fabric, autonomous data engineering uses machine learning (ML) to optimize queries against data sets in the enterprise data warehouse. ML technology determines what data is necessary, and what data is superfluous and can be bypassed. The fabric then substitutes optimized acceleration structures for the raw data in queries. This results in significant time savings, with queries returning five to 40 times faster.
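The substitution step can be illustrated with a small Python sketch. This is a simplified stand-in, assuming a pre-computed aggregate plays the role of the "acceleration structure"; it does not use machine learning to decide what to build, and the table contents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw fact table: (region, product, sales_amount).
RAW_SALES = [
    ("east", "widget", 100.0),
    ("east", "gadget", 50.0),
    ("west", "widget", 70.0),
    ("west", "widget", 30.0),
]


def build_aggregate(rows):
    """Pre-compute total sales per region (the stand-in acceleration structure)."""
    totals = defaultdict(float)
    for region, _product, amount in rows:
        totals[region] += amount
    return dict(totals)


REGION_TOTALS = build_aggregate(RAW_SALES)


def total_sales(region, use_aggregate=True):
    """Answer 'total sales for a region', preferring the aggregate when available."""
    if use_aggregate and region in REGION_TOTALS:
        return REGION_TOTALS[region]  # single lookup, no scan of raw rows
    # Fallback: scan every raw row.
    return sum(amount for r, _p, amount in RAW_SALES if r == region)


if __name__ == "__main__":
    print(total_sales("east"))
    print(total_sales("east", use_aggregate=False))
```

Both paths return the same answer; the difference is that the aggregate path touches one pre-computed value instead of every raw record, which is where the time savings on large tables would come from.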
Rather than having to rely on IT to provision data for them, data scientists can combine data in any way they wish and use all of the organization's available data to build models, since all of it can now be integrated. This means that data scientists who were once very careful about their lines of inquiry, because it took so long to build models and get results, can now investigate many more ideas and experiment with new avenues of research.
BI/OA tool-agnostic access: Everyone has their own personal preference when it comes to BI/OA tools. Whether it’s an individual choice or the standards of each department, most enterprise-level companies and their data scientists use a number of different tools.
However, supporting a variety of different BI tools can be very challenging, and often results in divergent query results. Different BI/OA tools have different query languages and display data in slightly different ways. When data with incongruent definitions are combined without being normalized, costly errors in analysis can occur, even when the underlying data is the same. In order to ensure that information is accurate no matter what tool performs the analysis, enterprises must be able to normalize data with a common enterprise business logic that is legible to multiple different BI/OA tools.
With adaptive analytics, data scientists can use any BI/OA tool, because the differences are automatically normalized with a semantic translation layer. Organizations no longer have to bend all users to a single standard for BI software. Disparate data sets can be accessed, integrated and analyzed with any BI/OA tool a data scientist or organization wishes to use, and queries will return consistent answers.
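As a rough illustration of what a semantic translation layer does, the Python sketch below resolves tool-specific field names to one shared metric definition. The tool names, aliases, metric and data are all hypothetical, and a real layer would translate full queries rather than single fields.

```python
# Hypothetical shared business logic: one definition per metric.
METRICS = {
    "net_revenue": lambda row: row["gross"] - row["refunds"],
}

# Hypothetical names that two different tools use for the same metric.
TOOL_ALIASES = {
    "tool_a": {"Net Rev": "net_revenue"},
    "tool_b": {"revenue_net": "net_revenue"},
}

# Hypothetical underlying data.
ROWS = [
    {"gross": 120.0, "refunds": 20.0},
    {"gross": 80.0, "refunds": 5.0},
]


def query(tool, field):
    """Resolve a tool's field name to the shared metric, then compute it."""
    canonical = TOOL_ALIASES[tool][field]
    metric = METRICS[canonical]
    return sum(metric(row) for row in ROWS)


if __name__ == "__main__":
    print(query("tool_a", "Net Rev"))
    print(query("tool_b", "revenue_net"))
```

Because both tools resolve to the same entry in `METRICS`, they cannot disagree about what "net revenue" means, even though each uses its own label for it.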
Ensuring data security at rest and in flight: The ability to access and integrate all data sets with any BI/OA tool you wish is a game-changer for DataOps. However, security leaks can occur when enterprises use connection pools for BI tools or depend on security aggregation systems. DataOps is supposed to provide valuable business insights in a much shorter time window, but it is never wise to do this at the expense of privacy and security.
An adaptive analytics fabric eliminates this potential security risk by checking security requirements at the source databases and applying those requirements to query results. User identities are tracked and verified, even when accessing data through a connection pool. The various security policies from all data sources are collected and merged to filter results appropriately. These same security policies are applied to data aggregates, eliminating unintentional exposure of restricted or private data.
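A minimal Python sketch of this idea follows. It assumes, purely for illustration, that policies from different sources are merged by intersection (a user sees only what every source permits); the article does not specify the merge rule, and the users, sources and data here are hypothetical.

```python
# Hypothetical per-source policies: user -> regions that source lets them read.
SOURCE_POLICIES = {
    "warehouse": {"ana": {"east", "west"}, "ben": {"east", "west"}},
    "crm": {"ana": {"east"}, "ben": {"east", "west"}},
}

# Hypothetical data behind an aggregate such as "total sales".
ROWS = [
    {"region": "east", "amount": 100.0},
    {"region": "west", "amount": 40.0},
]


def allowed_regions(user):
    """Merge policies by intersection (an assumption, not a documented rule)."""
    grants = [policy.get(user, set()) for policy in SOURCE_POLICIES.values()]
    return set.intersection(*grants) if grants else set()


def secure_total(user):
    """Compute the aggregate only over rows the identified user may see."""
    regions = allowed_regions(user)
    return sum(row["amount"] for row in ROWS if row["region"] in regions)


if __name__ == "__main__":
    print(secure_total("ana"))  # restricted to regions every source allows
    print(secure_total("ben"))
```

The point of the sketch is that the filter is applied per user before aggregation, so a total computed over a shared connection never includes rows the requesting user could not read at the source.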
An adaptive analytics fabric also uses best-of-breed security practices wherever possible, such as end-to-end TLS to protect data in flight; LDAPS, Active Directory, IdP and SAML for authentication; and JWT, CORS and REST for API access.
Remove the Barriers to Your DataOps
DataOps cannot generate real, significant value unless your environment is free of a number of technical and organizational barriers. Adaptive analytics is the fastest and easiest way to break down those barriers and allow data scientists to focus on their core competencies while cultivating an agile, iterative workflow. With a clear path forward, DataOps can fully leverage data in real time across all business units for new products, services, and customer and market insights to drive growth and outpace the competition.
Matt Baird is co-founder and CTO of AtScale.