December 17, 2019 in Data Science

Taking the “Wrangling” Out of Data Science

https://doi.org/10.1287/LYTX.2020.01.01

The world as we know it would be a very different place without data science. All around us, data science algorithms are being deployed against ever-increasing volumes of data, invisibly enabling everything from everyday tasks like searching the Internet or finding online images to more complex ones like powering the artificial intelligence (AI) solutions behind autonomous vehicles or accelerating the discovery of new cures for disease.

Data science and the disciplines of AI, machine learning (ML) and deep learning all follow a similar lifecycle – defining the business objectives, mining the data, cleansing the data, exploring the data, engineering features from the raw data, training ML models and, finally, communicating the findings. In fact, more than half of the steps in the data science lifecycle involve harnessing and manipulating vast amounts of data in one form or another. Data scientists appreciate this volume, as they need significant amounts of data to build and train models that deliver accurate results, but they don’t appreciate the effort required to make that data usable.

Why Data Scientists Were Never Meant to be Cowboys

Sometimes described as “wrangling,” the challenges data scientists face in acquiring and preparing data can be daunting. First and foremost, they need to identify where the right data is located and determine whether they have appropriate access to it, both from an authorization standpoint and in terms of the necessary technology. Once the data is accessible, they need to clean, query and analyze it, and finally fix any missing or incomplete values. They may also want to expand their original data sets, looking for correlations with data in their CRM, their enterprise data warehouse or even external sources such as weather data or clickstreams.

This process of data acquisition and manipulation invites the comparison to “wrangling,” which brings to mind what cowboys do to organize a herd of cattle. The problem for data scientists is that their “herd” of data isn’t restricted to “cattle” but rather is often quite heterogeneous, coming from different systems, over multiple protocols and in varied data formats. Ultimately, this inhibits productivity and slows down the data science lifecycle. It is this type of data bottleneck that contributes to data scientists spending 80% of their time preparing data and only 20% on actual data science.

Unless data scientists are already in possession of the perfect data set, there will come a time when they have to acquire data. Once the data is located, the most common method for retrieving it is to copy, extract and load it into the desired data repository, normally a data lake. Building and maintaining this ingestion pipeline can be not only time-consuming but also costly.

On one hand, the pipeline needs to adapt to the format and nature of the original source. On the other, once created, it needs to be maintained and governed; otherwise you run the risk of the data lake becoming unusable. The problem with this approach is that the usability of the data is not guaranteed until further analysis is done, which raises the question: Is it worth it to format, copy and maintain ingestion pipelines for data that may not turn out to be useful?

Leveraging Data Virtualization to Help Corral Data

An alternative approach to data acquisition that is gaining popularity is data virtualization (DV), a method of data integration that relies on metadata to create virtual data “views” of disparate data sources. This process eliminates the need to physically move data and therefore reduces the time needed to analyze it, reformat it and look for correlations with other data sets. Not only does DV accelerate data acquisition, it also provides a more streamlined, unified access point for data. This virtual data access layer abstracts the technologies underneath, offering a standard SQL interface for querying and manipulating the data. The data is then analyzed and, once identified as useful, can easily be moved into the data lake with one click. Typically, even after the right data is found, data scientists will want to modify it to meet their specific needs. Using data virtualization, they can do so with standard SQL operations (joins, aggregations and transformations), as sketched below.
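For illustration, here is a minimal Python sketch of what that access could look like, assuming the virtualization server is reachable through an ODBC DSN; the DSN, credentials and view names (crm_customers, weather_daily) are hypothetical.

import pyodbc
import pandas as pd

# Connect to the data virtualization server through a pre-configured ODBC DSN.
conn = pyodbc.connect("DSN=dv_server;UID=analyst;PWD=secret")

# Standard SQL against virtual views: join a CRM view with an external
# weather view and aggregate, even though the underlying sources differ.
query = """
    SELECT c.region,
           AVG(w.avg_temp) AS avg_temp,
           COUNT(*)        AS customers
    FROM   crm_customers c
    JOIN   weather_daily w ON c.region = w.region
    GROUP  BY c.region
"""

df = pd.read_sql(query, conn)
print(df.head())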

To further accelerate this phase, data scientists will often turn to a data catalog, which provides documentation, search and exploration capabilities for the available data sets and the metadata around a topic. This allows for better reusability, as data that is beneficial for one algorithm can also be valuable to others. Cataloguing the data used by data scientists, regardless of where it resides, fosters collaboration and further simplifies the journey to find the right data. Metadata can be annotated to better describe the data sets and models, while collaboration features enable sharing, endorsements, warning messages and more. Within a data catalog, data scientists can be assured they are working with known and trusted data assets, as they can explore the composition and lineage of the data stored in the catalog to confirm it is the right data coming from the right place.
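As a rough sketch, and assuming the catalog exposes its metadata as queryable views (the catalog_datasets view and its columns below are hypothetical), a data scientist could search for trusted data sets from the same SQL interface:

import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=dv_server;UID=analyst;PWD=secret")

# Look for endorsed data sets tagged with the topic of interest, along with
# ownership and freshness information that helps establish trust.
metadata = pd.read_sql(
    """
    SELECT dataset_name, description, owner, last_updated
    FROM   catalog_datasets
    WHERE  tags LIKE '%churn%'
      AND  endorsed = 1
    """,
    conn,
)
print(metadata)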

Benefits of Reining in Models and Operationalizing Algorithms

During the development of a data science model, data scientists normally use workbenches like RStudio, Apache Zeppelin or Jupyter. These web-based platforms enable data scientists to interactively develop models and experiment with their data science projects. These tools can easily integrate with a data virtualization layer, normally via SQL over JDBC/ODBC connections. Some data virtualization tools come equipped with their own notebook to enable immediate access to the logical data models. This real-time access makes it faster and easier for data scientists to work with the data.
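As a simple sketch of that notebook workflow, the snippet below pulls a logical view into a pandas DataFrame over the same kind of ODBC connection and fits a model with scikit-learn; the view and column names are hypothetical.

import pyodbc
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

conn = pyodbc.connect("DSN=dv_server;UID=analyst;PWD=secret")

# Query the logical data model directly from the notebook.
data = pd.read_sql(
    "SELECT tenure, monthly_spend, churned FROM customer_features", conn
)

X_train, X_test, y_train, y_test = train_test_split(
    data[["tenure", "monthly_spend"]], data["churned"],
    test_size=0.2, random_state=42
)

# Train and evaluate a simple model on the virtualized data.
model = LogisticRegression().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))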

At the end of this lifecycle, data scientists need to operationalize their algorithms and models. The predictions they generate are only as useful as the decisions they enable, so they need to find their way to the dashboards of less technical business users. This is often a challenge, as those users typically lack proficiency in Python and won’t use a notebook. Again, a virtualization layer, with its abstraction capabilities, can be a good ally for data scientists in this phase. For example, a model might be built in Python, but once trained it can be published as a JSON API. The model then becomes a new data source for the data virtualization tool, where it can be secured and easily consumed by standard reporting tools like Tableau or Power BI.
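One way this could look in practice, as a hedged sketch rather than a prescribed approach: wrap the trained model in a small JSON API (Flask is used here; the endpoint and feature names are hypothetical), which the virtualization layer or a reporting tool can then call as just another data source.

from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)

# Assumes the model trained in the notebook was saved with joblib beforehand.
model = joblib.load("churn_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"tenure": 14, "monthly_spend": 52.0}.
    payload = request.get_json()
    features = [[payload["tenure"], payload["monthly_spend"]]]
    score = model.predict_proba(features)[0][1]
    return jsonify({"churn_probability": float(score)})

if __name__ == "__main__":
    app.run(port=5000)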

As we have seen, data virtualization platforms can play a very important role in the data scientist’s toolbox. They help accelerate and simplify data exploration, acquisition and, yes, even wrangling, thanks to their abstraction capabilities on top of any data source. DV solutions can also help document and reuse existing data sets. Finally, and perhaps most importantly, they can help expose the results of the data science process to a greater audience. That may be why data scientists are doing away with the lasso in favor of the data virtualization “Swiss Army knife” to get a handle on their data landscape.

Pablo Álvarez

Pablo Álvarez is the director of Product Management at Denodo, a provider of data virtualization software.

