November 11, 2020 in Business Analytics

Data Science Productization Drives Business Value

The many benefits, risks and challenges of building and deploying data science products.

https://doi.org/10.1287/LYTX.2020.06.15

The phenomenal volume and velocity of data generated in this digital world are increasing exponentially. Today, data collected across multiple feeds is heterogeneous, complex and nonintuitive. The ultimate objective of collecting, storing and analyzing data at Walmart is to deliver value to customers, associates and our business. Value derived from data can be tangible, leading to increased sales and a higher return on investment, or intangible, leading to brand uplift and customer retention.

One of the proven ways organizations can realize value from their data is by building and deploying data science products. “Productizing” data science is a journey that involves translation of insights obtained from exploratory analysis into scalable models that can power data products. This involves focusing on deploying models into production systems and effectively automating and scaling them. Driving value through creation of data science products helps organizations climb the data and analytics value chain to attain significant data system maturity.

Productizing data science can have many benefits. It can help embed data science across all enterprise products and democratize its applications by placing it in the hands of even nontechnical business users. This can help drive higher adoption of data science across the enterprise.

The other big advantage of productizing data science is scale. Institutionalizing data science products lets organizations reuse a solution built for one application in many others. For example, the market mix models built for general merchandising, if turned into a product, can be leveraged for online grocery shopping and across different geographies. Such institutionalization of products can enhance efficiency and limit redundancy. Data science productization can also generate cost savings for organizations by reducing duplication of effort and by reusing machine learning models and production systems across diversified applications.

Walmart’s Core Principles

Walmart sign exterior — Image Source: Walmart Inc.

Walmart, the world’s largest company by revenue per the Fortune 500 list, is a huge organization, serving nearly 265 million customers each week. Each day, huge amounts of data (petabytes in size at any given time of the year) are generated across our business channels. Analyzing such massive amounts of data to derive insights requires building data science products to leverage this scale, reuse solutions for different applications and automate. We operate on the core principle of creating and realizing value from data while maintaining customer trust.

One of the biggest advantages of productizing data science is that it can help organizations utilize their scarce data science resources to do niche data science work, freeing them from mundane, repetitive tasks. According to Indeed, a top online job site, the demand for data scientists jumped 29% year over year in 2020 and 344% since 2013. As a result, the demand for such skilled workers exceeds the supply. Data science productization can help organizations retain data scientists by keeping them motivated with tasks they enjoy most, such as building and institutionalizing models for novel applications and solving unique problems.

While productizing data science offers many benefits to organizations, it also presents numerous risks and challenges, including the following:

Risk of losing data science resource investments: According to the Project Management Institute (PMI), approximately 14% of corporate projects fail, 31% of all corporate projects do not meet their goals, 43% exceed their initial budget and 49% do not meet timelines. The best solution: invest a limited amount of time and resources (including money) and complete a pilot phase. The alternative, disadvantageous plan – invest a lot of resources, complete the project end-to-end and deploy the models in production environments only to find the results are not good enough – exposes organizations to unnecessary, costly risk.

Risk of concept drift: Machine learning (ML) algorithms not connected to a constant feed of new data make predictions that become less useful and accurate as time goes on. This happens because the statistical properties of the target variable that the model is trying to predict change over time, so the relationships the model learned during training no longer hold. This phenomenon is known as concept drift. Machine learning models are designed to gain intelligence by analyzing new scenarios from the dynamic data feed. If the new data feed is not connected to these algorithms, they degrade in quality. One way to address this risk is to use adaptive windowing (ADWIN), an algorithm available in Python’s scikit-multiflow library, to help identify drift.
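The adaptive windowing idea can be illustrated with a minimal, standard-library sketch. It keeps a window of recent values in the range 0 to 1 (for example, a per-prediction error flag), tests every split of the window into an older and a newer part, and drops the older part when the two means differ by more than a Hoeffding-style threshold. This is a simplified illustration and not scikit-multiflow's implementation: the class name, parameter names and defaults are assumptions, and the bucket compression that makes the real algorithm efficient is omitted.

```python
import math
from collections import deque


class SimpleAdaptiveWindow:
    """Simplified adaptive-windowing drift detector for values in [0, 1].

    Hypothetical helper for illustration only; not the scikit-multiflow API.
    """

    def __init__(self, delta=0.002, max_window=500, min_side=5):
        self.delta = delta        # smaller delta means fewer false alarms
        self.min_side = min_side  # minimum observations on each side of a split
        self.window = deque(maxlen=max_window)

    def update(self, value):
        """Add one observation; return True if drift was detected."""
        self.window.append(value)
        drift = False
        while self._drop_stale_prefix():
            drift = True
        return drift

    def _drop_stale_prefix(self):
        """Drop the oldest part of the window if its mean differs significantly."""
        values = list(self.window)
        n = len(values)
        total = sum(values)
        head_sum = 0.0
        for n0 in range(1, n):
            head_sum += values[n0 - 1]
            n1 = n - n0
            if n0 < self.min_side or n1 < self.min_side:
                continue
            mean_old = head_sum / n0
            mean_new = (total - head_sum) / n1
            # Harmonic-mean term and cut threshold from the Hoeffding bound.
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mean_old - mean_new) > eps_cut:
                for _ in range(n0):
                    self.window.popleft()
                return True
        return False


def first_drift_index(stream, detector):
    """Return the index of the first observation that triggers drift, or None."""
    for i, value in enumerate(stream):
        if detector.update(value):
            return i
    return None


# Synthetic example: the model is always right for 200 predictions, then always wrong.
errors = [0.0] * 200 + [1.0] * 200
drift_at = first_drift_index(errors, SimpleAdaptiveWindow())
```

In a production setting the detector would be fed the live error stream, and a detected drift would typically trigger an alert or a retraining job rather than a silent window reset.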

Risk of creating dysfunctional teams: Over the years, the software industry has matured, and software engineering roles have gained sophistication in operating production applications and services. However, deployment of machine-learning products requires strong data science skills along with software engineering skills. Expecting software engineers to deploy data science products in a production environment without any additional training and supervision is setting them up for failure. Machine-learning engineers, who possess a combination of both data science and data engineering skill sets, are needed for this task. For effective deployment, these ML engineers should work closely with the software engineers who build and maintain the production systems. The lack of such collaboration risks creating dysfunctional teams within the organization.

Risk of unintended consequences: Productizing data science may expose organizations to the risk of perpetuating and amplifying data bias, leading to unintended consequences in applications. Artificial intelligence algorithms are the statistical representation of the data and the world we live in. They are not rule-based, hence they learn from scenarios present in the datasets on which they are trained. If the training dataset has scenarios that imply bias, they are reflected in the model outcome. Eli Pariser coined the phrase “filter bubble” to explain a similar phenomenon in hyperpersonalization. He indicated that hyperpersonalization models reinforce the same interests and belief systems, which can institutionalize the bias and leave people in the bubble.

While some artificial intelligence algorithms can be interpreted to understand how they’re making predictions, others remain a black box so even industry experts can’t interpret them. This lack of interpretability and transparency of algorithms creates concerns of unintended consequences regarding productization across the organization. Mitigating these risks is essential to derive value from data science productization. Organizations need to ensure effective collaboration between data science, product management, legal, ethical and engineering teams. These teams need to work as a single unit and in a seamless manner to create value.

To effectively realize value from data science productization, organizations need to focus on the following:

Data science architectural designs: Value creation from data requires effective decision-making at scale in order to design and deploy data science solutions that lead to actionable recommendations. For effective model deployment, the data architecture needed for production systems should be designed in the pilot phase. Machine learning models need to be flexible to accommodate baseline scenarios without major redesign.

Computing resources are generally scarce, so the compute performance of models should be a major consideration while deploying such models in a production environment. Data flows are another important consideration. Models can be trained in Python offline, translated using Predictive Model Markup Language (PMML) and deployed using Google Cloud Dataflow. MLflow, an open source project that tracks metrics, combined with Azure Machine Learning can also be a robust way to scale data science products. Dynamic data flows need to be connected to the model for these algorithms to perform accurately. Model deviations occurring because of broken data feeds are difficult to detect in the production environment and are immune to conventional software testing tools.
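Because broken data feeds tend to slip past conventional software tests, one option is to validate each incoming batch before it reaches the model. The sketch below is only an illustration under stated assumptions: the function name, the thresholds and the idea of representing a batch as a list of optional numbers are all hypothetical, and a real pipeline would check many more properties.

```python
from datetime import datetime, timedelta


def feed_problems(values, last_event_time, now,
                  max_null_rate=0.05,
                  max_staleness=timedelta(hours=1),
                  min_rows=100):
    """Return a list of human-readable problems found in one batch of a feed.

    All thresholds are illustrative defaults, not recommended values.
    """
    problems = []
    if len(values) < min_rows:
        problems.append(f"too few rows: {len(values)} < {min_rows}")
    if values:
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            problems.append(
                f"null rate {null_rate:.1%} exceeds {max_null_rate:.1%}")
    if now - last_event_time > max_staleness:
        problems.append("feed is stale")
    return problems


# Fixed timestamps keep the example deterministic.
now = datetime(2020, 11, 11, 12, 0)

healthy = feed_problems([1.0] * 200, now - timedelta(minutes=5), now)
broken = feed_problems([None] * 10 + [1.0] * 40, now - timedelta(hours=3), now)
```

A non-empty result could be used to pause scoring or raise an alert, so that a silent feed failure does not quietly degrade predictions.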

Data governance framework: An effective data governance and regulatory framework is required to steer data science applications in the direction of value creation. As data regulation continues to evolve, data science models need to be compliant with regulatory norms such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). A 2019 consumer privacy survey by Cisco found that 84% of the survey respondents cared about privacy when it comes to not just their own data but that of other members of society as well. They also want more control over how their data is being used. Of this group, 80% also said they are willing to act to protect it.

The GDPR and CCPA regulations impose restrictions on the way organizations can collect and process customer and personal data. These norms suggest consent should be sought from customers on the usage of their data and require organizations to use jargon-free language in marketing campaigns to ensure transparency in messaging. This influences the model development and deployment process to a great extent. A lack of compliance with regulatory frameworks jeopardizes the entire data science roadmap and can lead to legal complications and severe brand and reputation damage.

Unintended implications: An important unintended consequence of productizing data science through applications such as chatbots could be loss of human control and empathy. These chatbots interface with customers, and most have standard responses to questions based on scenarios. There may be scenarios when algorithms do not show empathy, which may create a customer friction point. By institutionalizing machine learning algorithms as products, organizations may lose human control over operations to a certain extent, and any error in the production system can have far-reaching negative consequences.

To effectively create value from data, organizations need to proactively mitigate unintended consequences associated with data usage and model production. Most algorithms are not biased, but the underlying data is. Preprocessing data to identify skewness in variables and training algorithms with unbiased datasets are the first steps in achieving fairness. To address this problem, Python offers libraries such as FairNN and Parity Fairness (distributed through PyPI), while R offers a fairness package available for download through CRAN. These libraries and packages compute fairness metrics and compare them across population subgroups. The next step is to effectively process the data transformations, ensuring the variable imputations do not pick up bias in gender, ethnicity or geography.
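To make the idea of comparing a fairness metric across population subgroups concrete, here is a minimal standard-library sketch. It computes one common metric, the demographic parity difference (the gap between the highest and lowest rate of positive predictions across groups). It does not use or reproduce the API of any library named above; the function names and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Share of positive (1) predictions within each subgroup."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        counts[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / counts[group] for group in counts}


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest subgroup selection rates (0 = parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy data: binary model decisions for two hypothetical subgroups "a" and "b".
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(predictions, groups)
gap = demographic_parity_difference(predictions, groups)
```

Here group "a" receives positive decisions three times out of four and group "b" once out of four, so the gap is 0.5. What size of gap is acceptable depends on the application and is a policy decision, not something the metric decides.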

Peak of Revolution

We are at the peak of the data science revolution. Artificial intelligence applications are increasing across all industries. We have enormous opportunities ahead of us and massive risks underlying these opportunities. Productizing data science is needed to help organizations move up the data value chain to leverage the benefits of scale and automation. However, identifying the data science product risks early in the lifecycle and minimizing them is important to avoid unintended consequences. This will help create value from data and achieve sustainable outcomes that will benefit not just organizations but humanity.

References

https://azure.microsoft.com/en-us/blog/make-your-data-science-workflow-efficient-and-reproducible-with-mlflow/
https://www.datapred.com/blog/productizing-machine-learning-models-what-is-required
Andrew Kelleher and Adam Kelleher, 2019, “Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications,” 1st edition, Addison-Wesley.

Srujana Kaddevarmuth

Srujana Kaddevarmuth is director of Data Science & Value Realization at Walmart Inc., where she leads a data science portfolio to solve unique business use cases and drive quantifiable data value. Previously, Kaddevarmuth held positions with Accenture and Hewlett Packard, all in data science leadership capacities. Along with a master’s degree in operations research, she completed executive programs in analytics strategy management at Harvard University and for women leaders at Stanford University. She is on the governing council of Analytical Society of India and co-founder of Women in Machine Learning and Data Science (WiMLDS), Bengaluru chapter. She is also a Women in Data Science (WiDS, a Stanford University initiative) ambassador.

Bill Groves

Bill Groves, senior vice president and chief data officer at Walmart Inc., is a technology leader and executive with more than 20 years of expertise in the technology and data-enabled business solutions space. Prior to joining Walmart, he was chief data scientist and artificial intelligence officer at Honeywell. He also serves as a faculty member at the International Institute for Analytics. He has an MBA in innovation and technology management from the University of Delaware and has been on the advisory board for a few well-known consumer forums including the SAS Analytics consumer advisory board.
