September 6, 2024 in Data Quality

Using Six Sigma to Measure and Improve Data Quality

SHARE: PRINT ARTICLE:print this page https://doi.org/10.1287/orms.2024.03.14

The most important principle in software engineering since about the 1950s is “GIGO” (garbage in, garbage out) [1]. This principle has wide applicability beyond software development. The quality of the output is only as good as the quality of the input. There is no point in applying super-sophisticated algorithms to make inferences and forecasts and draw conclusions if the input data is questionable.

Data drives decisions, informs strategies and supports operations across all sectors. Poor data quality can lead to significant business losses and risks [2]. For example, the poor decisions made by real estate company Zillow because of incorrect analytic models ended up costing the company $304 million in write-downs [3]. Although strategic decisions succeed or fail based on many criteria besides bad data, the quality of data does play an important role. For example, Target’s failed expansion into Canada was partly because of poor data quality in their inventory system, leading to stockouts and overstocking, which contributed to their $5.4 billion loss and eventual withdrawal from the market [4]. Similarly, J.P. Morgan’s $6.2 billion trading loss in the “London Whale” incident was exacerbated by inaccurate risk modeling data [5].

This article, based on successful programs at large companies, explores how Six Sigma, a well-established methodology for process improvement, can be used in an innovative way to enhance data quality.

Defining Data Quality

Data quality refers to factors such as accuracy, completeness, reliability, recency and relevance. Types of errors that can compromise data accuracy and reliability can be classified into three broad groups:

  1. Database errors include errors in format and constraints. For example, dates can be in either “MM/DD/YYYY” or “DD/MM/YYYY” formats. Inconsistencies arise when dates are stored in these two formats in the same data table. Constraint errors happen when the data values violate one or more business rules or constraints. For example, a user may fill in a fake email address (such as [email protected]), fake ZIP code (such as 11111) or nonresidential phone number (such as 1-800-555-1212).
  2. Outliers are data points significantly different from others in the data set. A general rule of thumb for a numerical outlier is any value that is 1.5 times the interquartile range, above the third quartile or below the first quartile. Outliers either point to a change in trend, a genuine anomaly, or an error in the transaction processing system. For instance, an age field showing an age higher than 110, an income field showing an individual annual income of $100 billion, or an age field showing a value less than 18 on a financial contract (because minors cannot enter into legal contracts) would be considered outliers.
  3. Semantic inconsistencies are generally very subtle. They occur when a data field is used for a purpose for which it was not originally intended (such as adding marketing discount plan codes into a field meant to hold vendor identifiers or allocating a range of customer identifiers to indicate high-value customers).

Measuring Data Quality

Measuring data quality involves evaluating various attributes of data, such as:

  • Accuracy: Does the data correctly describe the real-world object or event, including the absence of errors of various types?
  • Completeness: Is the data telling you only part of the truth or the complete truth? Usually, this has to do with the complete data record rather than just one field. For example, when demographic data is collected, some of the input data may be optional. The resulting data set may have gaps in the complete demographic profile. Another subtle defect is that the demographic data inputs might be fake. In other words, the data may tell the whole truth and also some untruths.
  • Reliability: Data may be complete and accurate, but is it reliable? Is the data biased? Is the variance too high?
  • Relevance: Data may be accurate, complete and relevant; however, is it relevant to solve the problem at hand? Is it relevant for decision-making or operational efficiency?
  • Consistency: Is there an absence of differences when comparing two or more representations of the same data?
  • Recency: How up-to-date and available is the data? The quality of the data might be perfect, but if it is outdated, it would result in operational and analytical issues.

Methods to Improve Data Quality

There are many approaches to dealing with poor data quality. The best way is to prevent poor quality of data in the first place. This is generally very difficult to achieve, especially in larger companies that grow through mergers and acquisitions (M&A). Errors in data may be detected during business operations and when performing analysis (in companies that have an active analytics program). A significant portion of an analyst’s time is spent in cleaning the data. There are three general strategies for dealing with bad data:

  1. Fix data errors for analysis only: Clean the data before using it for analysis. It is a sign of data analytic maturity to report data errors to the business operations stakeholders so they can be fixed at the source as often as possible.
  2. Fix errors in source systems: Prevent errors at the source or correct them as soon as possible. This ensures that data remains accurate over time and across different uses.
  3. Prioritize data fields: Given limited resources, it is essential to prioritize which data fields to clean up. This can be challenging and requires understanding the data that is most critical to business operations. Six Sigma provides a way to help prioritize data fields for fixing quality issues.

Six Sigma Methodology for Improving Data Quality

Six Sigma is a data-driven methodology for eliminating defects and improving processes in any organization. The methodology is named for “six standard deviations,” or 3.4 defects per million opportunities (DPMO). DPMO measures process performance, quantifying the number of defects in a process per 1 million opportunities (ways in which data can be defective) [6]. Here is a step-by-step guide for effectively using Six Sigma for improving data quality:

  1. Identify the most critical business area. This involves discussions with the business stakeholders and analyzing the impact of each business area on the metrics, such as market share, profitability, customer satisfaction, operational excellence and regulatory compliance.
  2. Identify all data fields that contribute to the calculation of the business metrics that the chosen business area directly impacts. This helps the company focus on the data fields that impact the most important business area.
  3. For each data field:
    1. Define the impact of bad-quality data on the business metric. For example, if the data had a 1% error rate, what would its impact be on the business metric?
    2. Define what constitutes an error. Using this definition, calculate the number of faulty instances of that data field. Note that the DPMO calculation is modified here to focus on one field at a time, rather than a data record (or unit of processing), which is the usual procedure for calculating the DPMO for a process. This modification is necessary because the goal here is to improve data quality rather than process quality.
    3. Calculate the proportion of defectives (p = No. of defective instances / No. of all instances).
    4. Calculate the DPMO of the data field (d = p × 1,000,000). The DPMO provides an intuitive feel for the number of defects in every million opportunities.
    5. Calculate the Sigma level using the Excel formula, s = norm.s.inv(1 – p) + 1.5, or estimate it by using a Sigma level lookup table.
    6. Determine the actual impact b of the DPMO or Sigma level of the data field on the business metric.
  4. Repeat Step 3 for each data field. When done, this process yields a set of b-values for all the data fields that impact the key business metrics for the given business area.
  5. Prioritize the list of b-values and start with determining ways to improve the quality of the data whose poor quality impacts the business area the most. These strategies should be primarily directed at improving the data intake during business operations.

Although calculating DPMO seems redundant or overly complicated compared with just the error rate, DPMO is a common measure of process quality. It provides a common basis for comparing data quality across different data fields and the impact of bad data on processes. Experience shows that this method results in healthy discussion and debate about the value of data quality and its impact on business processes.

Six Sigma is naively used to suggest that all processes and data should be at Six Sigma quality. In practice, this is neither true nor necessary. For example, airline safety is generally above the sigma level of 6.0 [7], whereas airline baggage handling is approximately 4.0 [8]. The target sigma level should be determined by business considerations. For example, it is much more important for the airline industry to deploy limited resources to improve airline safety than baggage handling.

Steps 3-5 above should be repeated for the rest of the data fields that have not been addressed. Moreover, this is a continual, iterative, endless process of data quality improvement. It is never ending because business processes change, quality controls may fail or trends may shift. In larger companies, the challenge of integrating new data acquired through M&A activity always exists.

Conclusion

Six Sigma and DPMO are practical and powerful methods for improving data quality. By prioritizing business functions and focusing on key data fields, companies can achieve significant improvements in data quality. This not only enhances decision-making but also reduces risks and operational inefficiencies. Embracing Six Sigma for data quality management is a strategic move that can drive long-term business success.

References

  1. https://www.techtarget.com/searchsoftwarequality/definition/garbage-in-garbage-out
  2. Susan Moore, 2018, “How to stop data quality undermining your business,” Gartner, January 18, https://www.gartner.com/smarterwithgartner/how-to-stop-data-quality-undermining-your-business.
  3. Rachel Metz, 2021, “Zillow’s home-buying debacle shows how hard it is to use AI to value real estate,” CNN Business, November 9, https://edition.cnn.com/2021/11/09/tech/zillow-ibuying-home-zestimate/index.html.
  4. Joe Castaldo, 2016, “The last days of Target,” Canadian Business, January 1, https://canadianbusiness.com/ideas/the-last-days-of-target-canada/.
  5. https://flowandebb.com/insights/learning-from-the-london-whale/
  6. https://www.sixsigma-institute.org/What_Is_Sigma_And_Why_Is_It_Six_Sigma.php
  7. IATA, 2024, “2023 safest year for flying by several parameters,” IATA Pressroom, February 28, https://www.iata.org/en/pressroom/2024-releases/2024-02-28-01/.
  8. Robert Silk, 2024, “Airlines’ baggage handling improved in 2023,” Travel Weekly, June 17, https://www.travelweekly.com/Travel-News/Airline-News/Airline-baggage-handling-improved-2023

Kiran Garimella

SHARE:

INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.