September 13, 2023 in Machine Learning

4 Machine Learning Model Testing Criteria That Drive AI Performance

Shayak Sen


https://doi.org/10.1287/LYTX.2023.04.02

Artificial intelligence (AI) has a quality problem. And it’s big.

According to Gartner, 54% of machine learning (ML) models never make it into production, primarily because of poor quality. Sometimes, the models don’t work as intended, and developers can’t identify the cause to fix it. Other times, the models work but aren’t explainable, so they can’t be deployed for compliance reasons. Sometimes, even when the model’s results are explainable, stakeholders don’t trust them and they aren’t deployed.

Enterprise software experienced a similar phenomenon in the 1990s. Automated testing and monitoring then emerged and, by ensuring quality, enabled enterprise software use to skyrocket.

We’re at a similar inflection point with AI and ML models, and the advent of generative AI has made ensuring quality even more important.

What Developers Need to Do to Ensure Model Quality

There are four important ML model testing criteria: performance, drift, bias/fairness and data quality. Developers need all four to ensure their models make it over the finish line.

Performance testing focuses on the correctness of a model prediction for a data set by evaluating the model against accuracy metrics commonly used by data scientists. Aspects to evaluate in performance testing include:

Segments: Assessing performance across specific data segments that are important to the use case. In many situations, although overall performance might be fine, bad performance in small pockets of data can have an oversized impact on the use case.
Multiple criteria: Rather than relying on a single metric, evaluate several, such as false positives, true positives, false negatives, true negatives, AUC, accuracy and recall. For generative use cases, it is critical to take into account additional factors such as relevance, toxicity and sentiment.
Human feedback: Especially for generative AI, it is critical to keep human feedback in mind because humans are the eventual judges of the quality of these applications.
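
As a minimal sketch of the segment and multiple-criteria checks described above, the following Python computes several metrics overall and per segment for a binary classifier. The function names, the toy labels and the segment names "A" and "B" are illustrative assumptions, not part of any particular tool.

```python
from collections import defaultdict

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Report several criteria at once instead of a single number."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    nan = float("nan")
    return {
        "n": total,
        "accuracy": (tp + tn) / total if total else nan,
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
    }

def metrics_by_segment(y_true, y_pred, segments):
    """Group rows by segment label and compute the same metrics per group."""
    grouped = defaultdict(lambda: ([], []))
    for t, p, s in zip(y_true, y_pred, segments):
        grouped[s][0].append(t)
        grouped[s][1].append(p)
    return {s: metrics(ts, ps) for s, (ts, ps) in grouped.items()}

# Toy data (hypothetical): segment "B" is small but performs badly.
y_true   = [1, 0, 1, 0, 1, 0, 1, 1]
y_pred   = [1, 0, 1, 0, 1, 0, 0, 0]
segments = ["A", "A", "A", "A", "A", "A", "B", "B"]

overall = metrics(y_true, y_pred)
by_segment = metrics_by_segment(y_true, y_pred, segments)
print("overall:", overall)
for name, result in sorted(by_segment.items()):
    print(name, result)
```

In this toy data, overall accuracy looks acceptable while every prediction in segment "B" is wrong, which is the kind of small pocket of bad performance that an aggregate metric hides.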

Drift testing focuses on comparing two different data sets (most often a training set versus a test set) and ensuring that the model doesn’t behave unexpectedly when placed in a different environment. You’re essentially looking for decay in a model’s predictive power resulting from a change in the real-world environment. For example, during the COVID-19 pandemic, travel ground to a halt and many credit-decisioning models broke because so many were dependent on travel frequency as a variable. As a rule, drift in model input leads to drift in model output.

There are many types of model drift, including covariate, concept and data drift. One way to measure drift is Model Score Instability (MSI): the change in the distribution of a model’s scores over time or across different data sets. MSI can be quantified with metrics such as Wasserstein distance, Population Stability Index (PSI) and difference of means.
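
The three score-distribution metrics named above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the PSI uses equal-width bins over the reference range with a small floor (`eps`) to avoid taking the log of zero, and the score lists are made up.

```python
import math
from bisect import bisect_right

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index with equal-width bins over the expected range.

    Assumption: values outside the expected range are clamped into the
    first or last bin, and empty bins are floored at eps.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def wasserstein_1d(x, y):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total

def difference_of_means(expected, actual):
    """Signed shift in the average score."""
    return sum(actual) / len(actual) - sum(expected) / len(expected)

# Hypothetical model scores: a reference set and a shifted set.
reference_scores = [i / 100 for i in range(100)]
shifted_scores = [s + 0.3 for s in reference_scores]

print("PSI:", psi(reference_scores, shifted_scores))
print("Wasserstein:", wasserstein_1d(reference_scores, shifted_scores))
print("Mean difference:", difference_of_means(reference_scores, shifted_scores))
```

Comparing a data set against itself yields zero on all three metrics, and larger values indicate a larger shift in the score distribution. What threshold counts as unacceptable drift depends on the use case.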

For generative use cases, drift can be calculated in an embedding space: a numerical space in which unstructured data is converted to measurable numbers.
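
One simple way to illustrate drift in an embedding space is to compare the centroids of two sets of embedding vectors. The sketch below assumes the embeddings have already been produced by some model and uses cosine distance between centroids, which is only one of several possible choices. The two-dimensional vectors are made up for readability.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def embedding_drift(reference, current):
    """Cosine distance between the centroids of two embedding sets."""
    return cosine_distance(centroid(reference), centroid(current))

# Hypothetical embeddings of prompts seen at two different times.
reference_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
current_embeddings = [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]

print("drift:", embedding_drift(reference_embeddings, current_embeddings))
```

A centroid comparison only captures a shift in the average position of the data. Changes in spread or the appearance of new clusters would need a richer comparison.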

Bias/fairness testing focuses on protected groups, particularly those with a limited amount of data, and ensuring that the outcomes/results of a model are not skewed against them. There are numerous examples of bias in ML models. For instance, in November 2019, an Apple Card user noted that Apple’s AI algorithm granted him 20 times the credit limit that his wife received. This disparity came as a major surprise because the couple shared assets and she had a higher credit score than he did.

Testing needs to ensure that protected groups don’t receive unfavorable treatment as the result of a model’s prediction. Specifically, testers need to determine the disparity of outcomes of an identified group versus the rest of the population and evaluate the statistical parity difference and true positive/negative ratio. Often, developers can substitute variables that may otherwise lead to disparities, such as race, sex, age or even ZIP code.
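
The disparity measures mentioned above can be sketched as follows for a binary classifier where a prediction of 1 is the favorable outcome. The code computes the statistical parity difference and, as one assumed way of comparing true positives across groups, the difference in true positive rate between an identified group and the rest of the population. All data and names are illustrative.

```python
def positive_rate(y_pred):
    """Share of rows that received the favorable prediction (1)."""
    return sum(y_pred) / len(y_pred)

def true_positive_rate(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_positives:
        return float("nan")
    return sum(preds_for_positives) / len(preds_for_positives)

def fairness_report(y_true, y_pred, in_group):
    """Compare an identified group against the rest of the population."""
    rows = list(zip(y_true, y_pred, in_group))
    g_true = [t for t, _, m in rows if m]
    g_pred = [p for _, p, m in rows if m]
    r_true = [t for t, _, m in rows if not m]
    r_pred = [p for _, p, m in rows if not m]
    return {
        # Negative values mean the group receives favorable outcomes less often.
        "statistical_parity_difference": positive_rate(g_pred) - positive_rate(r_pred),
        "true_positive_rate_difference": (
            true_positive_rate(g_true, g_pred) - true_positive_rate(r_true, r_pred)
        ),
    }

# Toy data (hypothetical): first four rows belong to the identified group.
y_true   = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred   = [1, 0, 0, 0, 1, 1, 1, 0]
in_group = [True, True, True, True, False, False, False, False]

report = fairness_report(y_true, y_pred, in_group)
print(report)
```

Values near zero suggest similar treatment across the two populations on these measures. With small groups the estimates are noisy, so they are best read alongside the group sizes.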

Data quality testing focuses on identifying data elements that could mislead a model, so that models are built on high-integrity data. This is especially critical for generative AI use cases that depend heavily on the quality of the data sources provided during fine-tuning or prompt testing.
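
The article does not prescribe specific data quality checks, so the following is only a sketch of two common ones, assumed here for illustration: counting missing values and counting numeric values outside an expected range. The field names and ranges are hypothetical.

```python
def data_quality_report(rows, valid_ranges):
    """Count missing and out-of-range values per field.

    rows: list of dicts, one per record.
    valid_ranges: dict mapping field name to (min, max), both inclusive.
    """
    report = {}
    for field, (lo, hi) in valid_ranges.items():
        values = [row.get(field) for row in rows]
        missing = sum(1 for v in values if v is None)
        out_of_range = sum(1 for v in values if v is not None and not (lo <= v <= hi))
        report[field] = {"missing": missing, "out_of_range": out_of_range}
    return report

# Hypothetical records with one missing age, one impossible age
# and one negative income.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 212, "income": -10},
]
valid_ranges = {"age": (0, 120), "income": (0, 10_000_000)}

quality = data_quality_report(rows, valid_ranges)
print(quality)
```

Running checks like these before training or fine-tuning makes it easier to catch problem records while they are still cheap to fix.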

ML teams that perform all four types of tests will be best positioned to improve their deployment rate and get more models into production – delivering on the promise of AI at their organizations.

Testing throughout the model development cycle is crucial to ensure the quality and reliability of machine learning models. By testing early and frequently, data scientists and ML engineers can identify and address issues before they become more difficult and expensive to fix.


Shayak Sen is cofounder and chief technology officer of TruEra.
