March 1, 2022 in Data and AI Trends

Four Major Synthetic Data and AI Trends for 2022

Tobias Hann

https://doi.org/10.1287/LYTX.2022.02.06

Deploying artificial intelligence (AI) in business operations such as insurance and financial services is no longer optional. Companies looking to keep their market shares are getting used to the idea of using AI tools, and 2022 will be a year with a steep learning curve. One of the most important topics C-level executives need to understand is why AI is the very tool they need for building AI. Using tools such as synthetic data generators powered by sophisticated AI engines will help companies unlock the value of sensitive customer data while keeping the privacy of their customers protected and in compliance with data protection regulations such as GDPR and CCPA. That said, the privacy use case is only one of the reasons behind the popularity of synthetic data. It can be quickly generated in abundance and has been proven to drastically improve machine learning performance. As a result, it is often used for advanced analytics and AI training, such as predictive algorithms, fraud detection and pricing models.

Synthetic data can provide better-than-real training data for AI models by upsampling rare events in training data, which helps AI models learn more efficiently and has been shown to improve model performance by as much as 15%. Synthetic data generators use sophisticated AI algorithms that learn the characteristics of a real production dataset and create new artificial datasets that are indiscernible from the original, yet fully anonymous. With synthetic data, you can also change some of the characteristics of the original to better train models and make them more accurate. For example, you can increase the incidence of rare events in the dataset so that algorithms learn these rare patterns more efficiently (e.g., in fraud detection), or boost the representation of underrepresented groups (e.g., women or people of color).
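To make the idea of upsampling rare events concrete, here is a minimal sketch in Python. It is not how a synthetic data generator works: it uses plain random oversampling (duplicating existing rare rows) as a simple stand-in, whereas a generator would create new artificial records. The function name `upsample_rare_class`, the `fraud` column and the 20% target share are all assumptions chosen for illustration.

```python
import math
import random


def upsample_rare_class(rows, label_key, rare_value, target_share, seed=0):
    """Duplicate rare-class rows at random until they make up at least
    `target_share` of the returned dataset (random oversampling)."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")

    rng = random.Random(seed)
    rare = [r for r in rows if r[label_key] == rare_value]
    common = [r for r in rows if r[label_key] != rare_value]
    if not rare:
        raise ValueError("no rare-class rows to upsample")

    # Smallest rare count n with n / (n + len(common)) >= target_share.
    needed = math.ceil(target_share * len(common) / (1 - target_share))
    extra = max(0, needed - len(rare))

    # Copy each sampled row so later edits do not touch the originals.
    resampled = rare + [dict(rng.choice(rare)) for _ in range(extra)]
    result = common + resampled
    rng.shuffle(result)
    return result


# Hypothetical example: 2 fraud cases among 100 transactions.
transactions = [{"fraud": 0} for _ in range(98)] + [{"fraud": 1} for _ in range(2)]
balanced = upsample_rare_class(transactions, "fraud", 1, target_share=0.2)
fraud_share = sum(r["fraud"] for r in balanced) / len(balanced)
print(f"fraud share after upsampling: {fraud_share:.2f}")
```

Because duplicated rows add no new information, this simple approach can encourage a model to memorize the few rare cases it has seen, which is one reason to generate new records instead.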

According to Gartner, by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. Responsible AI in particular cannot exist without synthetic data. The process of synthetization can rebalance datasets with embedded biases, for example, by generating more records of high-earning women to correct a gender imbalance. The end result is 100% privacy-compliant synthetic data that is suitable for providing explainability to those building AI and those checking the model for compliance and ethics. What’s more, synthetic data can be flexibly shaped and is therefore excellent for model-agnostic AI testing. You can generate realistic or highly unlikely scenarios and see how your AI system behaves when fed with those variables.

The company MOSTLY AI pioneered the creation of synthetic data for AI model development and software testing. With this space moving so quickly, here are four trends that we expect to happen in AI and synthetic data in 2022:

1. Bias in AI will get worse before it gets better.

Many AI systems that interact with customers and make decisions about people have never been audited for fairness and discrimination, and their training data has never been augmented to fix embedded biases. It is only through massive scandals that companies are finding out and learning the hard way that they need to pay more attention to biased data and use fair synthetic data instead. Consumers will find themselves subjected to blatantly discriminatory decisions made by unchecked AI systems already in production. At the same time, consumers’ data and AI literacy is increasing, and the combination of those two forces will create the perfect storm in highly sensitive markets, such as banking and health insurance. Throughout 2022, expect more informed customer decisions and a demand for mass-market explainability.

2. Companies’ data assets will freeze up owing to regulations and declining customer consent.

According to the most recent developments, Germany is likely to abolish the retention of telecommunication data once and for all. It is only a matter of time before a federal data privacy law comes into force in the United States. Regulations all over the world are getting stricter – even China has a data protection policy in place now. Using customer data is also getting increasingly difficult for other reasons: people are more privacy conscious and are increasingly likely to refuse consent to the use of their data for analytics purposes. As a result, companies are running out of relevant and usable data assets. Companies will come to understand that synthetic data is the way out of this dilemma because it is an artificially generated version of the data that is indiscernible from the original datasets but contains none of the original data points. Synthetic datasets contain all the value of the data without any sensitive information.

3. Every company that uses AI models will at least experiment with synthetic data.

Synthetic data is better than real data when it comes to AI training, and it can be freely shared across teams and organizations. AI and machine learning algorithms simply perform better when trained with upsampled, augmented and bias-corrected synthetic data, picking up on patterns more efficiently without overfitting. It’s important to choose well when it comes to synthetic data generators – some perform better than others. If a generator is not accurate enough, the resulting synthetic data can lead your data science team astray. If it’s too accurate, the generator has overfitted, that is, learned the training data too well, and could accidentally reproduce some of the original information from the training data. Doing a thorough proof of concept (POC) with synthetic data vendors – giving attention to the particular data types your company works with frequently – really pays off. For example, if you need highly accurate and privacy-safe synthetic geolocation data, MOSTLY AI could be a great choice.
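One simple check that could be part of such a proof of concept is to measure how often a generator reproduces training records. The sketch below is an illustrative assumption, not any vendor's method: it treats each record as a tuple of numbers, measures the Euclidean distance from every synthetic record to its nearest training record, and reports the share that are exact (or near-exact) copies. The names `nearest_distance` and `copy_rate` and the sample values are hypothetical.

```python
import math


def nearest_distance(record, reference):
    """Euclidean distance from `record` to its closest row in `reference`."""
    if not reference:
        raise ValueError("reference data is empty")
    return min(math.dist(record, ref) for ref in reference)


def copy_rate(synthetic, training, tolerance=1e-9):
    """Share of synthetic records lying within `tolerance` of a training
    record, i.e. records the generator has effectively reproduced."""
    if not synthetic:
        raise ValueError("synthetic data is empty")
    copies = sum(
        1 for s in synthetic if nearest_distance(s, training) <= tolerance
    )
    return copies / len(synthetic)


# Hypothetical numeric records (for example, scaled age and income).
training = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
synthetic = [(1.0, 2.0), (2.9, 4.2), (10.0, 10.0), (5.0, 6.0)]

print(copy_rate(synthetic, training))  # 0.5: two of four are exact copies
```

A high copy rate suggests the generator has memorized its training data. A rate of zero does not prove the data is privacy-safe, so a check like this complements a fuller privacy and accuracy evaluation rather than replacing it.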

4. Synthetic data will be standardized with globally recognized benchmarks for privacy and accuracy.

Highly realistic synthetic data can come in many shapes and forms. To start, there is a world of difference between what we call structured and unstructured synthetic data. Unstructured data could mean images and text, for example, whereas structured data is mainly tabular in nature. There are many providers of open source and proprietary synthetic data out there for both kinds of synthetic data, and the quality of their generators varies widely. It’s time to establish a synthetic data standard to ensure that synthetic data users get consistently high-quality synthetic data. MOSTLY AI is already working on structured standards for synthetic data.

If we are to address the true issues of compliance, regulatory requirements, customer privacy, bias and performance with AI models, businesses are going to need to rely on tools such as synthetic data generators. Companies no longer need to be convinced of the value of AI models; rather, they grapple with how to deploy the models in a way that drives high performance while protecting customer data and staying in compliance. This will be the biggest challenge, and 2022 will be the year that companies begin to understand how to put these pieces together.

Tobias Hann

Tobias Hann is CEO of MOSTLY AI. Before joining MOSTLY AI, Hann worked as a management consultant with the Boston Consulting Group and as co-founder/MD of three startups. He holds a Ph.D. from the Vienna University of Economics and Business and an MBA from the Haas School of Business, UC Berkeley.
