September 27, 2024 in Generative AI

How to Make Generative AI Better for Non-English Speakers

Lakshmikanth Alluru


https://doi.org/10.1287/LYTX.2024.04.07

Advances in generative artificial intelligence (GenAI) are creating a gap between English and non-English speakers. On one hand, English speakers enjoy access to information at a level that has never been available before. Simply put, they can access more information faster. On the other hand, many non-English speakers deal with worse results, higher costs and more frustration when using chatbots, content moderation systems and search engines. The primary issue is that although there are roughly 7,000 languages worldwide, there is only one “extremely high-resource” language – English. In addition, only a small fraction of those languages are considered “high-resource.” This group includes Russian, German, Chinese, Japanese, French, Spanish, Italian, Dutch, Polish, Portuguese and Vietnamese, among others.

Those who don’t speak English or one of the few high-resource languages are left with a GenAI product that lacks the data and training needed to perform optimally. To rectify this situation, it’s vital to adopt global efforts to expand AI technologies to more languages. It is also essential to address current challenges such as data scarcity, the handling of dialect variations and the evaluation of output quality in non-English languages. If inclusivity for more languages isn’t guaranteed, only some people may end up with easy access to information.

Why English Has Become the “Default Language” for GenAI

English is a highly resourced language. It is the dominant language in science – as much as 98% of all papers are published in English, according to theconversation.com. Combine English’s scientific popularity with its prominent use on the internet (51.3% of all web pages are hosted in the United States) and its dominance in popular culture, international politics and higher education, and the result is a marketplace in which English-language resources dwarf the resources of all other languages. This is important because it means GenAI training models have many more English resources to draw from, and the more training resources available, the better and more comprehensive the finished product.

As a result of this resource gap, a clear English bias has emerged in GenAI training. The resource gap refers to the difference in the availability of high-quality digitized text between English and other languages. For instance, take OpenAI’s popular GPT-3 model; some figures indicate that its training data consisted of more than 90% English text (statistics aren’t yet available for GPT-4). The high percentage of English text means English queries have a better chance of being answered accurately, so English speakers will likely enjoy a better user experience than speakers of less-resourced languages such as Thai or Greek.

GenAI Performance Differs Widely by Language Used

Although English users have grown accustomed to fast, accurate query responses, speakers of other languages are having much different experiences. In a study titled “ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning,” researchers found that ChatGPT’s performance is better for English than for other languages, especially for higher-level tasks requiring more complex reasoning abilities. ChatGPT can also perform better with English prompts, even though the task and input texts are intended for other languages. The researchers also discovered that performance gaps between English and other high-, medium-, low- and extremely low-resource languages “are usually very large,” and prompting ChatGPT with English tends to produce better results for multilingual question-answering (QA) prompts than using target languages.

Backing up the ChatGPT AI model findings, FLTMAG, a technology magazine, recently reported that Bing Chat performs better in English than in Spanish and other languages. Others say Google’s Gemini performed well for English, Russian and other high-resource languages, but performance fell for more obscure languages, including Yoruba and Māori.

Another issue related to the dominance of English in GenAI training is heavier tokenization for speakers of other languages: their languages require more tokens to express the same content than English does. Tokens are the smallest units of text that a large language model (LLM) can process, typically words, subwords or characters. During training and inference, LLMs generate output by predicting which token is likely to come next in a sequence of tokens. Another way to view tokens is as a currency that exists within the economy of an LLM, so heavier tokenization means that non-English speakers pay more in tokens (for worse performance) than English speakers. According to a study entitled “Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models,” Arabic speakers pay nearly four times what English speakers pay in tokens, and Greek speakers pay more than five times as much. Heavier tokenization also leads to slower performance, because more tokens must be processed than for an equivalent English prompt, and it shrinks the usable prompt, because a fixed context window holds less information when each word consumes more tokens.
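The tokenization cost described above can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the function names and the 8,000-token context window are assumptions made for this example, not figures from the cited study. It uses UTF-8 bytes per character as a rough proxy for why some tokenizers that operate on raw bytes start at a disadvantage for non-Latin scripts; real token counts depend on the specific tokenizer and its learned vocabulary.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`.

    A rough proxy only: byte-based tokenizers begin from these bytes,
    but real token counts depend on the tokenizer's learned vocabulary.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


def token_premium(target_tokens: int, english_tokens: int) -> float:
    """How many times more tokens a language needs than English
    to express the same content."""
    if english_tokens <= 0:
        raise ValueError("english_tokens must be positive")
    return target_tokens / english_tokens


def effective_context(window_tokens: int, premium: float) -> int:
    """English-equivalent content that fits in a fixed context window
    when every unit of content costs `premium` times as many tokens."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return int(window_tokens / premium)


if __name__ == "__main__":
    # Words written with escapes so the byte counts are unambiguous.
    samples = {
        "English": "hello",
        "Greek": "\u03b3\u03b5\u03b9\u03b1",
        "Thai": "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35",
    }
    for language, word in samples.items():
        print(language, utf8_bytes_per_char(word))

    # Hypothetical 8,000-token window; 4x is the approximate Arabic
    # premium quoted in the article.
    print(effective_context(8000, 4.0))
```

Under these assumptions, a language with a fourfold token premium can fit only about a quarter as much content into the same window, and a per-token bill for the same content is about four times higher.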

How to Improve GenAI Performance for Non-English Speakers

The first step to improve GenAI performance for non-English speakers is to increase the number and quality of data sets given to the LLM for less-resourced languages. For example, companies could establish minimum resource requirements for non-English languages to ensure enough content is available to improve performance.

Meeting resource thresholds is not enough. There is a common saying with AI – “garbage in, garbage out.” GenAI is only as good as the data it utilizes. To generate accurate responses, the data fed into models must be of high quality. One effective way to improve data quality is for GenAI creators to develop a hybrid data evaluation strategy that includes AI and humans. This combination of technology and human experience will allow for greater fine-tuning of large data sets, which is essential because within each language, dialects and other linguistic subtleties can significantly impact the accuracy of the GenAI response.
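One way to picture a hybrid evaluation strategy is as a triage step: an automated scorer handles the clear cases, and human reviewers who speak the language handle the uncertain ones. The Python sketch below is a minimal illustration under stated assumptions. The `Sample` type, the `auto_score` field and both thresholds are hypothetical, and no particular scoring model is implied.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    text: str
    language: str
    auto_score: float  # hypothetical automated quality score in [0, 1]


def triage(
    samples: List[Sample],
    accept_at: float = 0.9,     # assumed threshold, tune per language
    reject_below: float = 0.5,  # assumed threshold, tune per language
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Split samples into (accepted, needs_human_review, rejected)."""
    accepted, review, rejected = [], [], []
    for sample in samples:
        if sample.auto_score >= accept_at:
            accepted.append(sample)   # confident enough to keep
        elif sample.auto_score < reject_below:
            rejected.append(sample)   # confident enough to drop
        else:
            review.append(sample)     # uncertain: route to a human
    return accepted, review, rejected
```

In practice, the thresholds would likely need to differ by language and dialect, since an automated scorer trained mostly on high-resource languages may itself be less reliable for low-resource ones.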

One recent empirical study found that GenAI models do well translating other languages into English but struggle to translate English into other languages (e.g., Korean) because of the lack of available quality resources. GenAI models also frequently make translation mistakes such as misgendering titles. By collecting and curating data in non-English languages more precisely, GenAI creators can avoid these common mistakes and improve performance across various languages. This approach requires hiring more diverse teams that include speakers of a variety of languages who are familiar with the linguistic subtleties that can significantly impact GenAI performance and accuracy.

The Future for Non-English Speaking GenAI Users

Although there are significant complexities in developing effective AI applications for non-English speakers, it is worth the effort because GenAI promises to transform digital interactions globally and accelerate international expansion. When speakers of specific languages are left behind, valuable insights and advancements might be missed. As Italian film director and screenwriter Federico Fellini said, “A different language is a different vision of life.” Taking steps now to guarantee greater inclusivity and effectiveness of GenAI will contribute to a rich, diverse tapestry of human knowledge and experience to inspire future advancement. It will also reduce bias and help preserve cultural heritage, which is one of the goals of the World Economic Forum for the development of new GenAI. Failing to address this growing issue could create an enormous digital divide that denies large swaths of the global population fast, easy access to accurate information.

Note: The views and opinions expressed in this article are those of the author and may not reflect those of his employer.

Lakshmikanth Alluru

Lakshmikanth Alluru is a seasoned principal product manager with more than 10 years of experience in consumer product management and 15 years in creating software products. He has a proven track record at top-tier companies including LinkedIn, Amazon, IMDb, Deloitte and IBM. Lakshmikanth has a strong background in leveraging AI/ML to drive engagement and growth. He holds an MBA from UCLA’s Anderson School of Management and a master’s degree in engineering management from Duke University. Connect with Lakshmikanth on LinkedIn.
