New benchmark challenges inclusivity of global AI language models
ALM Bench evaluates 100 languages and cultural contexts
#UAE #benchmarks - More than 70 researchers from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and other institutions worldwide have published the All Languages Matter Benchmark (ALM Bench), the most comprehensive evaluation framework to date for large multimodal models (LMMs) on world languages and cultures. ALM Bench assesses 100 languages, including underrepresented ones like Cebuano and Kyrgyz, across 22,000 cultural question-answer pairs. Proprietary models like GPT-4o scored highest at 78.8% accuracy, while the best open-source model lagged at 51.8%, exposing critical inclusivity gaps in low-resource languages.
SO WHAT? - LLMs and LMMs have made remarkable progress over the past few years and have become important tools for more and more people. However, the performance of these AI models varies greatly depending on the language used and the cultural context relevant to the end-user. Dominant languages such as Chinese and English feature heavily in model training sets, but with nearly 7,000 known human languages spoken across the world, models are far from inclusive. Inspired by this challenge, researchers set out to measure, and ultimately improve, how well multimodal LLMs perform across a wide array of languages. This much-needed research could provide a foundation for further work by researchers all around the world.
Here are some highlights about the ALM Bench research project:
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) has published the findings, data and code for the All Languages Matter Benchmark (ALM Bench) under the university’s open-source Oryx LLM initiative.
ALM Bench evaluates LMMs in 100 languages, including three Arabic dialects. The benchmark also uses a comprehensive set of 13 cultural aspects, such as heritage, customs, architecture, literature, music, and sports.
The research was carried out by more than 70 researchers from MBZUAI, Aalto University (Finland), Amazon, the Australian National University, Linköping University (Sweden), and the University of Central Florida.
ALM Bench data, code and the research paper were first published in November 2024, and the dataset was updated this month with seven new benchmark tests.
The dataset features 22,000 question-answer pairs across true/false, multiple-choice, and open-ended formats.
In benchmarking tests, proprietary models like OpenAI’s GPT-4o achieved 78.8% accuracy, compared to the best open-source model's 51.8%.
The study highlights gaps for underrepresented language families, such as Turkic and Atlantic-Congo. For example, GPT-4o scored just 50.8% in Amharic, underscoring how thinly low-resource languages are represented.
Empirical analysis of 16 vision-language models across the different question types (multiple-choice, true/false, and short and long visual question answering) found that including images boosted model accuracy by 27% across languages, underlining the importance of multimodal inputs (see the evaluation sketch after these highlights).
Sixty annotators, 80% of whom were native speakers, ensured linguistic and cultural accuracy.
The research found that models perform best on predominant language scripts such as Latin, Cyrillic, and Devanagari, and under-perform on underrepresented scripts such as Ge’ez, Lao, Oriya, and Sinhala.
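For readers who want to probe the benchmark themselves, here is a minimal sketch of how one might load the ALM Bench data from Hugging Face and compute per-language accuracy on the multiple-choice subset. The dataset ID, split, column names and the ask_model stub below are illustrative assumptions, not the project’s official evaluation harness; check the linked repository for the exact schema.

```python
# Minimal sketch: per-language accuracy on ALM Bench's multiple-choice subset.
# ASSUMPTIONS (not verified against the official repo): the dataset ID, split
# name, and column names are placeholders for illustration only.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("MBZUAI/ALM-Bench", split="test")  # assumed dataset ID/split

def ask_model(image, question, options):
    """Stub: call your LMM of choice and return one of the given options."""
    raise NotImplementedError

correct = defaultdict(int)
total = defaultdict(int)

for row in ds:
    if row["Question_Type"] != "Multiple Choice":  # assumed type label
        continue
    lang = row["Language"]                         # assumed column name
    pred = ask_model(row["Image"], row["Question"], row["Options"])
    total[lang] += 1
    correct[lang] += int(pred.strip() == row["Answer"].strip())

# Report accuracy per language, mirroring the paper's per-language breakdown.
for lang in sorted(total):
    print(f"{lang}: {correct[lang] / total[lang]:.1%} ({total[lang]} questions)")
```

The same loop could be run twice, once passing the image and once passing None, to reproduce the kind of with/without-image comparison behind the 27% figure reported above.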
ZOOM OUT - AI tools are increasingly shaping global communication and decision-making, but invariably these new tools are developed and trained in English first (or in Chinese, in the case of China’s LLM research). Other ‘high-resource’ languages are the next best represented (e.g. French, German, Spanish). Favouring the world’s most dominant languages in model development leads to biases and the exclusion of other world languages and cultures. ALM Bench highlights the challenge; meeting it, however, will require many other AI development, data and investment initiatives from different countries. As AI becomes increasingly ubiquitous and critical to countries around the world, failure to step up to this challenge risks losing some of our languages altogether.
LINKS
ALM Bench landing page (MBZUAI)
ALM Bench research paper (arXiv)
ALM Bench dataset (Hugging Face)
ALM Bench code (GitHub)
Read more about LLM leaderboard and evaluation projects:
LibrAI creates LLM leaderboard for AI safety (Middle East AI News)
Inception & MBZUAI launch new Arabic LLM leaderboard (Middle East AI News)
Comprehensive multimodal Arabic AI benchmark (Middle East AI News)
New telecom LLMs leaderboard project (Middle East AI News)
M42 delivers framework for evaluating clinical LLMs (Middle East AI News)
Arabic LLM index launched at GAIN (Middle East AI News)
New Hugging Face Open Arabic LLM Leaderboard (Middle East AI News)