MBZUAI-led research team builds Moroccan Arabic AI models
Atlas-Chat models aim to bridge the gap in Moroccan Arabic GenAI
#Morocco #France #UAE #LLMs - A team of researchers led by MBZUAI France Lab, the Paris-based lab of Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), has developed a family of large language models (LLMs) focused on the Moroccan Arabic dialect (or Darija). The researchers have open-sourced two small language models (SLMs), Atlas-Chat-2B and Atlas-Chat-9B, which were trained on Moroccan Arabic and other Arabic dialects. According to the project team, the two Google Gemma 2-based models outperform existing Arabic-focused models in Moroccan Arabic, including Jais 13B from G42 Group's Jais LLM project, boosting performance in Darija-specific NLP (Natural Language Processing) tasks by 13%*.
SO WHAT? - There are an estimated 40 million speakers of Moroccan colloquial Arabic, but although a number of projects aim to create Moroccan AI models, development for the dialect remains in its early stages. The Atlas-Chat project aims to make AI more accessible for Darija speakers by providing AI models with a deeper understanding of the dialect; making the models freely available under open-source licences; and releasing small-size models that can be used in resource-constrained environments like laptops, desktops, or personal cloud setups. As a result, more researchers in academia, business and government stand a better chance of learning about and contributing to models for Moroccan Darija.
Here are some more key points about the Atlas-Chat project:
The Atlas-Chat family of large language models is being developed by a team of researchers led by MBZUAI France Lab, the Paris-based lab of Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
Based on Google Gemma 2 models, two small language models (SLMs) have been open-sourced via the AI community platform Hugging Face: a 2 billion parameter model, Atlas-Chat-2B, and a 9 billion parameter model, Atlas-Chat-9B.
Atlas-Chat-2B is a compact model capable of generating fluent Moroccan Darija text efficiently, while the larger Atlas-Chat-9B provides more nuanced, contextually rich language generation for complex tasks. The small size of the models means that they can be installed on laptops, desktops, or personal cloud setups, making them accessible to more Darija speakers and encouraging more widespread innovation.
According to the researchers, the new Atlas-Chat models outperform both state-of-the-art and Arabic-specialised LLMs and SLMs such as LLaMA, Jais, and AceGPT. A 13% increase in performance over the larger 13-billion-parameter Jais model was recorded using the team's new evaluation suite for Darija, DarijaMMLU.
The DarijaMMLU evaluation benchmark was designed by MBZUAI France Lab to assess language model performance in Moroccan Arabic, covering both discriminative and generative tasks. The benchmark consists of 22,027 multiple-choice questions, translated from selected subsets of the Massive Multitask Language Understanding (MMLU) and ArabicMMLU benchmarks, to measure model performance across 44 subjects in Darija.
The Darija-SFT-Mixture training dataset was drawn from diverse datasets focusing on Darija and consists of approximately 450k instructions with a maximum length of 2,048 tokens. Sources include: synthetic instructions to guide the models on Moroccan culture; instruction samples created from publicly available Moroccan Arabic datasets covering translation, summarisation and sentiment analysis; plus translated and multilingual instruction-tuning datasets.
The language models were trained using eight NVIDIA A100 80GB GPUs in parallel via Amazon SageMaker on AWS.
Atlas-Chat is the first significant project to come out of the MBZUAI France Lab, which was established by MBZUAI in January 2024.
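The claim above that these small models can run on laptops, desktops, or personal cloud setups can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough weights-only estimate (parameter count times bytes per parameter); real-world usage adds overhead for activations and the KV cache, and the precision choices shown are common conventions rather than details from the Atlas-Chat release:

```python
# Rough rule of thumb: an LLM's weight memory is approximately
# parameter_count * bytes_per_parameter. Activations and KV cache add more.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weights-only memory footprint in gigabytes."""
    return num_params * bytes_per_param / 1e9

for name, params in [("Atlas-Chat-2B", 2e9), ("Atlas-Chat-9B", 9e9)]:
    fp16 = weight_memory_gb(params, 2)    # 16-bit (half-precision) weights
    int4 = weight_memory_gb(params, 0.5)  # 4-bit quantised weights
    print(f"{name}: ~{fp16:.0f} GB at fp16, ~{int4:.1f} GB at 4-bit")
```

By this estimate the 2B model fits comfortably in consumer laptop memory, especially when quantised, which is consistent with the accessibility goal the project describes.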
ZOOM OUT - The vast majority of Arabic language training data is in Modern Standard Arabic (MSA), the common written form of the language. This makes creating large language models that can understand colloquial Arabic dialects a slow and difficult process. Although many Arabic speakers prefer to query AI assistants in standard Arabic, dialectal Arabic is often mixed with standard Arabic. As voice conversations become increasingly important for generative AI applications, better capabilities in colloquial Arabic will be required. Thus, a growing number of projects are focused on developing colloquial Arabic datasets and models that process local dialects more effectively.
* Evaluation was conducted using MBZUAI France Lab's new DarijaMMLU evaluation benchmark, designed to assess LLM performance in Moroccan Darija.
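Benchmarks like DarijaMMLU, built from multiple-choice questions, are typically scored as simple accuracy: the model selects one option per question, and the score is the fraction of selections that match the answer key. The sketch below illustrates that scoring convention in general; the data format shown is hypothetical, not the benchmark's actual schema:

```python
# Generic multiple-choice benchmark scoring: accuracy is the fraction of
# questions where the model's predicted option matches the answer key.
def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy example: three questions, model gets the first two right.
score = accuracy(["A", "C", "D"], ["A", "C", "B"])
print(f"accuracy: {score:.2%}")
```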
Read more about Arabic language LLM development:
Arabic LLM index launched at GAIN (Middle East AI News)
Inception launches new JAIS Chat mobile app (Middle East AI News)
G42 launches family of 20 open-source Jais AI models (Middle East AI News)
Algerian AI researchers crowdsource local language data (Middle East AI News)
Huawei reveals 100B Arabic LLM (Middle East AI News)
SDAIA's Arabic LLM now live on watsonx (Middle East AI News)
Hugging Face introduces Open Arabic LLM Leaderboard (Middle East AI News)
LINKS
Atlas-Chat research paper (arXiv)
Atlas-Chat 2B (Hugging Face)
Atlas-Chat 9B (Hugging Face)
Darija-SFT-Mixture training dataset (Hugging Face)
DarijaMMLU evaluation benchmark (Hugging Face)