MBZUAI releases Nile-Chat: Egyptian Arabic LLM
Nile-Chat 4B & Nile-Chat 12B designed for Egyptian Arabic communication
#UAE #Egypt #LLMs – Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) has released Nile-Chat, two open-source large language models (LLMs) tailored for the Egyptian Arabic dialect. Developed by the MBZUAI France Lab in Paris, the models come in 4-billion and 12-billion parameter versions, supporting both Arabic script and Arabizi (Arabic written in Latin script). Nile-Chat aims to enhance natural language understanding and generation in one of the most widely spoken Arabic dialects, facilitating applications such as question answering, translation, and transliteration.
SO WHAT? – Egyptian Arabic is a dialect spoken by over 100 million people, roughly a quarter of the world’s Arabic speakers, so there is a significant need for AI models capable of conversing in the dialect. So far, MBZUAI’s LLM research has tackled three main varieties of Arabic: Modern Standard Arabic (MSA), the Emirati and associated Gulf dialects, and Moroccan Darija. Nile-Chat is the university’s first open-source model for the Egyptian dialect.
Key points about the new Nile-Chat models:
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) has released Nile-Chat 4B and Nile-Chat 12B, two open-source large language models (LLMs) tailored for the Egyptian Arabic dialect.
The Nile-Chat family includes two models: a 4 billion parameter model optimised for efficiency and a 12 billion parameter model offering higher capacity for complex interactions.
Nile-Chat is specifically designed to understand and generate text in Egyptian Arabic, accommodating both traditional Arabic script and Latin-based Arabizi, reflecting common usage patterns in digital communication.
Nile-Chat-12B outperforms existing models like Meta’s LLaMA and Arabic-specific models such as Saudi Arabia’s ALLaM on Egyptian dialect benchmarks, as well as on translation and transliteration tasks.
The models were trained on approximately 3.3 billion tokens of diverse Egyptian web text, 1.9 million instruction samples, and 0.2 million samples for direct preference optimisation, covering both scripts.
Nile-Chat was evaluated using benchmarks adapted for Egyptian Arabic, including EgyptianMMLU, EgyptianHellaSwag, and EgyptianPIQA, demonstrating strong performance across various tasks.
Training of the Nile-Chat models was conducted using 8 NVIDIA A100 80GB GPUs on AWS SageMaker, employing the Hugging Face Transformers library and parameter-efficient fine-tuning techniques.
The models are open-source and available for use via Hugging Face, encouraging further research and application development in Arabic dialects.
ZOOM OUT – MBZUAI’s research supports linguistic diversity in AI models, with a particular focus on the Arabic language. Earlier this year, the university released AIN, the first comprehensive bilingual Arabic-English inclusive large multimodal model (LMM). Following the debut of its CAMEL-Bench benchmark, AIN sets a new standard for Arabic-English bilingual AI, highlighting MBZUAI’s commitment to creating inclusive, culturally aware AI systems. This year’s new models build on the university’s track record of advancing foundation models tailored to regional needs and bridging the gap between Arabic users and state-of-the-art generative AI.
[Written and edited with the assistance of AI]
LINKS
Nile-Chat 4B model (Hugging Face)
Nile-Chat 12B model (Hugging Face)
Updated links 05-Jun-24
Read more Arabic language model development:
Falcon 3 LLM series gets first Arabic model (Middle East AI News)
Inception & MBZUAI share Arabic AI Leaderboards Space (Middle East AI News)
Tarjama& deploys powerful Arabic model: Pronoia v2 (Middle East AI News)
AtlasIA releases smarter, faster Moroccan darija AI models (Middle East AI News)
Mistral AI unveils Mistral Saba for Arab world (Middle East AI News)