DALLA framework aims to slash Arabic AI training costs
New pipeline democratises Arabic language model development
#Qatar #LLMs - Doha-based research institution the Arab Center for Research and Policy Studies (ACRPS) has released the DALLA framework, an open-source pipeline for building culturally aware Arabic large language models (LLMs). The framework introduces a token reuse technique that reduces training and inference costs by 50 to 75 percent, enabling smaller development teams to create socially and culturally aware Arabic AI models. The initiative includes the release of two new open-weight models, dalla-gemma-it 9B and dalla-llama-it 8B, both available via AI community Hugging Face under a Creative Commons licence.
SO WHAT? - The DALLA Framework release aims to address a critical gap in Arabic AI development by giving smaller teams and organisations the tools to build socially and culturally aware language models without the large budgets typically required. The framework’s focus on cultural sensitivity and privacy protection should encourage more researchers and development teams to build their own Arabic-language AI models.
Here are some key facts about the DALLA Framework:
The Arab Center for Research and Policy Studies (ACRPS) has released the Doha Arabic Large Language (DALLA) Framework, via its Unit for Research in Arabic Digital and Social Spheres. The framework provides an end-to-end pipeline for data processing, tokenisation, embeddings, continual pretraining and supervised fine-tuning for Arabic language models.
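The DALLA code itself is linked below; as a rough illustration of what the continual-pretraining stage of such a pipeline can look like using standard Hugging Face APIs, here is a minimal sketch. The base model ID, corpus path and hyperparameters are placeholders for illustration only and do not come from the DALLA release.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical sketch of a continual-pretraining stage; model ID, data path
# and hyperparameters are placeholders, not values from the DALLA release.
base_model = "google/gemma-2-9b-it"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# 1) Data processing: load and tokenise a curated Arabic corpus (placeholder file).
raw = load_dataset("text", data_files={"train": "arabic_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenised = raw.map(tokenize, batched=True, remove_columns=["text"])

# 2) Continual pretraining with a causal language-modelling objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(
    output_dir="dalla-cpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
)
Trainer(model=model, args=args, train_dataset=tokenised["train"],
        data_collator=collator).train()
```

Supervised fine-tuning would follow the same pattern with instruction-formatted data; this sketch covers only the pretraining step named above.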
The framework’s innovative token reuse technique achieves a 2x to 4x reduction in token count compared with the original models, delivering proportional cost savings in both training and inference whilst maintaining effective performance.
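To make the token-count claim concrete, the snippet below (not part of the DALLA release) compares how many tokens a general-purpose tokeniser and an Arabic-focused tokeniser produce for the same Arabic sentence. The tokeniser choices are illustrative assumptions; the point is that training and inference costs scale roughly with token count.

```python
from transformers import AutoTokenizer

# Illustrative comparison only: neither tokeniser below is DALLA's.
text = "مرحباً بكم في الدوحة، عاصمة دولة قطر."

general = AutoTokenizer.from_pretrained("gpt2")  # general-purpose, weak Arabic coverage
arabic = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")  # Arabic-focused

n_general = len(general(text)["input_ids"])
n_arabic = len(arabic(text)["input_ids"])

print(f"general-purpose tokeniser: {n_general} tokens")
print(f"Arabic-focused tokeniser:  {n_arabic} tokens")
# Costs scale roughly with token count, so a 2x-4x reduction in tokens
# translates into a comparable reduction in compute.
print(f"ratio: {n_general / n_arabic:.1f}x")
```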
Two demonstration models built using DALLA have been released via Hugging Face under a Creative Commons CC BY-NC 4.0 licence, showcasing how the pipeline adapts open-weight models to Arabic.
dalla-gemma-it 9B, adapted from Google Gemma 2, uses a tokeniser modified with the framework’s SentencePiece token reuse method to improve Arabic coverage without increasing vocabulary size.
dalla-llama-it 8B is based on Meta Llama 3.1; its tokeniser was adapted using the R-BPE framework, likewise improving Arabic coverage without increasing vocabulary size.
Both models were further trained on curated, culturally grounded Arabic data to support more fluent Arabic generation and better value alignment with Arab communities.
The models are smaller than frontier AI systems but maintain efficient performance through careful data processing and fine-tuning on curated Arabic content that reflects the cultural and social awareness specific to Arab communities.
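For readers who want to try the demonstration models, here is a hypothetical loading sketch using the standard transformers API. The repository ID is a placeholder, so check the Hugging Face links at the end of this article for the actual model pages.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID -- see the Hugging Face links below for the real one.
repo_id = "ACRPS/dalla-llama-it-8B"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")  # needs accelerate

prompt = "اكتب فقرة قصيرة عن أهمية اللغة العربية في الذكاء الاصطناعي."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```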
DALLA prioritises privacy protection by enabling organisations to train and operate models without exposing intellectual property, training data or operational information to large technology companies, addressing data sovereignty concerns.
The new framework lays the groundwork for an open-source pipeline that supports the development of open-weight models while preserving their original knowledge base, and enables retraining and fine-tuning of Arabic models under a CC BY-NC-SA licence. All modules are released under open licences to encourage community contribution and model improvement, with support for task-specific adaptation across the research, media, education and public service sectors.
ACRPS positions DALLA as complementary to sovereign Arabic models such as Allam, Jais and Fanar, having previously contributed to Qatar’s Fanar v1 development before resuming work on its own framework. The new framework promotes responsible, socially and culturally aware AI research, development and community building across the Arab world.
The research team acknowledges that these are early-stage models that may exhibit hallucinations and contextual errors, and is actively soliciting user feedback through satisfaction ratings to improve subsequent iterations.
ZOOM OUT - Established in 2010, the Arab Center for Research and Policy Studies (ACRPS) is an independent research institution that has built a substantial presence across the Arab world and internationally, with offices in Beirut, Tunis, Washington DC, Paris and Madrid. The organisation has published hundreds of academic titles and seven peer-reviewed journals whilst fostering collaboration with Arab and international universities. Its Unit for Research in Arabic Digital and Social Spheres focuses on building Arabic digital infrastructure by combining computational linguistics with the Center's academic resources, developing software that addresses cultural and value gaps in current generative AI models across the social sciences, humanities and political research.
[Written and edited with the assistance of AI]
LINKS
DALLA (ACRPS website - sign-up required)
dalla-gemma-it LLM (Hugging Face)
dalla-llama-it LLM (Hugging Face)
DALLA pipeline code (Github)
Read more about Arabic AI open source initiatives:
Inception, Cerebras and MBZUAI release Jais 2 Arabic LLM (Middle East AI News)
MBZUAI releases Nile-Chat: Egyptian Arabic LLM (Middle East AI News)
MBZUAI releases Arabic ‘inclusive’ multimodal AI model (Middle East AI News)
Arabic LLM index launched at GAIN (Middle East AI News)
Hugging Face introduces Open Arabic LLM Leaderboard (Middle East AI News)


