Algerian AI researchers crowdsource local language data

Project builds Daridja language data set to develop multilingual LLM

Jul 15, 2024

The Maghreb (Image credit: Image by wirestock on Freepik)

#Algeria #LLMs - Hadretna, a research project formed by Algerian-French startup Fentech (also Tamatech in Algeria) in collaboration with AI scientist Professor Merouane Debbah, is developing a large language model trained on crowdsourced data for the Algerian Arabic dialect Daridja and the Berber language Tamazight (or Amazigh). Daridja refers to a number of Maghrebi dialects of the Arabic language, which are spoken in Algeria, Morocco and Tunisia. Started this year, Hadretna aims to help drive digital inclusion by offering Daridja and Tamazight speakers access to global information via GenAI. The project has already built its first pre-trained LLM using 2 billion tokens of data and has launched a crowdsourcing initiative as part of its effort to acquire more data that can be used to train a bigger Daridja- and Tamazight-capable AI model. The research team is also planning to add voice to the model.

SO WHAT? - There have been a few early large language models developed for Algerian Daridja, but quality data sets for Arabic dialects are hard to come by, so past models have been trained on quite small data sets. As a result, Daridja speakers have no access to a reliable and fully capable public model that speaks their own language. Meanwhile, the number of speakers of the local Arabic dialect in Algeria has recently been estimated at 45 million (with a further 39 million and 12 million people speaking similar local Arabic dialects in Morocco and Tunisia, respectively). Hadretna sets out to acquire, curate and crowdsource new data sets to create billions of data tokens, in order to build a more comprehensive and useful LLM capable in Daridja and Tamazight.

A few key details about Hadretna:

Hadretna (literally 'our speaking') is a project that aims to use Generative AI to promote the use of Algerian Arabic language dialects (Daridja) and the Berber language Tamazight (or Amazigh). The project team aims to help democratise access to information for people in the region, and promote digital inclusion, allowing all Algerians to use GenAI in their own language.
Hadretna’s research team has already built a large language model (LLM) using 2 billion data tokens, including data collected online in the Arabic, Latin and Tifinagh alphabets (Tifinagh is the official script for Tamazight). However, the team now has its sights set on building the largest body of text in Daridja, which will be used to develop a bigger and more reliable LLM. Data compiled to date includes several billion lines of text in Daridja dialects.
A key element of the project is the crowdsourcing of language data, collected via the project's website. Daridja speakers can participate by translating Arabic, English or French text into Daridja and Tamazight, and by annotating their entries.
The Hadretna project ultimately aims to make all the Daridja and Tamazight language data it compiles available as open source.
Interested Daridja or Tamazight speakers can take part in the project through its website, www.hadretna.ai, by adding annotations, translations, etc.
The research project also plans to make all LLMs developed using different varieties of Daridja available as open source in the future. Version two of the Hadretna LLM is expected to be released in early 2025.

ZOOM OUT - Arabic large language model development has lagged behind the global development of English language LLMs, and a key limiting factor is the availability of quality Arabic language data, particularly in colloquial Arabic. Some Arab countries, such as Egypt, Qatar, Saudi Arabia and the UAE, have invested heavily in projects to acquire and digitise Arabic language content. However, LLMs typically require data sets consisting of tens of billions of tokens, presenting a challenge for all developers of Arabic LLMs. Dialectal data sets are scarce compared to data sets for standard Arabic (fusha). Local language data projects are therefore much needed in North Africa, where common words and phrases differ greatly from fusha and Middle Eastern Arabic dialects.

Read more about Arabic language LLM development:

Huawei reveals 100B Arabic LLM (Middle East AI News)
SDAIA's Arabic LLM now live on watsonx (Middle East AI News)
Hugging Face introduces Open Arabic LLM Leaderboard (Middle East AI News)
First LLM trained exclusively on Saudi data sets (Middle East AI News)
Will GenAI champion the Arabic language? (Middle East AI News)

Middle East AI News
