First LLM trained exclusively on Saudi data sets
Saudi tech firm Watad announces Mulhem, a bilingual Arabic-English LLM
#Saudi #LLMs - Riyadh- and Alkhobar-based AI, cybersecurity and smart city technology company Watad has announced Mulhem, the first Saudi Arabia domain-specific large language model (LLM) trained exclusively on Saudi datasets. The bilingual Arabic/English AI model was developed and trained in the Kingdom on 90 billion Arabic and 90 billion English data tokens, using Watad's own datasets curated to prioritise Saudi data points and context. According to university lab tests, the seven billion parameter model outperforms all comparable Arabic language capable models.
SO WHAT? - Despite the global hype around Generative AI and large language models, to date there has been relatively little news about Arabic language capable LLMs. Beyond the high compute costs associated with AI model development and training, the limited availability of quality Arabic language training data is a significant obstacle to the development of Arabic models. This has both delayed and reduced the number of Arabic LLMs in development, so the arrival of a new AI model trained on region-specific Arabic language data is a rare piece of news.
Here are a few key details:
The Mulhem 7 billion parameter large language model (LLM) was developed and trained by Watad Energy & Communications in Saudi Arabia, on an NVIDIA high performance computing system, using data curated and tokenised by the company. 'Mulhem' is an Arabic word for 'inspiring'.
Mulhem is a proprietary model, although Watad says that it is exploring ways of providing more open access to the LLM, potentially including an open-source release. Over the coming weeks, Watad will open up the API to developers and researchers to test Mulhem and its retrieval system on a trial basis.
With a training foundation of 90 billion Arabic and 90 billion English data tokens, the model is bilingual and linguistically versatile. This dual-language capability means Mulhem can serve a wide range of commercial or public applications requiring both Arabic and English outputs.
Mulhem has been trained on a diverse dataset of more than 70,000 Saudi-specific Q&A data points and over 500,000 Arabic single-turn (i.e. one question and answer), multi-turn (i.e. a conversation with multiple questions and answers) and preference data points, along with specialised datasets for context retrieval and offline data systems. This rich Saudi-centric dataset supports accuracy and relevance in its responses and insights, specifically tailored to local context.
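To make the three instruction-tuning data types concrete, here is a minimal sketch of what single-turn, multi-turn and preference records typically look like in JSONL form. The field names and example content are hypothetical; Watad has not published its actual dataset schema.

```python
import json

# Hypothetical examples of the three record types described above.
# Field names ("prompt", "messages", "chosen", "rejected") follow common
# open-source conventions, not Watad's (unpublished) schema.
single_turn = {
    "prompt": "What is the capital of Saudi Arabia?",
    "response": "Riyadh is the capital of Saudi Arabia.",
}

multi_turn = {
    "messages": [
        {"role": "user", "content": "When is Saudi National Day?"},
        {"role": "assistant", "content": "It is celebrated on 23 September."},
        {"role": "user", "content": "Is it a public holiday?"},
        {"role": "assistant", "content": "Yes, it is a public holiday in the Kingdom."},
    ]
}

# Preference pairs record a better and a worse answer to the same prompt,
# which is the input format preference-optimisation methods such as DPO expect.
preference = {
    "prompt": "Summarise Vision 2030 in one sentence.",
    "chosen": "Vision 2030 is Saudi Arabia's plan to diversify its economy beyond oil.",
    "rejected": "It is a plan.",
}

# Stored one record per line (JSONL), a common format for training corpora
for record in (single_turn, multi_turn, preference):
    print(json.dumps(record, ensure_ascii=False))
```

The preference records matter because they feed directly into the Direct Preference Optimisation step mentioned later in the article.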
According to tests carried out by Watad, the seven billion parameter Mulhem outperforms all other 7B Arabic language capable AI models.
The new LLM was both trained and developed in Saudi Arabia, using the latest AI research and development practices, including Supervised Fine-tuning and Direct Preference Optimisation. The combination of best practices and local knowledge supports the alignment of Watad's model with the specific needs and nuances of the Saudi market, plus the rich cultural context of the Arab world.
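Direct Preference Optimisation (DPO) trains a model to prefer the "chosen" answer over the "rejected" one in each preference pair, relative to a frozen reference model. A minimal sketch of the per-pair DPO loss, written in plain Python (real implementations operate on batched tensors of token log-probabilities):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the policy being trained and under a frozen reference model.
    beta controls how strongly the policy may deviate from the reference.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)): equals log(2) when the policy shows no
    # preference, and shrinks as it favours the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: policy assigns the chosen answer a higher log-prob than the
# rejected one, relative to the reference, so the loss falls below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Supervised Fine-tuning precedes this step: the model is first trained on the single-turn and multi-turn Q&A data with an ordinary next-token loss, and DPO then refines its behaviour using the preference pairs.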
With its own proprietary Arabic data tokeniser and retrieval model, Watad will be able to offer Retrieval Augmented Generation (RAG) and fine-tune Mulhem for customers using their own data for commercial use cases.
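The RAG pattern works by retrieving relevant passages from a customer's own documents and injecting them into the prompt, so the model can ground its answer in data it was never trained on. A toy sketch of the pattern, using simple word-overlap ranking in place of Watad's proprietary retrieval model:

```python
# Minimal RAG sketch. Watad's retrieval model and Arabic tokeniser are
# proprietary; this toy retriever ranks documents by word overlap instead.
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    # Retrieved customer documents are injected into the prompt so the LLM
    # can ground its answer in private data outside its training corpus.
    context = "\n".join(retrieve(query, documents, k=2))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Invoice 1042 was issued to Alkhobar Logistics on 12 March.",
    "The Riyadh office opens at 8am on weekdays.",
    "Employee handbook: annual leave is 30 days.",
]
print(build_prompt("When does the Riyadh office open?", docs))
```

In production systems the word-overlap ranking would be replaced by an embedding-based retriever, which is where a proprietary Arabic tokeniser and retrieval model give Watad an edge for local-language content.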
ZOOM OUT - In contrast to the overwhelming volume of news about global large language models, advances in Generative AI and next generation GenAI applications, there has been little news about Arabic language LLMs. The biggest Arabic LLM news from the past year was the release of G42's Jais 13B Arabic-English LLM as open source in August (followed by Jais 30B in November), which is now available via Hugging Face and Microsoft Model-as-a-service. This provided some hope to Arab developers looking for an Arabic language model to use. However, both end-users and software developers in the region lack choice and so, in the short term, all news is good news!
IMO - With a number of LLMs now in development across the Middle East, end-users and software developers should be able to look forward to a not-too-distant future where they have a choice of commercially available Arabic language AI models. As evidenced by Watad's new LLM, there is also opportunity for developers to focus development and training on the domain-specific needs of countries, regions, industries and applications. Therefore, although there is room for a few big LLMs for ubiquitous use, we can expect multiple Arabic language LLMs to thrive. Current predictions are that we could see this happen during the next 12-18 months.
Error corrected 30-May-24
Read more about LLM development:
Could Falcon become the Linux of AI? (Middle East AI News)
TII announces Falcon 180B LLM (Middle East AI News)
Will GenAI champion the Arabic language? (Middle East AI News)
Can UAE-built Falcon rival global AI models? (Middle East AI News)