Will GenAI champion the Arabic language?
The powerful new Jais LLM could mark a turning point for Arabic AI
Jais 13B, a new, powerful and high-quality Arabic large language model, was launched under a fully open source licence on Wednesday. It's another huge win for Abu Dhabi's fast-growing AI research and development ecosystem, but could this LLM prove to be a turning point for Arabic AI software?
Named after the United Arab Emirates' highest peak, Jebel Jais, the new AI model is well-positioned to bring the advantages of generative AI to the Arabic-speaking world, firstly because it is a high-quality model, and secondly because its open source licence allows it to be used equally by academia, government, business and individual AI professionals.
This is such a big deal because the majority of LLMs so far have been trained primarily on English language data, with the intent that their primary input and output language will be English. Developers, software companies, startups, researchers and coders have been able to download and use a variety of English language LLMs to perform tasks, create applications and integrate with other software. Among other things, this has resulted in an enormous wave of innovation, creating new use cases, new apps and new startup ventures, all relying on the foundational infrastructure created by AI labs with access to the latest and most expensive high performance computers.
What Arabic language users need is access to LLMs that can process Arabic language content accurately and efficiently. However, training LLMs in the Arabic language is not easy.
As the name implies, training large language models requires large volumes of data, and most models are trained on English language content. Training AI models on Arabic content is problematic. To begin with, less than one percent of all pages on the Internet are in Arabic, only a relatively small percentage of Arabic academic and literary content has made its way online, and websites from the region are routinely created in English only. It is also time-consuming and cost-prohibitive to scan large volumes of offline content to use as training data.
Furthermore, Arabic is a complex language, with its own right-to-left script, multiple different letter forms, different written and spoken forms, many different spoken dialects, and a vocabulary that varies greatly across the Arab world.
As a result, the new Jais 13B was trained only after much had been learned through trial and error, and using a wide variety of Arabic language sources. Additionally, in order to train the AI model with enough data, Arabic language content had to be supplemented with English content. The model's final production training cycle included not only 116 billion Arabic tokens, but also 279 billion English tokens of data. The result is a fully bilingual large language model that can provide the foundation for a variety of Arabic/English digital services.
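To put those token counts in proportion, the short Python sketch below adds them up and works out each language's share of the training mix. The two figures come from the paragraph above; the function and variable names are illustrative only.

```python
# Token counts for the final production training cycle, as quoted above.
ARABIC_TOKENS = 116_000_000_000
ENGLISH_TOKENS = 279_000_000_000


def language_mix(arabic: int, english: int) -> dict:
    """Return the total token count and each language's share of it."""
    total = arabic + english
    return {
        "total_tokens": total,
        "arabic_share": arabic / total,
        "english_share": english / total,
    }


mix = language_mix(ARABIC_TOKENS, ENGLISH_TOKENS)
print(f"Total: {mix['total_tokens'] / 1e9:.0f} billion tokens")
print(f"Arabic: {mix['arabic_share']:.1%}, English: {mix['english_share']:.1%}")
```

On these numbers the run totals 395 billion tokens, of which a little under 30 percent is Arabic.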
This was all made possible by the considerable resources of Abu Dhabi's artificial intelligence technology pioneer G42. The project was the result of a collaboration between the group's applied research arm Inception (formerly Inception Institute of Artificial Intelligence), Mohamed bin Zayed University of Artificial Intelligence, and AI chip maker Cerebras Systems. The model's training, fine-tuning and evaluation were run on the $100 million, 4 exaFLOP, 54 million core Condor Galaxy 1, which was revealed by Cerebras and G42 Cloud last month.
Jais 13B incorporates ALiBi (Attention with Linear Biases) position embeddings, which enable the model to extrapolate to longer inputs, providing better context handling and accuracy. Other techniques used include SwiGLU (Swish-Gated Linear Unit) and maximal update parameterisation, in order to improve the model's training efficiency and accuracy. In consequence, the LLM significantly outperforms other Arabic large language models.
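For readers who want to see what these two techniques look like in practice, here is a minimal Python sketch. It follows the generally published definitions of ALiBi (a per-head linear penalty on the distance between query and key positions) and SwiGLU (a Swish-activated gate multiplied by a second projection). It is not taken from the Jais code base: the function names, the power-of-two head count and the omission of the linear projections are all simplifying assumptions, and maximal update parameterisation is not shown.

```python
import math


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head ALiBi slopes: 2^(-8h/n) for heads h = 1..n.

    Assumption: num_heads is a power of two, the simple case of the scheme.
    """
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Causal bias matrix for one head, added to the attention scores.

    Entry [i][j] is slope * (j - i) for keys at or before the query, so
    more distant keys are penalised linearly. Future keys get -inf so
    that a softmax would mask them out.
    """
    return [
        [slope * (j - i) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def swish(x: float) -> float:
    """Swish activation with beta = 1, i.e. x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate: list[float], value: list[float]) -> list[float]:
    """SwiGLU gating: Swish(gate) multiplied element-wise by value.

    In a real feed-forward layer, gate and value are two separate linear
    projections of the same input; those projections are omitted here.
    """
    if len(gate) != len(value):
        raise ValueError("gate and value must have the same length")
    return [swish(g) * v for g, v in zip(gate, value)]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
gated = swiglu([0.0, 1.0], [2.0, 2.0])
```

Because the ALiBi penalty depends only on the distance between positions, rather than on a learned embedding for each position, the same rule applies unchanged to sequences longer than those seen in training, which is what makes the extrapolation to longer inputs possible.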
The end-to-end development of Jais 13B took place over just a few months, culminating in 'production training' over a 21-day period. That's fast!
Now that Jais 13B has been open sourced under an Apache 2.0 licence (the model is available via Hugging Face), it provides the opportunity for AI and data professionals, developers and researchers all over the world to create their own use cases using the model.
In the meantime, the Jais research team has already been working with different parts of the G42 Group on use cases. It has also initiated discussions with Jais launch partners, such as the UAE Ministry of Foreign Affairs, the Ministry of Industry and Advanced Technology, the Department of Health – Abu Dhabi, the Abu Dhabi National Oil Company (ADNOC), Etihad Airways, telecom giant e&, and FAB (First Abu Dhabi Bank).
Since the Jais project team has access to the Condor Galaxy 1 supercomputer, it will be able to turn around use cases very fast. So, despite its open source credentials, it seems that G42 customers will enjoy a distinct advantage when it comes to developing apps and solutions based on Jais 13B's foundational model.
The researchers expect automation, productivity and content generation to drive early benefits for customers, while the biggest users of Jais are predicted to be in the medical, financial and energy sectors. There are also plans to continue development, delivering multi-modal functionality at a later date.
However, the arrival of Jais 13B also provides a broad opportunity for application developers to use the foundation model to create new, advanced Arabic language services. It's inevitable that many early attempts at creating new services will mimic concepts that have been developed for English language users on OpenAI's ChatGPT, but Jais will also encourage the development of new concepts specific to the Arabic-speaking world's companies, consumers, cultures and locations.
Today, many Arabic language users find themselves forced to adopt English language apps and digital services, due to the lack of Arabic language alternatives. The fast rise of generative AI and the dominance of the English language in these new technologies risk leaving Arabic digital development far behind, as developers are deprived of high-quality Arabic AI platforms.
Jais 13B and the models that will follow it could fuel the next generation of Arabic software development, helping to enable cutting-edge Arabic digital services for the world's 400 million Arabic language speakers. By inspiring a new wave of advanced Arabic services, generative AI could even help to make the Arabic language stronger.