G42 relaunches Inception with family of 20 open-source Jais AI models
Biggest and most comprehensive release of the Arabic LLM series to date
#UAE #LLMs - Abu Dhabi-based artificial intelligence powerhouse G42 has relaunched Inception, the group’s entity focused on developing advanced AI models and applications, alongside a family of 20 Arabic-centric Jais large language models. Inception, which was previously merged with G42 Cloud and Injazat in October to form Core42, will now drive G42’s AI model development, headed up by Andrew Jackson, who also served as its CEO last year.
All of the Jais large language models have been released under full open-source licences for research, development and commercial use. The 20 models span eight sizes, ranging from 590M to 70B parameters, trained on up to 1.6T tokens of Arabic, English, and code data. Every pre-trained model in the series has an instruction fine-tuned (*-chat) counterpart for dialog, tuned using a curated mix of Arabic and English data.
SO WHAT? - G42’s announcement of a family of 20 Jais LLMs is the largest single release of production-ready models in the Middle East and North Africa, and the most comprehensive family of Arabic-centric LLMs to have been released anywhere. The release bolsters the Jais models’ standing as a popular Arabic LLM family, giving those building downstream applications the flexibility they need, along with confidence that the models will continue to be supported and developed, with Inception acting as a focused developer and curator of Jais. Furthermore, the new release underscores G42’s commitment to open source, and so adds weight to the entire Abu Dhabi open-source AI ecosystem.
More details on this announcement:
G42 has announced the most extensive family of large language models (LLMs) in the Middle East and North Africa, releasing 20 open-source, Arabic-centric Jais LLMs in sizes from 590M to 70B parameters, trained on up to 1.6T tokens of Arabic, English, and code data.
The new family of Jais LLMs has been released under the Inception brand, the company that will drive all future development of Jais.
Inception, which was merged with G42 Cloud and Injazat to form Core42 last October, has now been relaunched as a separate entity.
The family contains two variants of foundation models:
Models pre-trained from scratch (jais-family-*).
Models pre-trained adaptively from Llama-2 (jais-adapted-*).
According to G42, Jais 70B delivers Arabic-English bilingual capabilities at an unprecedented size and scale for the open-source community. The new model is better able to handle complicated and nuanced tasks, and better able to process complex datasets.
Jais 70B was developed using continuous training, the further training of an already pre-trained model, on 370 billion tokens of data, 330 billion of which were Arabic tokens. According to G42, this is the largest Arabic dataset ever used to train an open-source foundational model (a minimal code sketch of continued training follows this list).
The new LLM family includes 590M, 1.3B, 2.7B, 6.7B, 7B, 13B, 30B and 70B sizes.
The 20 models comprise pre-trained base models and their instruction fine-tuned (*-chat) counterparts for dialog, tuned using a curated mix of Arabic and English instruction data.
The Jais development team trained an expanded tokenizer based on the Llama-2 tokenizer to improve Arabic text processing efficiency, doubling the model’s base vocabulary. This expanded tokenizer was used to train alternative 70B, 30B and 13B versions of Jais (a simplified vocabulary-expansion sketch follows this list).
Jais was trained on the Condor Galaxy supercomputer cloud network developed by Cerebras Systems and G42.
The new Jais models are available under an Apache 2.0 open-source licence on Hugging Face.
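For technically minded readers, the sketches below illustrate the kinds of workflows described in the list above. First, a minimal continued pre-training loop using the Hugging Face Trainer API. This is an illustration only, not G42’s actual pipeline: the base checkpoint, corpus file, sequence length and hyperparameters are all assumptions.

```python
# Minimal sketch of continued pre-training of an existing checkpoint.
# NOT G42's pipeline: base model, corpus file and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-2-7b-hf"  # hypothetical starting checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical raw-text corpus; the real Jais 70B run used ~370B tokens, mostly Arabic.
raw = load_dataset("text", data_files={"train": "arabic_corpus.txt"})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continued-pretrain",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1.5e-4,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```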
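Second, a simplified view of tokenizer expansion: learning additional Arabic subword tokens and growing the model’s embedding matrix to match. The Jais team’s exact recipe is not detailed in this announcement; the base checkpoint, corpus and vocabulary size here are assumptions, and adding tokens this way is only a stand-in for a proper BPE-level vocabulary merge.

```python
# Simplified sketch of extending a Llama-2 tokenizer with Arabic tokens and
# resizing the model's embeddings to match. Illustrative only, not the Jais recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # hypothetical base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical Arabic corpus used to learn new subword tokens.
def arabic_lines():
    with open("arabic_corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line

# Learn a new vocabulary the same size as Llama-2's (32K), so merging it roughly
# doubles the combined vocabulary, mirroring what the article describes.
arabic_tok = tokenizer.train_new_from_iterator(arabic_lines(), vocab_size=32_000)

# Add only the tokens that the base vocabulary does not already contain.
# (A production pipeline would merge at the BPE level rather than add whole tokens.)
new_tokens = [t for t in arabic_tok.get_vocab() if t not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_tokens)

# The embedding matrix must grow to cover the new token ids before any training.
model.resize_token_embeddings(len(tokenizer))
print(f"Vocabulary expanded to {len(tokenizer)} tokens")
```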
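Finally, a minimal example of loading one of the released chat models from Hugging Face for inference. The repository id below is an assumption; check the Jais Family model page (linked below) for the exact names, and note that chat variants may expect the prompt template documented on their model cards.

```python
# Minimal sketch of loading a released chat model from Hugging Face.
# The repository id is an assumption; see the Jais Family page for exact names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "inceptionai/jais-family-6p7b-chat"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # requires the accelerate package
    trust_remote_code=True,     # Jais repos ship custom modeling code
)

prompt = "ما هي عاصمة الإمارات؟"  # "What is the capital of the UAE?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```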
ZOOM OUT - Demand for Arabic-capable large language models is growing fast, and the past year has seen the release of Arabic-centric models including AceGPT, Huawei Pan Gu and SDAIA’s ALLaM. However, the needs of organisations and developers vary widely. Until now, only a handful of models of different sizes, each with a limited number of versions and hosted on a variety of platforms, have been available, which has made it difficult to compare models against one another. Meanwhile, an organisation’s ability to use a given model depends on its own environment, its cloud services providers, the level of its AI expertise and its budget.
IMO - The new open-source family of Jais models is the most comprehensive initiative to date to democratise access to Arabic LLMs. Following this release, end-users and developers will be able to access a wide variety of Arabic-centric models, both via Hugging Face and, increasingly, as services within different cloud development environments. This could provide the flexibility and open access that the market needs and so accelerate adoption of Jais models across multiple sectors.
Read more about Inception:
G42 merges Injazat with G42 Cloud and Inception (Middle East AI News)
Links
Jais Family model page (Hugging Face)