SDAIA's Arabic LLM now live on watsonx
SDAIA and IBM announced Arabic large language model at Think Boston
#USA #LLMs - Saudi Data and Artificial Intelligence Authority (SDAIA) and IBM have announced the launch of SDAIA’s open-source Arabic large language model, ALLaM, on IBM’s enterprise AI and data platform, watsonx. Government and commercial users can access the model via the watsonx.ai studio and use the platform's AI capabilities to train, tune, and deploy ALLaM. The model has an optional AUP (acceptable use policy), and clients can use watsonx's industry-leading governance capabilities to enable responsible deployment under IBM’s ethical AI guidelines. The news was announced at IBM’s Think Boston 2024 event.
SO WHAT? - Saudi Data and Artificial Intelligence Authority (SDAIA) launched a pilot version of the Arabic chat app for the ALLaM Arabic large language model (LLM) in May 2023. However, this week's announcement with IBM marks the first time that the ALLaM model can be downloaded and used by other organisations. The availability of ALLaM via IBM's watsonx gives government organisations in Saudi Arabia a government-created and government-backed Arabic large language model on which to build their own LLMs and GenAI applications.
Here are a few key details:
The new ALLaM 13B model was announced by Essam Al-Waqeet, director of the National Information Center at the SDAIA, during his speech at IBM’s Think Boston 2024 event. According to automatic benchmark and multi-turn benchmark tests conducted by SDAIA and shared via watsonx, the new model outperforms all other Arabic-language LLMs of a similar parameter size.
The ALLaM 13 billion parameter large language model (ALLaM-1-13b-instruct on watsonx) was developed by the National Center for Artificial Intelligence (NCAI) at the Saudi Data and AI Authority (SDAIA), and is the first of a series of models to be made available for open access. The ALLaM-1-13b-instruct published via watsonx is based on SDAIA’s ALLaM-13b-base foundation model, which is based on Meta’s Llama 2.
ALLaM 13B was pre-trained on a total of 3 trillion tokens in Arabic and English, including the tokens seen from its initialisation. The Arabic dataset contains 500 billion tokens after cleaning and deduplication. Additional data was collected from open-source collections and web crawls.
The ALLaM-1-13b-instruct foundation model was fine-tuned with a curated set of 4 million Arabic and 6 million English prompt-and-response pairs.
The model’s bilingual capability means that it can be used for a wide range of applications that require both Arabic and English outputs.
SDAIA is developing a family of Arabic language-trained ALLaM models using two main paths: performing additional training on open-source models, and pretraining models from scratch.
ALLaM 13B was initialised from Llama 2 architecture weights (2 trillion tokens from Llama 2, plus 1 trillion additional Arabic and English tokens). The training codebase was built on NVIDIA/Megatron-LM, which enables the training of large transformer language models at scale.
ALLaM is available via IBM’s watsonx platform under a royalty-free SDAIA licence, coupled with a Llama 2 Community licence. There are no restrictions placed on commercial use beyond those in the Llama 2 licence, allowing both commercial and government organisations to use and build on the model.
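The token and fine-tuning figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the numbers are the rounded figures reported in this article, not values taken from SDAIA's own documentation.

```python
from fractions import Fraction

# Rounded figures as reported in the article (assumed exact for illustration).
LLAMA2_TOKENS = 2_000_000_000_000      # tokens seen by the Llama 2 initialisation
ADDITIONAL_TOKENS = 1_000_000_000_000  # additional Arabic and English tokens
ARABIC_PAIRS = 4_000_000               # Arabic prompt-and-response pairs
ENGLISH_PAIRS = 6_000_000              # English prompt-and-response pairs


def share(part: int, whole: int) -> Fraction:
    """Return part/whole as an exact fraction."""
    return Fraction(part, whole)


# Total pre-training budget, including tokens seen from initialisation.
total_tokens = LLAMA2_TOKENS + ADDITIONAL_TOKENS
additional_share = share(ADDITIONAL_TOKENS, total_tokens)

# Language mix of the instruction fine-tuning set.
total_pairs = ARABIC_PAIRS + ENGLISH_PAIRS
arabic_pair_share = share(ARABIC_PAIRS, total_pairs)

print(f"Total pre-training tokens: {total_tokens:,}")
print(f"Share added on top of Llama 2: {float(additional_share):.1%}")
print(f"Fine-tuning pairs: {total_pairs:,}")
print(f"Arabic share of fine-tuning pairs: {float(arabic_pair_share):.1%}")
```

On these figures, the 2 trillion Llama 2 tokens plus 1 trillion additional tokens match the 3 trillion total quoted for pre-training, and Arabic accounts for two-fifths of the 10 million fine-tuning pairs.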
Updated 30-Jun-24