#UAE #Kazakh – G42’s applied research arm Inception and Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) have announced the launch of SHERKALA, a new Kazakh large language model (LLM). Designed to serve over 13 million Kazakh speakers, Sherkala is an 8-billion-parameter model trained on 45 billion words across Kazakh, English, Russian, and Turkish. Built on Meta’s Llama 3.1 with a 25% tokenizer expansion for improved Kazakh fluency, the model was trained on Condor Galaxy, one of the world’s most powerful AI supercomputers, developed by G42 and Cerebras. Sherkala is now available as an open-source model.
SO WHAT? – The new Kazakh large language model is the second major non-Arabic language model to come out of Abu Dhabi’s AI R&D ecosystem. Last September, G42 launched a 13-billion-parameter Hindi LLM in India, also developed by Inception and MBZUAI. Both Sherkala and the Hindi model were built in the context of bilateral agreements: the UAE’s Ministry of Investment signed agreements with the governments of India and Kazakhstan early last year covering collaboration and investment in artificial intelligence and data infrastructure.
Here are some key points regarding this announcement:
SHERKALA, a new Kazakh large language model, has been launched by G42’s applied research arm Inception and the Institute of Foundation Models at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
Built on Llama 3.1 and trained on 45 billion words across Kazakh, English, Russian, and Turkish, the 8-billion-parameter model will support some 13 million Kazakh speakers.
The new model was developed with a 25% tokeniser expansion to enhance Kazakh language understanding and text generation. Sherkala punches above its weight, surpassing some 70-billion-parameter LLMs in Kazakh-language generative capability.
Sherkala was trained on Condor Galaxy, one of the world’s most powerful supercomputers optimised for AI, developed by G42 and Cerebras.
The new Kazakh LLM outperforms ISSAI’s KazLLM 8B model (launched in December) on a number of benchmarks, including MMLU (Massive Multitask Language Understanding).
Sherkala is now available as an open-source model on Hugging Face, for researchers, enterprises, and developers.
ZOOM OUT – In December 2024, the Institute of Smart Systems and Artificial Intelligence (ISSAI) at Kazakhstan's Nazarbayev University announced KazLLM, Kazakhstan’s first large language model, marking a significant milestone for the country's AI R&D. A team of machine learning engineers and skilled linguists began work on ISSAI's Kazakh model in April 2024. About 150 billion tokens were used in training, drawn mainly from open-source datasets. Based on the Llama 3.1 model, ISSAI developed two versions: KazLLM 8B and KazLLM 70B.
LINKS
SHERKALA 8B Chat (Hugging Face)
KazLLM 8B (Hugging Face)
KazLLM 70B (Hugging Face)
Read more about other Inception-MBZUAI language projects:
MBZUAI open-sources NANDA LLM (Middle East AI News)
G42 launches new Hindi LLM in Mumbai (Middle East AI News)