LLM360 project empowers pre-trainers with huge production-ready dataset
LLM360's TxT360 currently ranks #1 on Hugging Face, beating 220k datasets
#LLM360 #PreTraining - LLM360, an open-source large language model project created by Petuum and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), has open-sourced a massive, fully-cleaned pre-training dataset. Called TxT360 (Trillion eXtracted Text), the dataset was created by combining and deduplicating 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains. With a focus on clean data and precise control, TxT360 equips LLM pre-trainers with a rich dataset ready for immediate use, along with an upsampling recipe to create more than 15 trillion tokens, enough for training the largest models. Outperforming comparable datasets on several key metrics, it also stores rich metadata enabling precise control over data distribution, empowering pre-trainers to explore more advanced weighting techniques.
SO WHAT? - Large language models require trillions of tokens of training data, which developers typically assemble by combining many different datasets. Datasets built for training usually focus on a single source, such as crawled websites, code bases, papers, or audio/video transcriptions, but often contain data that overlaps with other datasets. Combining datasets, removing duplicates, and eliminating poor-quality or erroneous data is both time-consuming and technically challenging.
LLM360 designed a comprehensive data processing pipeline to create the first integrated, deduplicated and cleaned pre-training dataset combining the sources most commonly used by developers. The resulting corpus of ~5 trillion unique tokens has been open-sourced, together with the detailed step-by-step procedure used to create it. So, TxT360 is an enormous gift for LLM developers!
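For developers who want a first look at the corpus, a minimal sketch of streaming a few records from Hugging Face with the `datasets` library might look like the following (the split and the "text" field name are assumptions based on typical Hugging Face dataset layouts, not confirmed details of TxT360):

```python
# Minimal sketch: stream a few TxT360 records from Hugging Face.
# The split and the "text" field name are assumptions and may
# differ from the actual dataset layout.
from datasets import load_dataset

ds = load_dataset("LLM360/TxT360", streaming=True, split="train")

for i, record in enumerate(ds):
    print(record["text"][:200])  # preview the first 200 characters
    if i >= 2:
        break
```

Streaming avoids downloading the full multi-terabyte corpus just to inspect it.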
Here are some of the key details of the TxT360 project:
The LLM360 open-source research project has released TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g. legal documents, literature).
LLM360 is an open-source large language model project, community, and growing framework of best practices and resources created by Petuum and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), now falling under the university's Institute of Foundation Models. The project shares not only the models and datasets that its researchers develop, but also detailed insights and methodologies.
The TxT360 dataset was created to support the pre-training of large language models by providing a massive amount of high-quality, diverse data. LLM360 researchers designed a comprehensive data processing pipeline to clean data and remove duplicates from 99 CommonCrawl snapshots (data collected from the Internet) plus 14 other high-quality sources, such as FreeLaw, PG-19, StackExchange and Arxiv.
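LLM360's actual pipeline is far more elaborate, but the core idea of global exact deduplication across many sources can be sketched in a few lines. The hashing and normalization choices below are illustrative assumptions, not the project's documented method:

```python
# Illustrative sketch of global exact deduplication across many
# document sources; not LLM360's documented pipeline.
import hashlib

def fingerprint(text: str) -> str:
    """Hash a lightly normalized document so identical copies
    from different snapshots collapse to one fingerprint."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(documents):
    """Yield each unique document once, across all sources."""
    seen = set()
    for doc in documents:
        fp = fingerprint(doc["text"])
        if fp not in seen:
            seen.add(fp)
            yield doc

corpus = [
    {"source": "commoncrawl-2023-50", "text": "Same article text."},
    {"source": "commoncrawl-2024-10", "text": "Same article  text."},
    {"source": "freelaw", "text": "A court opinion."},
]
print([d["source"] for d in dedupe(corpus)])
# ['commoncrawl-2023-50', 'freelaw']
```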
Attention was also given to data weighting (i.e. knowing which data is more important, or essential, for creating a good dataset). Using a simple but effective upsampling recipe, the team created a 15+ trillion-token corpus that outperforms the prior best dataset, FineWeb 15T, on several key metrics.
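The article does not spell out the recipe, but upsampling in this context generally means repeating higher-quality sources some number of times when assembling the training mix, so a ~5 trillion-token unique corpus can yield 15+ trillion training tokens. The weights below are invented placeholders for illustration, not TxT360's published values:

```python
# Illustrative upsampling: repeat each source according to a weight
# to grow a ~5T-token unique corpus toward a 15T-token training mix.
# All numbers are invented placeholders, not TxT360's recipe.
source_tokens = {          # unique tokens per source (illustrative)
    "commoncrawl": 4.5e12,
    "arxiv": 0.1e12,
    "stackexchange": 0.05e12,
}
upsample_weights = {       # repetitions per source (assumed)
    "commoncrawl": 3.0,
    "arxiv": 6.0,
    "stackexchange": 6.0,
}

mix = {s: n * upsample_weights[s] for s, n in source_tokens.items()}
total = sum(mix.values())
print(f"training-mix tokens: {total / 1e12:.1f}T")
for s, n in mix.items():
    print(f"  {s}: {n / 1e12:.2f}T ({n / total:.0%})")
```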
Rich metadata stored in TxT360 enables precise control over data distribution, empowering pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets.
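As a sketch of what metadata-driven control could look like in practice, the snippet below builds a custom sampling distribution from per-document metadata. The field names (`source`, `quality_score`) are hypothetical, chosen for illustration rather than taken from TxT360's actual schema:

```python
# Illustrative sketch: use per-document metadata to build a custom
# sampling distribution. Field names are hypothetical, not TxT360's
# actual schema.
import random

def sample_weight(meta: dict) -> float:
    """Assign a sampling weight from metadata: favour curated
    sources and higher quality scores."""
    weight = 2.0 if meta["source"] != "commoncrawl" else 1.0
    return weight * meta.get("quality_score", 1.0)

docs = [
    {"text": "web page", "source": "commoncrawl", "quality_score": 0.6},
    {"text": "court opinion", "source": "freelaw", "quality_score": 0.9},
    {"text": "paper", "source": "arxiv", "quality_score": 0.8},
]
weights = [sample_weight(d) for d in docs]
batch = random.choices(docs, weights=weights, k=2)  # weighted draw
print([d["source"] for d in batch])
```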
LLM360 has documented all the detailed steps, the reasoning behind decisions made during development, detailed statistics, code (soon to follow), analysis results and more, as a useful resource for LLM developers. It comes with the most detailed technical blog post on pre-training dataset curation so far.
TxT360 now ranks number one in the Hugging Face list of more than 220,000 datasets (as of 17-Oct-24).
ZOOM OUT - Training large language models can be a costly and complex endeavour, with many unexpected issues. Formed in December 2023, LLM360 aims to help developers create LLMs more easily, faster and more cheaply by providing them with data, code, detailed insights, methodologies and best practices. It's an ambitious project to enable open-source AI by building an evolving framework and deep resources for developers globally, one that becomes more valuable as time goes on. To date, LLM360 has released 13 open-source models across four LLM collections, together with model checkpoints, code, data and insights into model development. Its state-of-the-art K2 65B LLM outperforms Meta's Llama 2, and the LLM360 team aims to leverage the new TxT360 to create more powerful future models in the K2 series.
LINKS
TxT360 (Hugging Face)
LLM360 Community (Hugging Face)
LLM360 Code Repositories (GitHub)
Read more about LLM360:
Powerful open-source K2-65B LLM costs 35% less to train (Middle East AI News)
New framework for open-source LLMs (Middle East AI News)