Kavli Affiliate: Yi Zhou
| First 5 Authors: Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli
| Summary:
Data quantity and quality play a vital role in determining the performance of
Large Language Models (LLMs). High-quality data, in particular, can
significantly boost the LLM’s ability to generalize across a wide range of
downstream tasks. Large pre-training datasets for leading LLMs remain
inaccessible to the public, whereas many open datasets are small in size (less
than 5 trillion tokens), limiting their suitability for training large models.
In this paper, we introduce GneissWeb, a large dataset yielding around 10
trillion tokens that caters to the data quality and quantity requirements of
training LLMs. Our GneissWeb recipe, used to produce the dataset, consists of
sharded exact sub-string deduplication and a judiciously constructed ensemble
of quality filters (an illustrative sketch follows this entry). GneissWeb
achieves a favorable trade-off between data
quality and quantity, producing models that outperform models trained on
state-of-the-art open large datasets (5+ trillion tokens).
We show that models trained on the GneissWeb dataset outperform those trained
on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed
on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for
pre-training dataset evaluation. When the evaluation set is extended to 20
benchmarks (both zero-shot and few-shot), models trained using GneissWeb still
achieve a 1.75 percentage point advantage over those trained on
FineWeb-V1.1.0.
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3
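
The abstract describes the GneissWeb recipe only at a high level: sharded exact sub-string deduplication followed by an ensemble of quality filters. The Python sketch below is a toy illustration of that kind of pipeline, not the published GneissWeb implementation; the windowed hashing, the three heuristic filters, the thresholds, and the voting rule are all assumptions invented for illustration.

```python
# Toy sketch of a dedup-then-filter pipeline (NOT the published GneissWeb
# recipe): exact sub-string deduplication followed by a small ensemble of
# quality filters. All heuristics and thresholds are illustrative assumptions.

import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str


def dedup_exact_substrings(docs: List[Document], window: int = 64) -> List[Document]:
    """Drop character windows already seen elsewhere in the corpus.

    Stand-in for sharded exact sub-string deduplication; real pipelines
    operate on token sequences across shards (e.g. via suffix arrays).
    """
    seen = set()
    deduped = []
    for doc in docs:
        kept = []
        for start in range(0, len(doc.text), window):
            chunk = doc.text[start:start + window]
            digest = hashlib.sha1(chunk.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                kept.append(chunk)
        if kept:
            deduped.append(Document(doc.doc_id, "".join(kept)))
    return deduped


# --- Illustrative quality filters ---------------------------------------

def long_enough(doc: Document, min_chars: int = 200) -> bool:
    return len(doc.text) >= min_chars


def low_line_repetition(doc: Document, max_dup_frac: float = 0.3) -> bool:
    lines = [ln for ln in doc.text.splitlines() if ln.strip()]
    if not lines:
        return False
    return 1.0 - len(set(lines)) / len(lines) <= max_dup_frac


def mostly_alphabetic(doc: Document, min_frac: float = 0.6) -> bool:
    if not doc.text:
        return False
    good = sum(c.isalpha() or c.isspace() for c in doc.text)
    return good / len(doc.text) >= min_frac


QUALITY_FILTERS = [long_enough, low_line_repetition, mostly_alphabetic]


def passes_ensemble(doc: Document, min_votes: int = 3) -> bool:
    """Keep a document only if at least `min_votes` filters approve it."""
    return sum(f(doc) for f in QUALITY_FILTERS) >= min_votes


def build_dataset(docs: List[Document]) -> List[Document]:
    return [d for d in dedup_exact_substrings(docs) if passes_ensemble(d)]


if __name__ == "__main__":
    corpus = [
        Document("clean", "The quick brown fox jumps over the lazy dog. " * 20),
        Document("spammy", "buy now buy now\n" * 60),
    ]
    print([d.doc_id for d in build_dataset(corpus)])  # -> ['clean']
```

The only structural point the sketch mirrors is the ordering: deduplicate first so each document is scored once, then keep only documents that the filter ensemble approves. How the actual GneissWeb ensemble is constructed and tuned is detailed in the paper, not here.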