Kavli Affiliate: Jia Liu
| First 5 Authors: Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert Nowak
| Summary:
Large language models require vast amounts of high-quality training data, but
effective filtering of web-scale datasets remains a significant challenge. This
paper demonstrates that GPT-4o is remarkably effective at identifying
high-quality training data, but its prohibitive cost makes it impractical at
web scale. We propose SIEVE, a lightweight alternative that matches GPT-4o
accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering
operations for the cost of one GPT-4o filtering call. The key to SIEVE is a
seamless integration of GPT-4o and lightweight text classification models,
using active learning to fine-tune these models in the background with a small
number of calls to GPT-4o. Once trained, these models perform as well as GPT-4o at a
tiny fraction of the cost. Through different filtering prompts, SIEVE can
efficiently curate high-quality data for general or specialized domains from
web-scale corpora, a valuable capability given the current scarcity of
high-quality domain-specific datasets. Extensive experiments using automatic
and human evaluation metrics show that SIEVE and GPT-4o achieve similar
performance on five highly specific filtering prompts. In addition, when
performing quality filtering on web-crawl datasets, we demonstrate that SIEVE
further improves on state-of-the-art quality filtering methods in the
DataComp-LM challenge for selecting LLM pretraining data.
| Search Query: ArXiv Query: search_query=au:"Jia Liu"&id_list=&start=0&max_results=3
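
The active-learning loop described in the summary, in which a cheap classifier is trained to imitate GPT-4o using only a small number of oracle calls and is then applied to the rest of the corpus, can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: it substitutes a TF-IDF plus logistic-regression model for the fine-tuned lightweight classifiers, uses plain uncertainty sampling as the query strategy, and `gpt4o_quality_label` is a hypothetical stand-in for a real GPT-4o filtering call.

# Minimal sketch of active-learning distillation: a cheap classifier learns
# to imitate an expensive oracle (GPT-4o in the paper) so that only a small
# number of oracle calls is needed before filtering the full corpus cheaply.
# Assumptions not taken from the paper: TF-IDF + logistic regression stands
# in for the lightweight text classifier, and the oracle below is a stub.

from typing import Callable, List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def gpt4o_quality_label(text: str) -> int:
    """Hypothetical oracle: ask GPT-4o whether `text` passes the filtering
    prompt (1 = keep, 0 = discard). Stubbed out here."""
    raise NotImplementedError("Replace with a real GPT-4o filtering call.")


def active_distillation(
    pool: List[str],
    oracle: Callable[[str], int],
    seed_size: int = 32,
    rounds: int = 10,
    batch_size: int = 16,
) -> LogisticRegression:
    """Train a lightweight filter by querying the oracle only on the
    examples the current model is least certain about."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    X = vectorizer.fit_transform(pool)

    # Start from a small random seed set labeled by the oracle.
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(pool), size=seed_size, replace=False))
    labels = {i: oracle(pool[i]) for i in labeled}

    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X[labeled], [labels[i] for i in labeled])

        # Uncertainty sampling: send to the oracle only the unlabeled texts
        # whose predicted keep-probability is closest to 0.5.
        probs = clf.predict_proba(X)[:, 1]
        uncertainty = -np.abs(probs - 0.5)
        candidates = [i for i in np.argsort(uncertainty)[::-1] if i not in labels]
        for i in candidates[:batch_size]:
            labels[i] = oracle(pool[i])
            labeled.append(i)

    # Final fit; the returned model is then applied to the remaining corpus
    # at negligible cost per document.
    clf.fit(X[labeled], [labels[i] for i in labeled])
    return clf

The sketch captures only the cost argument made in the abstract: the oracle is consulted roughly seed_size + rounds * batch_size times, after which every further filtering decision is a cheap local prediction rather than a GPT-4o call.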