Kavli Affiliate: Jia Liu | First 5 Authors: Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert Nowak | Summary: Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its […]
GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data