Kavli Affiliate: Zhuo Li
| First 5 Authors: Zhuo Li, Yuhao Du, Xiaoqi Jiao, Yiwen Guo, Yuege Feng
| Summary:
Selecting high-quality and diverse training samples from extensive datasets
plays a crucial role in reducing training overhead and enhancing the
performance of Large Language Models (LLMs). However, existing studies fall
short of assessing the overall value of the selected data: they focus
primarily on the quality of individual samples, and they struggle to balance
ensuring diversity against minimizing the number of data points that must be
traversed. Therefore, this paper
introduces a novel choice-based sample selection framework that shifts the
focus from evaluating individual sample quality to comparing the contribution
value of different samples when incorporated into the subset. Thanks to the
advanced language understanding capabilities of LLMs, we utilize LLMs to
evaluate the value of each option during the selection process. Furthermore,
we design a greedy sampling process in which samples are incrementally added
to the subset, improving efficiency by avoiding exhaustive traversal of the
entire dataset under a limited selection budget. Extensive experiments
demonstrate that the data selected by our method not only surpasses the
performance of training on the full dataset but also achieves results
competitive with state-of-the-art (SOTA) methods while requiring fewer
selections. Moreover, we validate our approach on a larger medical dataset,
highlighting its practicality in real-world settings.
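
The abstract does not give implementation details, but a minimal sketch of what such a choice-based greedy selection loop could look like is shown below. It is written under stated assumptions, not as the paper's actual method: the names `greedy_select`, `llm_choose`, and `pool_size`, the prompt wording, and the random candidate pooling are all illustrative placeholders, and `llm` is a stand-in callable for whatever model the authors use.

```python
import random


def llm_choose(subset, candidates, llm):
    """Ask an LLM which candidate contributes the most value to the subset.

    `llm` is a stand-in callable mapping a prompt string to a reply string;
    the paper's actual prompt design and scoring are not specified here.
    """
    prompt = (
        "Current subset (last few samples, abridged):\n"
        + "\n".join(s[:200] for s in subset[-5:])
        + "\n\nCandidates:\n"
        + "\n".join(f"({i}) {c[:200]}" for i, c in enumerate(candidates))
        + "\n\nReply with the index of the candidate that adds the most "
        "new, high-quality information to the subset."
    )
    reply = llm(prompt)
    # Parse the first integer in the reply; fall back to candidate 0.
    for token in reply.split():
        digits = token.strip("().,")
        if digits.isdigit():
            return int(digits) % len(candidates)
    return 0


def greedy_select(dataset, budget, pool_size, llm, seed=0):
    """Greedily grow a subset: at each step, draw a small random pool of
    unseen samples and let the LLM pick the one with the highest marginal
    contribution, so the full dataset is never exhaustively traversed."""
    rng = random.Random(seed)
    remaining = list(dataset)
    subset = []
    while len(subset) < budget and remaining:
        pool = rng.sample(remaining, min(pool_size, len(remaining)))
        winner = pool[llm_choose(subset, pool, llm)]
        subset.append(winner)
        remaining.remove(winner)
    return subset
```

Under these assumptions, a call like `greedy_select(data, budget=1000, pool_size=8, llm=client)` would issue on the order of `budget` LLM calls rather than one per data point, which is the efficiency property the summary attributes to the greedy sampling process.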
| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=3