ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows

Kavli Affiliate: Zheng Zhu

| First 5 Authors: Penghao Wang

| Summary:

As large language models (LLMs) advance, the ultimate vision for their role
in science is coming into focus: an AI collaborator that effectively assists
humans throughout the entire scientific research process. We refer to
this envisioned system as ResearchGPT. Given that scientific research
progresses through multiple interdependent phases, achieving this vision
requires rigorous benchmarks that evaluate the end-to-end workflow rather than
isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of
scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It
is constructed through a scalable, paper-grounded pipeline that combines
retrieval-augmented generation (RAG) with multi-stage quality control to ensure
factual grounding. From this unified corpus, we derive two complementary
subsets: CS-4k, a carefully curated benchmark for evaluating AI’s ability to
assist scientific research, and CS-50k, a large-scale training dataset.
Extensive experiments show that CS-4k stratifies state-of-the-art LLMs
into distinct capability tiers. Open models trained on CS-50k with supervised
fine-tuning and reinforcement learning achieve substantial improvements. Even
7B-scale models, when properly trained, outperform many larger proprietary
systems, such as GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that
making AI models better research assistants relies more on domain-aligned
training with high-quality data than on pretraining scale or general benchmark
performance. We release CS-4k and CS-50k in the hope of fostering AI systems as
reliable collaborators in CS research.
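
To make the pipeline description above more concrete, here is a minimal, illustrative Python sketch of one way a paper-grounded RAG generation loop with multi-stage quality control could be structured. Every function name, check, and threshold here is an assumption made for exposition; this is not the authors' released implementation, and a real pipeline would use an actual retriever and LLM generator rather than the stand-ins shown.

```python
"""Hypothetical sketch of a paper-grounded RAG Q&A pipeline with
multi-stage quality control. All names and thresholds are illustrative
assumptions, not the CS-54k authors' actual code."""

from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    source_passages: list  # passages the answer must be grounded in


def retrieve_passages(paper_text: str, k: int = 5) -> list:
    # Stand-in retriever: a real pipeline would run dense or BM25
    # retrieval over chunked paper text instead of naive splitting.
    chunks = [c.strip() for c in paper_text.split("\n\n") if c.strip()]
    return chunks[:k]


def generate_qa(passages: list) -> QAPair:
    # Stand-in generator: a real pipeline would prompt an LLM with the
    # retrieved passages and ask for a grounded question/answer pair.
    context = passages[0]
    return QAPair(
        question=f"What does the paper say about: {context[:40]}...?",
        answer=context,
        source_passages=passages,
    )


def passes_quality_control(qa: QAPair) -> bool:
    # Stage 1: grounding check -- the answer must overlap with a
    # retrieved source passage, enforcing factual grounding.
    grounded = any(qa.answer in p or p in qa.answer
                   for p in qa.source_passages)
    # Stage 2: well-formedness check -- reject trivially short pairs.
    well_formed = len(qa.question) > 15 and len(qa.answer) > 30
    return grounded and well_formed


def build_corpus(papers: list) -> list:
    # Generate one Q&A pair per paper and keep only those that survive
    # every quality-control stage.
    corpus = []
    for paper_text in papers:
        passages = retrieve_passages(paper_text)
        qa = generate_qa(passages)
        if passes_quality_control(qa):
            corpus.append(qa)
    return corpus


if __name__ == "__main__":
    demo_paper = (
        "Retrieval-augmented generation grounds LLM outputs in "
        "retrieved text.\n\n"
        "Quality control filters out ungrounded pairs."
    )
    print(build_corpus([demo_paper]))
```

The design point this sketch illustrates is that grounding is enforced structurally: each candidate pair carries its source passages with it, so the quality-control stage can reject any answer that cannot be traced back to the paper text.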

| Search Query: ArXiv Query: search_query=au:"Zheng Zhu"&id_list=&start=0&max_results=3
