CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

Kavli Affiliate: Debanjan Chowdhury

| First 5 Authors: Haining Pan

| Summary:

Large language models (LLMs) have shown remarkable progress in coding and
math problem-solving, but evaluation on advanced research-level problems in
hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a
dataset of 50 problems covering condensed matter theory (CMT) at the level of
an expert researcher. Topics span analytical and computational approaches in
quantum many-body physics and classical statistical mechanics. The dataset was
designed and verified by a panel of expert researchers from around the world.
We built the dataset through a collaborative environment that challenges the
panel to write and refine problems they would want a research assistant to
solve, including Hartree-Fock, exact diagonalization, quantum/variational Monte
Carlo, density matrix renormalization group (DMRG), quantum/classical
statistical mechanics, and model building. We evaluate LLMs by programmatically
checking solutions against expert-supplied ground truth. We developed
machine-grading tools, including symbolic handling of non-commuting operators
via normal ordering, that generalize across tasks. Our evaluations show that
frontier models struggle with all of the problems in the dataset, highlighting
a gap in the physical reasoning skills of current LLMs. Notably, experts
identified strategies for creating increasingly difficult problems by
interacting with the LLMs and exploiting common failure modes. The best model,
GPT5, solves 30% of the problems; the average across 17 models (GPT, Gemini,
Claude, DeepSeek, Llama) is 11.4$\pm$2.1%. Moreover, 18 problems are solved by
none of the 17 models, and 26 by at most one. These unsolved problems span
Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Answers sometimes
violate fundamental symmetries or have unphysical scaling dimensions. We
believe this benchmark will guide development toward capable AI research
assistants and tutors.
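
To illustrate the idea of grading non-commuting operator expressions via normal ordering, here is a minimal sketch. It is not the benchmark's grader. It assumes a single bosonic mode with [a, a†] = 1, and the names `normal_order` and `equivalent` and the word-tuple encoding (`"a"` for annihilation, `"ad"` for creation) are hypothetical choices made for this example. Two expressions are treated as equal when their normal-ordered forms match term by term.

```python
from collections import defaultdict
from fractions import Fraction


def normal_order(expr):
    """Normal-order a single-mode bosonic expression (assumes [a, ad] = 1).

    expr maps operator words (tuples of "a" / "ad") to coefficients.
    Returns {(m, n): coeff}, meaning coeff * ad^m a^n.
    """
    result = defaultdict(Fraction)
    stack = [(tuple(word), Fraction(c)) for word, c in expr.items()]
    while stack:
        word, c = stack.pop()
        for i in range(len(word) - 1):
            # Out-of-order pair: use a ad = ad a + 1.
            if word[i] == "a" and word[i + 1] == "ad":
                stack.append((word[:i] + ("ad", "a") + word[i + 2:], c))
                stack.append((word[:i] + word[i + 2:], c))
                break
        else:
            # Word is already normal ordered: all "ad" precede all "a".
            result[(word.count("ad"), word.count("a"))] += c
    return {k: v for k, v in result.items() if v != 0}


def equivalent(answer, reference):
    """Hypothetical grading check: compare canonical normal-ordered forms."""
    return normal_order(answer) == normal_order(reference)


# A submitted answer "a ad" should match the reference "ad a + 1".
print(equivalent({("a", "ad"): 1}, {("ad", "a"): 1, (): 1}))
```

Reducing both sides to a canonical form avoids false negatives when a model writes a correct answer with the operators in a different order. A practical grader would also need to handle multiple modes, fermionic signs, and symbolic coefficients, none of which this sketch covers.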

| Search Query: ArXiv Query: search_query=au:"Debanjan Chowdhury"&id_list=&start=0&max_results=3