EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

Kavli Affiliate: Ke Wang

| First 5 Authors: Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang

| Summary:

Equivalence checking, i.e., determining whether two programs produce
identical outputs for all possible inputs, underpins a broad range of
applications, including software refactoring, testing, and optimization. We
present the task of equivalence checking as a new way to evaluate the code
reasoning abilities of large language models (LLMs). We introduce EquiBench, a
dataset of 2400 program pairs spanning four programming languages and six
equivalence categories. These pairs are systematically generated through
program analysis, compiler scheduling, and superoptimization, covering
nontrivial structural transformations that demand deep semantic reasoning
beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs
shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In
the most challenging categories, the best accuracies are 62.3% and 68.8%, only
modestly above the 50% random baseline for binary classification, indicating
significant room for improvement in current models’ code reasoning
capabilities.
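To make the task concrete, here is a minimal sketch of what one such binary-classification item could look like: two structurally different programs that agree on every input in their intended domain (non-negative integers here), and a prompt that asks a model to label the pair. The programs, the prompt wording, and the `build_prompt` helper are illustrative assumptions for this post, not items drawn from the EquiBench dataset.

```python
# Illustrative sketch of an equivalence-checking item framed as binary
# classification. The programs and prompt wording are assumptions for
# illustration; they are not taken from EquiBench.

PROGRAM_A = """
def total(n):
    # Iterative accumulation of 0 + 1 + ... + (n - 1)
    s = 0
    for i in range(n):
        s += i
    return s
"""

PROGRAM_B = """
def total(n):
    # Closed-form sum of 0 + 1 + ... + (n - 1); agrees with the loop
    # for every non-negative integer n despite the structural difference
    return n * (n - 1) // 2
"""

def build_prompt(prog_a: str, prog_b: str) -> str:
    """Frame the pair as the binary task described in the abstract:
    answer 'equivalent' only if the two programs produce identical
    outputs for all inputs in the stated domain."""
    return (
        "Determine whether the two programs below produce identical outputs "
        "for all non-negative integer inputs. Answer with exactly one word: "
        "'equivalent' or 'inequivalent'.\n\n"
        f"Program 1:\n{prog_a}\nProgram 2:\n{prog_b}\n"
    )

if __name__ == "__main__":
    # Print the prompt that would be sent to a model under evaluation.
    print(build_prompt(PROGRAM_A, PROGRAM_B))
```

Pairs like this capture why the task stresses semantic rather than syntactic reasoning: the two programs share almost no surface structure, so the model must reason about what each one computes to reach the correct label.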

