CodeScore: Evaluating Code Generation by Learning Code Execution

Kavli Affiliate: Zhuo Li

| First 5 Authors: Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li

| Summary:

A proper code evaluation metric (CEM) profoundly impacts the evolution of
code generation, which is an important research field in NLP and software
engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU)
suffer from two significant drawbacks. (1) They primarily measure surface-level
differences between code snippets without considering functional equivalence.
Functional equivalence, however, is pivotal in evaluating code generation,
because different code snippets can implement identical functionality. (2) They
are designed predominantly for the Ref-only input format, whereas code
evaluation demands versatility in input formats: besides Ref-only, there are
NL-only and Ref&NL formats, which existing match-based CEMs cannot
effectively accommodate. In this paper, we propose CodeScore, a large language
model (LLM)-based CEM that estimates the functional correctness of generated
code across all three input formats. To obtain CodeScore, we present UniCE, a
unified code generation learning framework in which LLMs learn code execution
(i.e., the PassRatio and Executability of generated code) with unified input.
Extensive experimental results on multiple code evaluation datasets demonstrate
that CodeScore improves correlation with functional correctness by up to
58.87% (absolute) over other CEMs, achieves state-of-the-art performance, and
effectively handles all three input formats.
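
To make the three input formats concrete, the following minimal Python sketch
shows what each format pairs with the generated code. The field names
(generated_code, reference_code, nl_description) are illustrative assumptions,
not a schema taken from the paper.

```python
# Hypothetical illustration of the three input formats named in the summary.
# Field names are assumptions made for this sketch only.
gen = "def add(a, b):\n    return a + b"          # generated code under evaluation
ref = "def add(x, y):\n    return x + y"          # reference solution
desc = "Write a function that adds two numbers."  # natural-language description

ref_only = {"generated_code": gen, "reference_code": ref}
nl_only = {"generated_code": gen, "nl_description": desc}
ref_and_nl = {"generated_code": gen, "reference_code": ref, "nl_description": desc}

# Match-based CEMs such as BLEU or CodeBLEU can only consume ref_only;
# a unified CEM like CodeScore is meant to score all three.
```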
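
As a rough illustration of the execution signals mentioned above, the sketch
below computes a PassRatio (the fraction of test cases whose output matches
the expected output) and an Executability flag (whether the code runs without
crashing) for a generated Python program. This is an assumption-laden sketch
of the two quantities named in the summary, not the paper's actual training
pipeline; treating timeouts as non-executable is likewise an assumption.

```python
import subprocess
import sys
import tempfile

def pass_ratio_and_executability(code, test_cases, timeout=5.0):
    """Estimate (PassRatio, Executability) for a generated Python snippet.

    PassRatio: fraction of (stdin, expected_stdout) test cases whose actual
    stdout matches the expectation.
    Executability: True iff the code runs on every test without crashing
    (a timeout is treated as a crash here, which is an assumption).
    """
    # Write the generated code to a temporary script we can execute.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name

    passed = 0
    executable = True
    for stdin_text, expected_out in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            executable = False
            continue
        if result.returncode != 0:
            executable = False
            continue
        if result.stdout.strip() == expected_out.strip():
            passed += 1

    ratio = passed / len(test_cases) if test_cases else 0.0
    return ratio, executable

if __name__ == "__main__":
    # Hypothetical test: the program should print 3 given empty stdin.
    tests = [("", "3\n")]
    print(pass_ratio_and_executability("print(1 + 2)", tests))  # (1.0, True)
```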

| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=3
