A benchmark for vericoding: formally verified program synthesis

Kavli Affiliate: Max Tegmark

| First 5 Authors: Sergiu Bursuc, Sergiu Bursuc, , ,

| Summary:

We present and test the largest benchmark for vericoding, LLM-generation of
formally verified code from formal specifications – in contrast to vibe coding,
which generates potentially buggy code from a natural language description. Our
benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in
Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find
vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny
using off-the-shelf LLMs. Adding natural-language descriptions does not
significantly improve performance. We also find that LLM progress has improved
progress on pure Dafny verification from 68% to 96% over the past year. The
benchmark and vericoding results are shared at
https://github.com/Beneficial-AI-Foundation/vericoding-benchmark

| Search Query: ArXiv Query: search_query=au:”Max Tegmark”&id_list=&start=0&max_results=3