DafnyBench: A Benchmark for Formal Software Verification – Kavli Institute Pre-Print Publications

Kavli Affiliate: Max Tegmark

| First 5 Authors: Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun

| Summary:

We introduce DafnyBench, the largest benchmark of its kind for training and
evaluating machine learning systems for formal software verification. We test
the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints
for the Dafny formal verification engine to successfully verify over 750
programs with about 53,000 lines of code. The best model and prompting scheme
achieved 68% success rate, and we quantify how this rate improves when retrying
with error message feedback and how it deteriorates with the amount of required
code and hints. We hope that DafnyBench will enable rapid improvements from
this baseline as LLMs and verification techniques grow in quality.

| Search Query: ArXiv Query: search_query=au:”Max Tegmark”&id_list=&start=0&max_results=3