Benchmarks for AI-assisted Formal Verification

Theodore Ehrenborg · Beneficial AI Foundation / PIBBSS

March 2026

LLMs can generate mathematical proofs, but can they verify *software*? This talk presents two benchmarks for AI-assisted formal verification: the largest known suite spanning Lean (7,141), Verus (2,334), and Dafny (3,029) examples — with success rates of 27%, 44%, and 82% respectively — and a harder set of manually-checked specifications drawn from the curve25519-dalek production cryptography library, where some proofs take a human expert days to complete.

Readings

arXiv:2509.22908