AI code benchmarks lied to us
Summary
The transcript questions the reliability and validity of current AI coding benchmarks, arguing that existing tests such as SWE-Bench Pro are flawed and do not represent real-world coding challenges. The speaker criticizes these benchmarks on three grounds: they are contaminated, they let models cheat, and they do not reflect the problems developers actually solve with AI agents. The key takeaway is the introduction of DBSE, a new benchmark that promises to measure AI coding performance more accurately by simulating realistic project development scenarios.