Theo May 31, 2026

AI code benchmarks lied to us

Summary

The transcript examines the reliability and validity of current AI coding benchmarks. The speaker argues that existing tests such as SWE-Bench Pro are flawed: they are contaminated, they let models cheat, and they do not reflect the problems developers actually solve with AI agents. The key takeaway is the introduction of DBSE, a new benchmark that promises to measure AI coding performance more accurately by simulating realistic project development scenarios.

