Theo May 31, 2026

AI code benchmarks lied to us

Summary

The transcript examines the reliability and validity of current AI coding benchmarks. The speaker argues that existing tests such as SWE-Bench Pro are flawed: they are contaminated, they let models cheat, and they do not reflect the problems developers actually solve with AI agents. The key takeaway is the introduction of DBSE, a new benchmark that promises to measure AI coding performance more accurately by simulating realistic project development scenarios.
