Evals Are Broken, Use Them Anyway — Ara Khan, Cline
Summary
This talk critiques common, often misleading practices for evaluating AI models. Although both objective benchmarks and subjective, taste-based assessments are flawed, the presenter encourages developers to keep using evals and to interpret them strategically for their own agentic flows. The key takeaway is to understand the limitations of current evaluation methods and to apply them with critical awareness.