AI Engineer April 10, 2026

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

Summary

The talk covers using Large Language Models (LLMs) as judges to evaluate and monitor AI agents, highlighting the difficulty of hallucination detection and the need for calibrated evaluation techniques. The speaker introduces prompt optimization algorithms such as GEPA as a way to align LLM judges with human annotations, addressing evaluation speed as a critical bottleneck in AI development. The key practical takeaway is that a calibrated LLM judge whose verdicts correlate with human annotations can significantly accelerate both offline development iterations and online production monitoring, enabling faster and more accurate performance assessment of AI systems.
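
One way to make "correlates with human annotation" concrete is to score a judge's verdicts against human labels on the same examples. The sketch below is illustrative only and is not taken from the talk: the label lists are made-up data, the function names are hypothetical, and Cohen's kappa is just one common agreement measure, assumed here for the sake of example.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of examples where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for six agent responses (illustrative data only).
human_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]

rate = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"agreement={rate:.3f} kappa={kappa:.3f}")
```

A prompt optimizer could use a score like this as its objective, re-scoring the judge on held-out human annotations after each prompt revision; how GEPA itself defines its objective is not specified in this summary.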
