How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe
Summary
The presentation focuses on evaluations (evals) in AI application development, highlighting the importance of measuring and testing AI systems that produce nondeterministic outputs. The speaker, a lead engineer at Adobe, emphasizes the need for robust testing methods, metrics, and tools to assess an AI application's performance, alignment with its goals, and trustworthiness. Key challenges discussed include testing prompt variations, measuring accuracy, and ensuring continuous improvement. The practical takeaway is that comprehensive evaluation strategies are essential for building reliable, accountable, and increasingly capable AI applications.
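
To make the challenges of testing prompt variations and measuring accuracy on nondeterministic outputs more concrete, here is a minimal sketch in Python. It is not taken from the talk. The names evaluate_prompts and stub_model, the prompt templates, and the test cases are all hypothetical, and stub_model is a deterministic placeholder standing in for a real model call. The sketch runs each prompt variant several times per case and reports two numbers: accuracy (exact match against an expected answer) and consistency (the share of cases where every run returned the same answer).

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def evaluate_prompts(
    model: Callable[[str], str],
    templates: Dict[str, str],
    cases: List[Tuple[str, str]],
    runs: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Score each prompt template for accuracy and run-to-run consistency."""
    report: Dict[str, Dict[str, float]] = {}
    for name, template in templates.items():
        correct = 0  # exact matches across all runs
        stable = 0   # cases where all runs produced the same output
        for question, expected in cases:
            prompt = template.format(question=question)
            # Repeat the call because real model outputs may vary between runs.
            outputs = [model(prompt).strip().lower() for _ in range(runs)]
            correct += sum(out == expected.lower() for out in outputs)
            _, top_count = Counter(outputs).most_common(1)[0]
            if top_count == runs:
                stable += 1
        report[name] = {
            "accuracy": correct / (len(cases) * runs),
            "consistency": stable / len(cases),
        }
    return report


def stub_model(prompt: str) -> str:
    """Hypothetical placeholder for a real model call (assumption)."""
    return "4" if "2 + 2" in prompt else "unknown"


# Hypothetical prompt variants and test cases, for illustration only.
templates = {
    "plain": "{question}",
    "polite": "Please answer briefly: {question}",
}
cases = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "paris"),
]

report = evaluate_prompts(stub_model, templates, cases)
for name, scores in report.items():
    print(name, scores)
```

Exact-match accuracy is only one possible metric. In practice the comparison inside the loop would likely be swapped for whatever scoring function fits the application, while the repeated-run structure stays the same.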