AI Engineer June 6, 2025

AI Engineer World’s Fair 2025 - Reasoning + RL

Summary

The main theme is using reinforcement learning (RL) techniques such as PPO and GRPO to improve large language models: the model samples several non-deterministic rollouts for a prompt, and each rollout's advantage estimate indicates how much better or worse it scored relative to a baseline. Key subjects include sampling, advantage estimation, instruction tuning, and the challenge of navigating a rapidly evolving research landscape. The practical takeaway is that GRPO offers a computationally efficient and effective middle ground for surgically improving model behavior by learning from the subtle differences between successful and unsuccessful generations.
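To make the idea of learning from differences between generations concrete, here is a minimal sketch of a group-relative advantage as it is commonly described for GRPO: each rollout's reward is normalized against the mean and standard deviation of the other rollouts sampled for the same prompt. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions for illustration and do not come from the talk.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per rollout
    sampled for a single prompt. Using the population standard deviation
    is an assumption; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four rollouts of one prompt (1.0 = success).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful rollouts end up with positive advantages and unsuccessful ones with negative advantages, so a policy update weighted by these values pushes probability toward the better generations. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.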

