AI Engineer World’s Fair 2025 - Reasoning + RL
Summary
The main theme is using reinforcement learning (RL) methods such as PPO and GRPO to improve large language models: sample non-deterministic rollouts from the model, then weight each rollout by its advantage, that is, how much better or worse it scored than a baseline. Key subjects include sampling, advantage estimation, instruction tuning, and the challenges of navigating a rapidly evolving research landscape. The practical takeaway is that GRPO offers a computationally efficient and effective middle ground for surgically improving model behavior by learning from the subtle differences between successful and unsuccessful generations.
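As a rough illustration of the advantage-estimation idea, the sketch below computes group-relative advantages in the style commonly associated with GRPO: each rollout's reward is compared with the mean reward of the other rollouts sampled for the same prompt. The function name, the reward values, the `eps` term, and the choice of population standard deviation are assumptions made for this example, not details taken from the session.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for a single prompt.

    Hypothetical helper for illustration. A rollout that scores above the
    group mean gets a positive advantage; one below gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every rollout receives the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same prompt: two judged successful (1.0), two not (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Here the successful rollouts receive advantages close to +1 and the unsuccessful ones close to -1. When all rollouts in a group receive the same reward, every advantage is zero, so that group contributes no learning signal in this formulation.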