AI Engineer June 27, 2025

How fast are LLM inference engines anyway? — Charles Frye, Modal

Summary

The main theme is the rapid progress and growing accessibility of running open-weight AI models on inference engines. Key subjects include models such as Llama and DeepSeek, along with software techniques such as KV caching and speculative decoding, which make previously difficult tasks feasible. The practical takeaway is that open-source engines and models have democratized AI development: training a custom model is usually unnecessary except for highly specific cases such as government or air-gapped applications.
