How fast are LLM inference engines anyway? — Charles Frye, Modal
Summary
The main theme is how quickly running open-weight AI models on inference engines has advanced and become accessible. Key subjects include models such as Llama and DeepSeek, along with software developments such as KV caching and speculative decoding, which make previously difficult tasks practical. The practical takeaway is that open-source engines and open-weight models have democratized AI development: training a custom model is usually unnecessary, except for highly specific cases such as government or air-gapped applications.