
3 posts tagged with "Inference"


vLLM Optimization Techniques: 5 Practical Methods to Improve Performance

· 26 min read
Jaydev Tonde
Data Scientist

vLLM optimization techniques cover artwork with five performance methods highlighted

Running large language models efficiently can be challenging. You want good performance without overloading your servers or exceeding your budget. That's where vLLM comes in - but even this powerful inference engine can be made faster and smarter.

In this post, we'll explore five cutting-edge optimization techniques that can dramatically improve your vLLM performance:

  1. Prefix Caching - Stop recomputing what you've already computed
  2. FP8 KV-Cache - Store the cache in 8-bit precision to fit more context into GPU memory
  3. CPU Offloading - Make your CPU and GPU work together
  4. Disaggregated P/D - Run prefill and decode on separate workers for better scaling
  5. Zero Reload Sleep Mode - Keep your models warm without wasting resources

Each technique addresses a different bottleneck, and together they can significantly improve your inference pipeline performance. Let's explore how these optimizations work.
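To make this concrete, here is a minimal sketch of how the first three options can be switched on through vLLM's offline LLM API. The model name is only an example, and argument names can vary between vLLM releases; disaggregated P/D and sleep mode are deployment-level features covered in the post itself.

```python
# Minimal sketch (assumes a recent vLLM release; argument names may differ by version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model, swap in your own
    enable_prefix_caching=True,                # 1. reuse KV blocks for shared prompt prefixes
    kv_cache_dtype="fp8",                      # 2. keep the KV-cache in 8-bit floating point
    cpu_offload_gb=4,                          # 3. spill part of the weights to CPU RAM
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain prefix caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```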

Disaggregated Prefill-Decode: The Architecture Behind Meta's LLM Serving

· 11 min read
Vishnu Subramanian
Founder @JarvisLabs.ai

Disaggregated Prefill-Decode Architecture

Why I'm Writing This Series

I've been deep in research mode lately, studying how to optimize LLM inference. The goal is to eventually integrate these techniques into JarvisLabs - making it easier for our users to serve models efficiently without having to become infrastructure experts themselves.

As I learn, I want to share what I find. This series is part research notes, part explainer. If you're trying to understand LLM serving optimization, hopefully my journey saves you some time.

This first post covers disaggregated prefill-decode - a pattern I discovered while reading through the vLLM router repository. Meta's team has been working closely with vLLM on this, and it solves a fundamental problem that's been on my mind.

Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference

· 34 min read
Jaydev Tonde
Data Scientist

Speculative Decoding vLLM Cover

Introduction

Ever waited for an AI chatbot to finish its answer, watching the text appear word by word? It can feel painfully slow, especially when you need a fast response from a powerful Large Language Model (LLM).

The root of the problem is how LLMs generate text. They don't write a paragraph all at once; they follow a strict, word-by-word process:

  1. The model looks at the prompt and the words it has generated so far.
  2. It calculates the best next word (token).
  3. It adds that word to the text.
  4. It repeats the whole process for the next word.

Each step involves complex calculations, meaning the more text you ask for, the longer the wait. For developers building real-time applications (like chatbots, code assistants, or RAG systems), this slowness (high latency) is a major problem.
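To see why this loop is costly, here is a toy version of it written with Hugging Face transformers rather than vLLM (gpt2 is used purely as a small example model). Each pass through the loop runs the full model to produce exactly one new token.

```python
# Toy illustration of the token-by-token generation loop described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):                                   # generate 10 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits              # 1. look at the prompt + text so far
    next_token = logits[:, -1, :].argmax(dim=-1)      # 2. pick the most likely next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)  # 3. append it
    # 4. repeat: the whole model runs again for every single new token

print(tokenizer.decode(input_ids[0]))
```

Speculative decoding targets exactly this cost: a small draft model proposes several tokens at once, and the large model verifies them in a single pass.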