
The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices

· 45 min read
Jaydev Tonde
Data Scientist

[Figure: vLLM quantization benchmark results]

Introduction

If you've worked with large language models, you've probably run into a common problem: these models are huge and need a lot of GPU memory to run. A 32B-parameter model holds roughly 64 GB of weights alone in 16-bit precision (32 billion parameters × 2 bytes each), so it can easily eat up 60+ GB of memory in its default form. That's where quantization comes in.
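To see where that number comes from, here's a quick back-of-the-envelope calculation. It's purely illustrative and counts only the weights, ignoring activations, the KV cache, and framework overhead:

```python
# Rough weight-only memory footprint of a 32B-parameter model.
# Ignores activations, KV cache, and framework overhead.
PARAMS = 32e9

for precision, bytes_per_param in [("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{precision:10s} ~ {gib:5.1f} GiB of weights")
```

Dropping from 16-bit to 4-bit shrinks the weights from about 60 GiB to about 15 GiB, which is exactly why quantization is so attractive for single-GPU serving.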

What is quantization? Simply put, it's the process of reducing the numerical precision of model weights. Instead of storing each weight as a 16-bit floating-point number, we can store it as a 4-bit or 8-bit integer. This makes the model smaller and, because inference is usually memory-bound, often faster to run.
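To make that concrete, here is a minimal sketch of the core idea behind most weight-only schemes: map floating-point weights onto a small integer grid using a scale factor, then dequantize on the fly when the weights are needed. Real methods like AWQ and GPTQ use per-group scales and much smarter rounding, so treat this as an illustration of the principle, not of any specific algorithm:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization with one per-tensor scale."""
    scale = np.abs(w).max() / 127.0              # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs rounding error:", np.abs(w - dequantize(q, scale)).max())
```

The entire game in quantization research is keeping that rounding error from hurting model quality while still getting the memory and speed wins.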

In this blog post, we are going to:

  1. Learn about different quantization techniques available in vLLM
  2. See how each one works under the hood
  3. Run actual benchmarks on an H200 GPU using Qwen2.5-32B-Instruct
  4. Help you decide which technique to use for your use case

The techniques we'll cover include AWQ, GPTQ, Marlin, BitBLAS, GGUF, BitsandBytes, and more (there's a short vLLM loading sketch right after this list). We'll test 4-bit quantization and measure three things:

  1. perplexity (model quality),
  2. code generation accuracy (HumanEval),
  3. and inference speed (ShareGPT benchmark).
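As promised above, here is roughly what loading and running a quantized checkpoint in vLLM looks like. This is a minimal sketch: the AWQ checkpoint name and the sampling settings are placeholders for whatever you actually deploy, and recent vLLM versions can usually infer the quantization method from the checkpoint config, so the explicit `quantization` argument is often optional:

```python
from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint; vLLM picks the matching kernel.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # placeholder: any AWQ/GPTQ/GGUF checkpoint works
    quantization="awq",                      # or "gptq", "bitsandbytes", "gguf", ...
    max_model_len=4096,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The benchmarks later in the post compare essentially this kind of setup across the different quantization methods.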

Let's get started.