
2 posts tagged with "vLLM"


Deploying MiniMax M2.1 with vLLM: Complete Guide for Agentic Workloads

· 10 min read
Atharva Ingle
AI Engineer @E2E Networks

MiniMax M2.1 vLLM Deployment Cover

If you're building agentic applications or coding assistants, you've probably noticed that most open-source models fall short on tool calling and multi-step reasoning. MiniMax M2.1 changes that. Released on December 23, 2025, it's currently the strongest open-source model for agentic workloads, matching or beating Claude Sonnet on benchmarks like tau2-Bench, BrowseComp, and GAIA.

What makes M2.1 practical to deploy is its architecture. It's a Mixture-of-Experts model with 230 billion total parameters, but only 10 billion activate per forward pass. You get frontier-class performance on tool calling and software engineering tasks while running inference at a fraction of the compute. The model is MIT-licensed and works out of the box with Cline, Roo Code, OpenCode, and Claude Code.

This guide covers vLLM deployment, benchmarking, tool calling with M2.1's interleaved thinking feature, and integration with coding terminals.
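
As a rough preview of what the full guide walks through, here is a minimal sketch of a tool-calling request against a vLLM-served M2.1 endpoint using vLLM's OpenAI-compatible API. The base URL, model id, and the weather tool are illustrative assumptions, not the guide's exact setup.

```python
# Minimal sketch: tool calling against a vLLM OpenAI-compatible endpoint.
# The base_url, model id, and tool schema are placeholder assumptions;
# adjust them to match your actual deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default vLLM server address (assumed)
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

# A hypothetical weather-lookup tool, defined in the OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.1",  # assumed model id; use the id you actually deployed
    messages=[{"role": "user", "content": "What's the weather in Mumbai right now?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides to call the tool, the call appears here instead of plain text.
message = response.choices[0].message
print(message.tool_calls or message.content)
```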

Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference

· 34 min read
Jaydev Tonde
Data Scientist

Speculative Decoding vLLM Cover

Introduction

Ever waited for an AI chatbot to finish its answer, watching the text appear word by word? It can feel painfully slow, especially when you need a fast response from a powerful Large Language Model (LLM).

The root of the problem lies in how LLMs generate text. They don't write a paragraph all at once; they follow a strict, word-by-word process:

  1. The model looks at the prompt and the words it has generated so far.
  2. It calculates the best next word (token).
  3. It adds that word to the text.
  4. It repeats the whole process for the next word.

Each step involves complex calculations, meaning the more text you ask for, the longer the wait. For developers building real-time applications (like chatbots, code assistants, or RAG systems), this slowness (high latency) is a major problem.
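
To make that loop concrete, here is a minimal sketch of token-by-token greedy decoding with Hugging Face Transformers. The model below is just a small placeholder chosen so the example runs quickly; any causal LM behaves the same way.

```python
# Minimal sketch of the autoregressive loop described above, using greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, assumed only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

input_ids = tokenizer(
    "Speculative decoding speeds up inference by", return_tensors="pt"
).input_ids

with torch.no_grad():
    for _ in range(20):                       # generate 20 tokens, one at a time
        logits = model(input_ids).logits      # 1. look at the prompt + tokens so far
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # 2. pick the best next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # 3. append it to the text
        # 4. repeat: the next iteration runs another full forward pass

print(tokenizer.decode(input_ids[0]))
```

Every pass through this loop is a full forward computation over the whole sequence, which is exactly the per-token cost that speculative decoding sets out to reduce.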