Serving LLMs with Ollama and vLLM
Open-weight models like Qwen, Llama, and Mistral let you run LLMs locally on your own hardware. To self-host LLMs, you need an inference server that handles requests, manages GPU memory, and exposes an API your application can call. This tutorial covers the two most popular options: Ollama for quick experimentation and vLLM for production LLM deployment.
Both expose an OpenAI-compatible API, so the same client code works with either. We'll use Qwen2.5-7B as the example model.
The Two Most Popular Options
Ollama handles everything automatically. One command installs it, another downloads and runs a model. It manages quantization, memory allocation, and serving without configuration. The tradeoff is less control and lower throughput.
vLLM is the standard for production LLM inference. It uses PagedAttention to manage GPU memory efficiently and continuous batching to maximize throughput. You configure it explicitly, but you get much higher request throughput on the same hardware.
Setup
Create a JarvisLabs instance with an A100 40GB and the PyTorch template.
Ollama
Install with one command:
curl -fsSL https://ollama.com/install.sh | sh
Start the Ollama server in one terminal:
ollama serve
Then, in another terminal, run Qwen2.5:
ollama run qwen2.5:7b
The first run downloads the model (about 4-5GB for the quantized version). After that, you're in an interactive chat:

Press Ctrl+D to exit. Ollama keeps the server running in the background.
Ollama exposes an OpenAI-compatible API, so you can use the same code you'd use with OpenAI's API. Test it with curl:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'

You can manage models with a few simple commands:
ollama list # See downloaded models
ollama pull qwen2.5:7b # Download a model without running it
ollama rm qwen2.5:7b # Delete a model to free up space
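Ollama's OpenAI-compatible API also exposes a /v1/models route, so you can check what the server is serving from code. A minimal sketch, assuming the server is on the default port 11434 and the requests package is installed:
import requests

# Ask the Ollama server which models it exposes through its OpenAI-compatible API.
resp = requests.get("http://localhost:11434/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. qwen2.5:7b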
vLLM
vLLM is a single pip install:
pip install vllm
Start the server by specifying the model you want to serve. vLLM downloads it automatically from HuggingFace:
vllm serve Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
A few flags worth knowing:
- --dtype bfloat16 uses native precision for A100s
- --max-model-len 8192 caps sequence length to save memory
- --gpu-memory-utilization 0.9 reserves 90% of VRAM for the model + KV cache
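To get a feel for what the last two flags are managing, here's a rough back-of-envelope sketch of how the KV cache grows with sequence length. The layer, head, and dimension numbers below are illustrative placeholders, not Qwen2.5-7B's actual config (you'd read the real values from the model's config.json):
# Rough KV cache sizing: per token, every transformer layer stores one key and
# one value vector per KV head, in the serving dtype.
# The config numbers below are illustrative placeholders, NOT Qwen2.5-7B's real values.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = key + value

per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_seq = per_token * 8192  # one sequence at --max-model-len 8192
print(f"{per_token} bytes per token, ~{per_seq / 1e9:.2f} GB per max-length sequence")
If you hit out-of-memory errors at startup, lowering --max-model-len or --gpu-memory-utilization are the first knobs to try.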
Once you see "Uvicorn running on http://0.0.0.0:8000", the server is ready. Like Ollama, vLLM exposes an OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "What is vLLM?"}],
"max_tokens": 500
}'
You should see output that looks something like this:

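Before wiring an application to the endpoint, it's worth confirming the server is healthy and serving the model you expect. A small sketch, assuming the server is on port 8000 and the requests package is installed (recent vLLM releases expose a /health route alongside the OpenAI-style /v1/models):
import requests

# /health returns HTTP 200 once the vLLM server is ready for traffic.
print("health:", requests.get("http://localhost:8000/health").status_code)

# /v1/models lists what the server is serving, in OpenAI's list format.
for model in requests.get("http://localhost:8000/v1/models").json()["data"]:
    print("serving:", model["id"])  # expect Qwen/Qwen2.5-7B-Instruct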
Accessing Your API Externally
The curl commands above use localhost, which works when you're running them in the JupyterLab terminal on your JarvisLabs instance. But what if you want to:
- Access the API from your browser
- Connect from an application running on your local machine
- Share the endpoint with colleagues
JarvisLabs exposes port 6006 as a public API endpoint. Any service running on this port gets a public URL you can access from anywhere.
Ollama on Port 6006
Set the OLLAMA_HOST environment variable to change the port:
OLLAMA_HOST=0.0.0.0:6006 ollama serve
The ollama CLI also uses OLLAMA_HOST to find the server, so set it in the second terminal too before running your model:
OLLAMA_HOST=127.0.0.1:6006 ollama run qwen2.5:7b
vLLM on Port 6006
Change the --port flag:
vllm serve Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 6006 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
Getting Your Public URL
Once your server is running on port 6006, get the public URL from the JarvisLabs dashboard. Click the API button on your instance to see the endpoint:

Click the copy icon as seen above 👆. You can now use this URL from anywhere:
curl https://<your-instance-url>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
SSH Port Forwarding
The public URL works well for sharing with others, but sometimes you just want private access from your own machine. SSH port forwarding creates a tunnel that connects a port on your local machine to a port on the JarvisLabs instance. If you haven't set up SSH yet, follow the SSH setup guide first.

Get the SSH command from the dashboard by clicking the copy icon next to SSH:
ssh -o StrictHostKeyChecking=no -p 11114 root@sshd.jarvislabs.ai
Add -L 8000:localhost:8000 to forward port 8000. This means: forward my local port 8000 → localhost:8000 on the remote instance.
ssh -L 8000:localhost:8000 -o StrictHostKeyChecking=no -p 11114 root@sshd.jarvislabs.ai
Keep this terminal open. Start vLLM on port 8000 inside the instance (via JupyterLab or another terminal), and you can hit it from your local machine:

The model runs on JarvisLabs GPUs, but from your machine it's just localhost.
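For example, with the tunnel open you can send the same kind of request from your laptop; localhost:8000 now resolves through the tunnel to the instance. A quick sketch, assuming vLLM is serving Qwen2.5-7B-Instruct on port 8000 there and requests is installed locally:
import requests

# Runs on your local machine; the SSH tunnel forwards localhost:8000 to the instance.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello from my laptop!"}],
        "max_tokens": 50,
    },
)
print(resp.json()["choices"][0]["message"]["content"])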
Using the OpenAI Python Client
Since both Ollama and vLLM expose OpenAI-compatible APIs, you can use the official OpenAI Python client to interact with them. Point it at localhost when running code inside your JarvisLabs JupyterLab instance (or via SSH port forwarding as we learned above), or use your public API endpoint URL to connect from anywhere else:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # Or your JarvisLabs API endpoint URL
    api_key="dummy"  # Required by the client but ignored by local servers
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in one sentence."}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
You should see output that looks something like this:
Machine learning is a method of teaching computers to recognize patterns in data and make predictions or decisions without being explicitly programmed.
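The same client works against Ollama; only the base_url (Ollama's default port is 11434) and the model name change:
# Reuses the OpenAI import from above; only the endpoint and model tag differ.
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="dummy"
)

response = ollama_client.chat.completions.create(
    model="qwen2.5:7b",  # the tag you pulled earlier with ollama
    messages=[{"role": "user", "content": "Explain machine learning in one sentence."}],
    max_tokens=100
)
print(response.choices[0].message.content)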
Streaming responses work just like they do with OpenAI's API:
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about code"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
This makes it easy to swap between local models and OpenAI in your applications: just change the base_url, API key, and model name.
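For instance, pointing the same code at OpenAI's hosted API is mostly a matter of dropping the base_url override, supplying a real key, and picking a hosted model (a sketch; gpt-4o-mini is just one example):
import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # defaults to api.openai.com

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # any hosted OpenAI model
    messages=[{"role": "user", "content": "Write a haiku about code"}],
)
print(response.choices[0].message.content)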
Which One to Use
Ollama works best for local AI development and quick experiments. The automatic model management saves time when trying different models. It runs well on consumer hardware and requires no configuration.
vLLM makes sense for production deployments. PagedAttention and continuous batching provide significantly higher throughput under load. If you're building an API that serves multiple concurrent users, vLLM handles the load more efficiently.
For a single user running experiments, Ollama is simpler. For an application serving requests, vLLM scales better. When choosing between the two, weigh your throughput requirements against whether you value Ollama's hands-off model management or vLLM's production-grade performance.
Next Steps
You can start experimenting with this on JarvisLabs. We have A5000, A6000, A100, H100, and H200 GPUs available. Check the pricing page for current rates.
If you run into any issues or have questions, reach out and let us know.