Serverless

JarvisLabs Serverless lets you deploy an LLM as an autoscaling, OpenAI-compatible inference endpoint. You pick a model, framework and GPU; the platform provisions GPU workers, downloads the model to persistent storage, routes requests to healthy workers, and scales workers up and down with demand. You pay for workers only while they're running, plus storage for as long as the deployment exists.

Key features

OpenAI-compatible API — call chat/completions (and other v1 routes) with the OpenAI SDK or plain HTTP, including streaming responses.
Multiple frameworks — vLLM, SGLang and Ollama.
Autoscaling — workers scale between your min and max on demand, with optional scale-to-zero when idle.
Concurrent requests — set how many requests each worker handles at once, as many as the model and GPU can sustain.

How it works

When you create a deployment, the platform sets it up in a fixed order: storage first, then the model download, and only then the GPU workers that serve it.

Because the model is downloaded to persistent storage before any worker starts, scaled-up and replacement workers boot quickly — they use the model that's already on storage instead of downloading it again.

Once a deployment is running, each request is routed to a worker that has free capacity. If every worker is busy, the platform adds another worker (up to max_workers). Idle workers are removed after the idle timeout, while min_workers stay always-on.
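The scaling behaviour described above can be approximated with a little arithmetic. The sketch below is illustrative only and is not the platform's actual scheduler: the function name `desired_workers` and the assumption that load spreads evenly across workers are introduced here for the example.

```python
import math


def desired_workers(in_flight: int, concurrent_requests: int,
                    min_workers: int, max_workers: int) -> int:
    """Estimate how many workers `in_flight` requests would need.

    Hypothetical model: every worker handles `concurrent_requests` at once,
    and the result is clamped to the [min_workers, max_workers] range.
    """
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    needed = math.ceil(in_flight / concurrent_requests)
    return max(min_workers, min(needed, max_workers))


print(desired_workers(0, 8, 0, 4))    # 0  (scale-to-zero when idle)
print(desired_workers(9, 8, 1, 4))    # 2  (one worker is not enough)
print(desired_workers(100, 8, 1, 4))  # 4  (capped at max_workers)
```

Requests beyond what max_workers can absorb are not handled by this estimate; on the platform they wait for a free worker.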

Deployment and worker states

The dashboard shows the current state of each deployment and of every worker under it.

Deployment states

State	What it means
Starting	The deployment has been created and setup is beginning.
Filesystem created	The persistent file storage for the model has been created.
Downloading model	The model weights are being downloaded into the storage.
Model downloaded	The model is fully downloaded; workers are about to start.
Running	Setup is complete and the deployment is serving. Only deployments in the Running state accept inference requests.
Failed	Setup failed at one of the steps above; the reason is shown on the deployment.
Deleting	Deletion was requested; workers and storage are being torn down.
Cleaning	The system is tearing the deployment down (for example, after a setup failure).
Deleted	The deployment and its storage have been removed.

Worker states

Each worker is a GPU container with its own state, shown on the deployment's Workers tab.

State	What it means
Provisioning	The GPU worker is being created and its engine is starting up. It isn't serving yet.
Healthy	The worker passed its health check and is serving requests.
Failed	The worker couldn't start, or stopped responding to health checks. It's removed and, where needed, replaced.

Workers are also removed — not as a failure — when they're idle-reaped after the idle timeout, or when the deployment is deleted. min_workers are never idle-reaped.

Creating a deployment

Create a deployment from the dashboard:

Open the Serverless dashboard and click New deployment.
Choose a framework (vLLM, SGLang or Ollama).
Configure the deployment (see configuration below).
Click Create and wait for the deployment to reach Running.

Configuration

Setting	Description
Model	The model to serve. For vLLM/SGLang this is a Hugging Face repo id (e.g. `Qwen/Qwen2.5-7B-Instruct`); for Ollama it's a model tag (e.g. `qwen2.5:7b`).
GPU type & count	The GPU to run on, and how many GPUs per worker (for multi-GPU / larger models).
Min workers	Always-on workers, kept running even when idle for instant response. Set to `0` for scale-to-zero.
Max workers	Upper bound the platform will scale to under load.
Concurrent requests	How many requests a single worker handles at once before another worker is needed — as many as the model and GPU can sustain.
Idle timeout	How long an idle (non-min) worker stays up before it's reaped.
Wait time	How many seconds a request waits for a worker to pick it up before it times out.
Storage size	Size of the persistent file storage that holds the model.
Environment variables	Passed to the worker — e.g. `HF_TOKEN` for gated Hugging Face models.
Framework args	Extra arguments forwarded to the framework (e.g. `enforce-eager`, `tensor-parallel-size`).
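Before filling in the form, it can help to write the settings down and sanity-check them. The sketch below is a local convenience only: `DeploymentConfig` and its field names are hypothetical and do not correspond to any platform API. The checks encode only what the table states (the three supported frameworks, and min workers not exceeding max workers) plus the assumption that counts are positive.

```python
from dataclasses import dataclass, field


@dataclass
class DeploymentConfig:
    """Hypothetical local record of the dashboard settings (not a platform API)."""
    model: str
    framework: str = "vllm"        # "vllm", "sglang" or "ollama"
    gpu_count: int = 1
    min_workers: int = 0           # 0 means scale-to-zero
    max_workers: int = 1
    concurrent_requests: int = 1
    env: dict = field(default_factory=dict)

    def problems(self) -> list:
        """Return human-readable issues; an empty list means it looks sane."""
        issues = []
        if self.framework not in {"vllm", "sglang", "ollama"}:
            issues.append(f"unknown framework: {self.framework}")
        if not 0 <= self.min_workers <= self.max_workers:
            issues.append("min_workers must be between 0 and max_workers")
        if self.concurrent_requests < 1:
            issues.append("concurrent_requests must be at least 1")
        if self.gpu_count < 1:
            issues.append("gpu_count must be at least 1")
        return issues


cfg = DeploymentConfig(model="Qwen/Qwen2.5-7B-Instruct",
                       min_workers=2, max_workers=1)
print(cfg.problems())  # ['min_workers must be between 0 and max_workers']
```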

Gated models need a token

For a gated Hugging Face model, add your token as the HF_TOKEN environment variable on the deployment. Without it the model download fails with a Hugging Face authentication error. For a first deployment, prefer an open model such as Qwen/Qwen2.5-7B-Instruct (vLLM/SGLang) or qwen2.5:7b (Ollama).

Storage and billing

Storage

Each deployment gets its own persistent file storage, created when the deployment starts. The model is downloaded into this storage once and kept there for the life of the deployment. You choose the storage size when you create the deployment. Each worker also runs on a GPU with its own 50 GB disk.

Billing

Billing has two independent parts:

Workers — each worker is billed only while it is running. The charge covers both its GPU and the 50 GB disk that comes with it.
- Minimum workers stay up continuously, so they are billed continuously.
- Autoscaled workers are billed only for the time they are up. When traffic drops and an extra worker is idle-reaped, billing for that worker (GPU and disk) stops.
Persistent storage — the file storage that holds your model is billed continuously for as long as the deployment exists, regardless of how many workers are running. Even with min_workers = 0 (scale-to-zero), where you stop paying for workers when there's no traffic, you keep paying for this storage, because your model stays stored and ready to serve.

The dashboard shows the accrued cost for each deployment, so you can always see what a deployment is costing.
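As a rough worked example of the two-part model, the sketch below adds worker time to always-on storage. Every rate in it is a placeholder, and billing storage per GB-hour is an assumption made for the arithmetic; check the actual prices and units for your GPU type and region.

```python
def estimate_cost(worker_hours: float, worker_rate_per_hour: float,
                  storage_gb: float, storage_rate_per_gb_hour: float,
                  deployment_hours: float) -> float:
    """Two-part estimate: worker running time plus continuously billed storage.

    worker_hours     -- total hours workers were actually up (summed over workers)
    deployment_hours -- how long the deployment has existed
    All rates are hypothetical placeholders.
    """
    workers = worker_hours * worker_rate_per_hour
    storage = storage_gb * storage_rate_per_gb_hour * deployment_hours
    return workers + storage


# Scale-to-zero deployment that existed for 30 days (720 h) but only had
# workers running for 40 h in total.
total = estimate_cost(worker_hours=40, worker_rate_per_hour=2.0,
                      storage_gb=100, storage_rate_per_gb_hour=0.0001,
                      deployment_hours=720)
print(f"{total:.2f}")  # 87.20
```

Setting worker_hours to 0 shows the storage-only charge that remains when a scale-to-zero deployment receives no traffic.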

Authentication

API key required

Every request needs a JarvisLabs API key, sent as a bearer token. Generate one from your API settings and send it as Authorization: Bearer <your_api_key>.

Making API requests

The recommended way to call a deployment is the OpenAI-compatible endpoint, which responds synchronously and supports streaming.

The OpenAI base URL for a deployment is:

https://<region-base-url>/openai/{deployment_id}/v1
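Two small helpers can keep the URL pattern and the bearer header in one place. This is a minimal sketch: the function names and the `JARVISLABS_API_KEY` environment variable are assumptions made for the example, not something the platform defines.

```python
import os
from typing import Optional


def openai_base_url(region_base_url: str, deployment_id: str) -> str:
    """Build https://<region-base-url>/openai/{deployment_id}/v1."""
    host = region_base_url.strip().rstrip("/")
    if host.startswith("https://"):
        host = host[len("https://"):]
    return f"https://{host}/openai/{deployment_id}/v1"


def auth_headers(api_key: Optional[str] = None) -> dict:
    """Return the Authorization header, reading the key from the
    (hypothetical) JARVISLABS_API_KEY variable when none is passed."""
    key = api_key or os.environ.get("JARVISLABS_API_KEY")
    if not key:
        raise ValueError("no API key given and JARVISLABS_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}


print(openai_base_url("serverlessn.jarvislabs.net", "my-deployment-id"))
# https://serverlessn.jarvislabs.net/openai/my-deployment-id/v1
```

Reading the key from the environment keeps it out of source files and shell history.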

Using the OpenAI SDK

Point the OpenAI client's base_url at your deployment. The model you pass must match the model the deployment serves.

Non-streaming
from openai import OpenAI

# Replace {deployment_id} and <your_api_key> with your own values.
client = OpenAI(
    base_url="https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

Streaming
from openai import OpenAI

# Replace {deployment_id} and <your_api_key> with your own values.
client = OpenAI(
    base_url="https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1",
    api_key="<your_api_key>",
)

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me about the history of the moon."},
    ],
    stream=True,
)

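for chunk in stream:
    # Some chunks (for example a final usage chunk) may carry no choices.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # finish the output with a newline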

Using HTTP directly

cURL

Chat completion
curl https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1/chat/completions \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about the history of the moon."}
    ]
  }'

For a streaming response, add "stream": true to the body — the endpoint returns a text/event-stream of OpenAI-style chunks.

Python (requests)

Chat completion
import requests

resp = requests.post(
    "https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1/chat/completions",
    headers={"Authorization": "Bearer <your_api_key>"},
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Tell me about the history of the moon."},
        ],
    },
)
resp.raise_for_status()  # raise on 4xx/5xx instead of failing with a KeyError below
print(resp.json()["choices"][0]["message"]["content"])
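When "stream": true is set over plain HTTP, the text/event-stream body has to be parsed by the client. The sketch below assumes the usual OpenAI framing, where each event is a `data: <json>` line and the stream ends with `data: [DONE]`; the helper name `iter_stream_text` is made up for this example, and it is demonstrated on canned lines rather than a live connection.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # chunk without choices, nothing to print
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Hello
```

With the requests library, the same function can consume a live response by passing `stream=True` to `requests.post(...)` and feeding it `resp.iter_lines(decode_unicode=True)`.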

Async submit / fetch

If you'd rather not hold a connection open — or you're calling a non-chat route — use the asynchronous pattern. Submit returns a request id immediately; you then poll for the result. The payload to forward to the worker goes under an input key, and the URL path after the deployment id is the route hit on the worker.

Submit — returns a request id
curl -X POST \
  https://serverlessn.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "model": "Qwen/Qwen2.5-7B-Instruct",
      "messages": [{"role": "user", "content": "Hello"}]
    }
  }'

# Response
{ "id": "<request_id>" }

Fetch — poll until ready
curl https://serverlessn.jarvislabs.net/deployment/{deployment_id}/<request_id> \
  -H "Authorization: Bearer <your_api_key>"

# 202 while still processing; the completion object once ready.
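The fetch step is a polling loop: keep asking until the status is no longer 202. The sketch below keeps the HTTP call behind a `fetch` callable so the loop can be exercised without a network; `poll_result`, its parameters and the local timeout are assumptions for illustration, not part of the platform.

```python
import time
from typing import Any, Callable, Tuple


def poll_result(fetch: Callable[[], Tuple[int, Any]],
                interval: float = 1.0, timeout: float = 120.0,
                sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `fetch` until it stops returning HTTP 202, then return the body.

    `fetch` performs one GET and returns (status_code, parsed_body).
    Any status of 400 or above raises RuntimeError.
    """
    deadline = time.monotonic() + timeout
    while True:
        status, body = fetch()
        if status == 202:  # still processing
            if time.monotonic() >= deadline:
                raise TimeoutError("result not ready before the local timeout")
            sleep(interval)
            continue
        if status >= 400:
            raise RuntimeError(f"fetch failed with HTTP {status}: {body}")
        return body


# Canned replies standing in for two "not ready" polls and a final result.
replies = iter([(202, None), (202, None), (200, {"choices": []})])
print(poll_result(lambda: next(replies), sleep=lambda s: None))  # {'choices': []}
```

In real use, `fetch` would wrap a GET to the /deployment/{deployment_id}/<request_id> URL with the Authorization header and return the response's status code together with its parsed JSON body.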

Request outcomes

A successful request returns the framework's response (an OpenAI chat-completion object for chat/completions). Failures come back as a real HTTP error with a short, safe-to-show message:

HTTP	Meaning	Message
`400`	Invalid request — the framework rejected your parameters. The framework's own error body is relayed verbatim.	(framework error)
`404`	Deployment not found, not running, or unknown request id.	Deployment not found or not running
`408`	No response within the request window.	Request timeout
`429`	The deployment's queue is full.	Server busy, please retry
`502`	A worker errored or was unreachable.	Worker returned an error, please retry
`503`	No GPUs were available to scale up.	No GPUs available for scale-up, please retry
`504`	A worker is still cold-starting, or inference timed out.	A worker is still starting, please retry shortly
`500`	Unexpected internal error.	Request failed, please retry

Tip

Most failures are transient (a cold start, a momentarily full queue, or no free GPU at that instant). Retrying after a short delay is the right response to 408, 429, 502, 503 and 504. A 400 means the request body itself needs fixing.
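One way to apply this advice is a small retry wrapper with exponential backoff. This is a sketch under assumptions: the attempt count, delays and jitter are arbitrary choices, and `call_with_retry` is a hypothetical helper that takes any callable returning a (status_code, body) pair.

```python
import random
import time
from typing import Any, Callable, Tuple

# Status codes the tip above lists as worth retrying.
RETRYABLE = {408, 429, 502, 503, 504}


def call_with_retry(send: Callable[[], Tuple[int, Any]],
                    max_attempts: int = 5,
                    base_delay: float = 1.0,
                    max_delay: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, Any]:
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` performs one request and returns (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts:
            return status, body
        # Exponential backoff capped at max_delay, plus jitter so that many
        # clients do not retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


# Canned replies: two transient failures, then success.
replies = iter([(503, "busy"), (504, "starting"), (200, "ok")])
print(call_with_retry(lambda: next(replies), sleep=lambda s: None))  # (200, 'ok')
```

A 400 is returned immediately without retrying, since resending the same body cannot succeed.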

Monitoring a deployment

Open a deployment from the dashboard to see:

Overview — status, region, framework, model and cost.
Workers — current workers and their health.
Logs — live worker logs (useful while a deployment is starting up).
Usage — request and cost trends.

Managing a deployment

Edit — name, idle timeout and wait time can be changed on a running deployment. To change the model, GPU or worker counts, create a new deployment.
Delete — tears down all workers and the deployment's storage. This cannot be undone.

FAQ

Which frameworks are supported?

vLLM, SGLang and Ollama.

Can I use the OpenAI SDK?

Yes. Point the client's base_url at your deployment's /openai/{deployment_id}/v1 URL and pass your API key — chat/completions, streaming included, works as usual.

Do I pay when there's no traffic?

If min_workers is 0 (scale-to-zero), you stop paying for workers once they're idle and reaped — but you keep paying for storage. If min_workers is 1 or more, those workers stay up and are billed continuously.

Why is the first request after a quiet period slow?

That's a cold start: a worker has to boot before it can serve. The boot is quick because the model is already on storage, and once the worker is healthy, requests are fast.

How do I serve a gated Hugging Face model?

Add your token as the HF_TOKEN environment variable on the deployment.

What happens to a request that arrives while a worker is still starting?

It waits for the worker. If the worker isn't ready within the request window, you get a 504 response with the message "A worker is still starting, please retry shortly". Retry after a short delay.
