Serverless
JarvisLabs Serverless lets you deploy an LLM as an autoscaling, OpenAI-compatible inference endpoint. You pick a model, framework and GPU; the platform provisions GPU workers, downloads the model to persistent storage, routes requests to healthy workers, and scales workers up and down with demand. You pay for workers only while they're running, plus storage for as long as the deployment exists.
Key features
- OpenAI-compatible API: call chat/completions (and other v1 routes) with the OpenAI SDK or plain HTTP, including streaming responses.
- Multiple frameworks: vLLM, SGLang and Ollama.
- Autoscaling: workers scale between your min and max on demand, with optional scale-to-zero when idle.
- Concurrent requests: set how many requests each worker handles at once, as many as the model and GPU can sustain.
How it works
When you create a deployment, the platform sets it up in a fixed order: storage first, then the model, and only then the GPU workers that serve it.
Because the model is downloaded to persistent storage before any worker starts, scaled-up and replacement workers boot quickly — they use the model that's already on storage instead of downloading it again.
Once a deployment is running, each request is routed to a worker that has free
capacity. If every worker is busy, the platform adds another worker (up to
max_workers). Idle workers are removed after the idle timeout, while min_workers
stay always-on.
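The routing and scaling rules above can be pictured with a small toy model. This is only a sketch of the described behavior, not the platform's scheduler: the `Worker` class and the `route` and `reap_idle` functions are hypothetical names, and a real deployment queues a request (up to the wait time) instead of returning `None` when every worker is busy.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    in_flight: int = 0          # requests this worker is serving right now
    idle_seconds: float = 0.0   # time since it last picked up a request


def route(workers, concurrent_requests, max_workers):
    """Pick a worker with free capacity; add one if all are busy.

    Returns the index of the chosen worker, or None when every worker is
    busy and max_workers has been reached.
    """
    for i, w in enumerate(workers):
        if w.in_flight < concurrent_requests:
            w.in_flight += 1
            w.idle_seconds = 0.0
            return i
    if len(workers) < max_workers:
        workers.append(Worker(in_flight=1))
        return len(workers) - 1
    return None


def reap_idle(workers, min_workers, idle_timeout):
    """Drop idle workers past the timeout, never going below min_workers."""
    removable = len(workers) - min_workers
    survivors = []
    for w in workers:
        expired = w.in_flight == 0 and w.idle_seconds >= idle_timeout
        if expired and removable > 0:
            removable -= 1
        else:
            survivors.append(w)
    return survivors
```

With `concurrent_requests=2` and `max_workers=2`, the first two requests land on one worker, the third starts a second worker, and a fifth simultaneous request finds no free capacity.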
Deployment and worker states
The dashboard shows the current state of each deployment and of every worker under it.
Deployment states
| State | What it means |
|---|---|
| Starting | The deployment has been created and setup is beginning. |
| Filesystem created | The persistent file storage for the model has been created. |
| Downloading model | The model weights are being downloaded into the storage. |
| Model downloaded | The model is fully downloaded; workers are about to start. |
| Running | Setup is complete and the deployment is serving. Only running deployments accept inference requests. |
| Failed | Setup failed at one of the steps above; the reason is shown on the deployment. |
| Deleting | Deletion was requested; workers and storage are being torn down. |
| Cleaning | The system is tearing the deployment down (for example, after a setup failure). |
| Deleted | The deployment and its storage have been removed. |
Worker states
Each worker is a GPU container with its own state, shown on the deployment's Workers tab.
| State | What it means |
|---|---|
| Provisioning | The GPU worker is being created and its engine is starting up. It isn't serving yet. |
| Healthy | The worker passed its health check and is serving requests. |
| Failed | The worker couldn't start, or stopped responding to health checks. It's removed and, where needed, replaced. |
Workers are also removed — not as a failure — when they're idle-reaped after the idle
timeout, or when the deployment is deleted. min_workers are never idle-reaped.
Creating a deployment
Create a deployment from the dashboard:
- Open the Serverless dashboard and click New deployment.
- Choose a framework (vLLM, SGLang or Ollama).
- Configure the deployment (see configuration below).
- Click Create and wait for the deployment to reach Running.
Configuration
| Setting | Description |
|---|---|
| Model | The model to serve. For vLLM/SGLang this is a Hugging Face repo id (e.g. Qwen/Qwen2.5-7B-Instruct); for Ollama it's a model tag (e.g. qwen2.5:7b). |
| GPU type & count | The GPU to run on, and how many GPUs per worker (for multi-GPU / larger models). |
| Min workers | Always-on workers, kept running even when idle for instant response. Set to 0 for scale-to-zero. |
| Max workers | Upper bound the platform will scale to under load. |
| Concurrent requests | How many requests a single worker handles at once before another worker is needed — as many as the model and GPU can sustain. |
| Idle timeout | How long an idle (non-min) worker stays up before it's reaped. |
| Wait time | How many seconds a request waits for a busy worker to become free and pick it up before the request times out. |
| Storage size | Size of the persistent file storage that holds the model. |
| Environment variables | Passed to the worker — e.g. HF_TOKEN for gated Hugging Face models. |
| Framework args | Extra arguments forwarded to the framework (e.g. enforce-eager, tensor-parallel-size). |
For a gated Hugging Face model, add your token as the HF_TOKEN environment
variable on the deployment. Without it the model download fails with a Hugging Face
authentication error. For a first deployment, prefer an open model such as
Qwen/Qwen2.5-7B-Instruct (vLLM/SGLang) or qwen2.5:7b (Ollama).
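Before filling in the form, it can help to write the settings down and sanity-check them. The sketch below is purely illustrative: `DeploymentSettings` and its field names are hypothetical and mirror the dashboard settings, not a JarvisLabs API, and the consistency rules are assumptions drawn from the descriptions in the table.

```python
from dataclasses import dataclass, field


@dataclass
class DeploymentSettings:
    # Hypothetical field names that mirror the dashboard form.
    model: str                    # HF repo id (vLLM/SGLang) or Ollama tag
    framework: str                # "vllm", "sglang" or "ollama"
    min_workers: int              # 0 means scale-to-zero
    max_workers: int
    concurrent_requests: int      # per worker
    env: dict = field(default_factory=dict)

    def problems(self):
        """Return a list of issues; an empty list means the settings look consistent."""
        issues = []
        if self.framework not in {"vllm", "sglang", "ollama"}:
            issues.append("framework must be vllm, sglang or ollama")
        if self.min_workers < 0:
            issues.append("min_workers cannot be negative")
        if self.max_workers < max(1, self.min_workers):
            issues.append("max_workers must be at least 1 and not below min_workers")
        if self.concurrent_requests < 1:
            issues.append("concurrent_requests must be at least 1")
        return issues

    def peak_concurrency(self):
        """Most requests in flight at once when fully scaled up."""
        return self.max_workers * self.concurrent_requests


settings = DeploymentSettings(
    model="Qwen/Qwen2.5-7B-Instruct",
    framework="vllm",
    min_workers=0,
    max_workers=3,
    concurrent_requests=8,
)
print(settings.problems(), settings.peak_concurrency())
```

The `peak_concurrency` figure is a quick way to relate Max workers and Concurrent requests to the traffic you expect.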
Storage and billing
Storage
Each deployment gets its own persistent file storage, created when the deployment starts. The model is downloaded into this storage once and kept there for the life of the deployment. You choose the storage size when you create the deployment. Each worker also runs on a GPU with its own 50 GB disk.
Billing
Billing has two independent parts:
- Workers: each worker is billed only while it is running. The charge covers both its GPU and the 50 GB disk that comes with it; the two are billed together, only for as long as the worker exists.
  - Minimum workers stay up continuously, so they are billed continuously.
  - Autoscaled workers are billed only for the time they are up. When traffic drops and an extra worker is idle-reaped, billing for that worker (GPU and disk) stops.
- Persistent storage: the file storage that holds your model is billed continuously for as long as the deployment exists, regardless of how many workers are running. Even with min_workers = 0 (scale-to-zero), where you stop paying for workers when there's no traffic, you keep paying for this storage, because your model stays stored and ready to serve.
The dashboard shows the accrued cost for each deployment, so you can always see what a deployment is costing.
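As a rough way to reason about the two billing parts, the arithmetic can be sketched as below. The function name `estimate_cost` and every rate in the example are made-up placeholders, not JarvisLabs prices; check the dashboard for real figures.

```python
def estimate_cost(worker_hours, worker_rate_per_hour,
                  storage_gb, storage_rate_per_gb_hour, deployment_hours):
    """Split an estimate into the two independent billing parts.

    worker_hours is the total time workers were actually up (summed over
    all workers); deployment_hours is how long the deployment has existed.
    """
    workers = worker_hours * worker_rate_per_hour
    storage = storage_gb * storage_rate_per_gb_hour * deployment_hours
    return {"workers": workers, "storage": storage, "total": workers + storage}


# Placeholder numbers: a scale-to-zero deployment that existed for 720 hours
# but only had a worker up for 10 of them.
cost = estimate_cost(
    worker_hours=10, worker_rate_per_hour=2.0,
    storage_gb=100, storage_rate_per_gb_hour=0.0001,
    deployment_hours=720,
)
print(cost)
```

The storage term keeps growing with `deployment_hours` even when `worker_hours` stays flat, which is the scale-to-zero case described above.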
Authentication
Every request needs a JarvisLabs API key. Generate one from your API settings and
send it as a bearer token in the Authorization header:
Authorization: Bearer <your_api_key>.
Making API requests
The recommended way to call a deployment is the OpenAI-compatible endpoint, which responds synchronously and supports streaming.
The OpenAI base URL for a deployment is:
https://<region-base-url>/openai/{deployment_id}/v1
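A small helper can keep the URL and the auth header in one place. This is a minimal sketch: `openai_base_url` and `auth_headers` are hypothetical helper names, and the host you pass in is whatever region base URL your deployment shows.

```python
def openai_base_url(region_host, deployment_id):
    """Build the OpenAI-compatible base URL for a deployment."""
    return f"https://{region_host}/openai/{deployment_id}/v1"


def auth_headers(api_key):
    """Bearer-token header expected on every request."""
    return {"Authorization": f"Bearer {api_key}"}


# Placeholder values for illustration only.
print(openai_base_url("example.invalid", "my-deployment-id"))
print(auth_headers("my-api-key"))
```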
Using the OpenAI SDK
Point the OpenAI client's base_url at your deployment. The model you pass must match
the model the deployment serves.
Non-streaming:

```python
from openai import OpenAI

# Replace {deployment_id} with your deployment's id.
client = OpenAI(
    base_url="https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # must match the model the deployment serves
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Streaming:

```python
from openai import OpenAI

# Replace {deployment_id} with your deployment's id.
client = OpenAI(
    base_url="https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1",
    api_key="<your_api_key>",
)

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me about the history of the moon."},
    ],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no choices or no text.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Using HTTP directly
cURL:

```bash
curl https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1/chat/completions \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about the history of the moon."}
    ]
  }'
```

For a streaming response, add "stream": true to the body; the endpoint returns a
text/event-stream of OpenAI-style chunks.

Python (requests):

```python
import requests

# Replace {deployment_id} with your deployment's id.
resp = requests.post(
    "https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1/chat/completions",
    headers={"Authorization": "Bearer <your_api_key>"},
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Tell me about the history of the moon."},
        ],
    },
)
resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError below
print(resp.json()["choices"][0]["message"]["content"])
```
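If you stream over plain HTTP instead of the SDK, you have to read the event stream yourself. The sketch below assumes the usual OpenAI-style framing (lines of the form `data: {json}`, ending with `data: [DONE]`); `iter_stream_text` and `stream_chat` are hypothetical helper names, and the timeout value is an arbitrary choice.

```python
import json


def iter_stream_text(lines):
    """Yield text pieces from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(url, api_key, body):
    """POST a chat request with streaming enabled and yield text as it arrives."""
    import requests  # imported here so the parser above stays dependency-free

    with requests.post(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={**body, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        yield from iter_stream_text(resp.iter_lines())


# The parser can be exercised without a network call:
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

To use it against a deployment, pass the full chat/completions URL, your API key and the same body as in the non-streaming example to `stream_chat`, then print each piece as it is yielded.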
Async submit / fetch
If you'd rather not hold a connection open — or you're calling a non-chat route — use the
asynchronous pattern. Submit returns a request id immediately; you then poll for the
result. The payload to forward to the worker goes under an input key, and the URL path
after the deployment id is the route hit on the worker.
1. Submit:

```bash
curl -X POST \
  https://serverlessn.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "model": "Qwen/Qwen2.5-7B-Instruct",
      "messages": [{"role": "user", "content": "Hello"}]
    }
  }'
```

Response:

```json
{ "id": "<request_id>" }
```

2. Fetch:

```bash
curl https://serverlessn.jarvislabs.net/deployment/{deployment_id}/<request_id> \
  -H "Authorization: Bearer <your_api_key>"
# 202 while still processing; the completion object once ready.
```
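In a script, the fetch step usually becomes a polling loop. The sketch below is one way to write it, under a few assumptions: a finished request is taken to come back with a 2xx status other than 202, and the names `wait_for_result` and `make_fetch`, the polling interval and the attempt limit are all illustrative choices.

```python
import time


def wait_for_result(fetch, interval=1.0, max_attempts=60, sleep=time.sleep):
    """Poll until the result is ready.

    fetch is a zero-argument callable returning (status_code, body).
    202 means "still processing"; any other 2xx is treated as done.
    """
    for _ in range(max_attempts):
        status, body = fetch()
        if status == 202:
            sleep(interval)
            continue
        if 200 <= status < 300:
            return body
        raise RuntimeError(f"fetch failed with HTTP {status}: {body}")
    raise TimeoutError("result was not ready after polling")


def make_fetch(host, deployment_id, request_id, api_key):
    """Build a fetch callable for the GET /deployment/{id}/{request_id} route."""
    import requests  # imported lazily so wait_for_result has no dependencies

    url = f"https://{host}/deployment/{deployment_id}/{request_id}"

    def fetch():
        resp = requests.get(
            url, headers={"Authorization": f"Bearer {api_key}"}, timeout=30
        )
        try:
            body = resp.json()
        except ValueError:  # non-JSON or empty body
            body = resp.text
        return resp.status_code, body

    return fetch
```

Typical use is `wait_for_result(make_fetch(host, deployment_id, request_id, api_key))`, where `request_id` is the `id` returned by the submit call.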
Request outcomes
A successful request returns the framework's response (an OpenAI chat-completion object
for chat/completions). Failures come back as a real HTTP error with a short,
safe-to-show message:
| HTTP | Meaning | Message |
|---|---|---|
| 400 | Invalid request: the framework rejected your parameters. The framework's own error body is relayed verbatim. | (framework error) |
| 404 | Deployment not found, not running, or unknown request id. | Deployment not found or not running |
| 408 | No response within the request window. | Request timeout |
| 429 | The deployment's queue is full. | Server busy, please retry |
| 502 | A worker errored or was unreachable. | Worker returned an error, please retry |
| 503 | No GPUs were available to scale up. | No GPUs available for scale-up, please retry |
| 504 | A worker is still cold-starting, or inference timed out. | A worker is still starting, please retry shortly |
| 500 | Unexpected internal error. | Request failed, please retry |
Most failures are transient (a cold start, a momentarily full queue, or no free GPU at
that instant). Retrying after a short delay is the right response to 408, 429, 502,
503 and 504. A 400 means the request body itself needs fixing.
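The retry advice above can be wrapped in a small helper. This is a sketch, not a prescribed client: the name `call_with_retry`, the exponential backoff and the default attempt count are assumptions, and only the set of retryable status codes comes from the guidance above.

```python
import time

# Status codes the guidance above says are worth retrying.
RETRYABLE = {408, 429, 502, 503, 504}


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a zero-argument callable returning (status_code, body).
    Delays double after each retryable failure: base_delay, 2x, 4x, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE or last_attempt:
            return status, body
        sleep(base_delay * 2 ** attempt)
```

A 400 is returned to the caller immediately, since retrying an invalid request body will not help.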
Monitoring a deployment
Open a deployment from the dashboard to see:
- Overview — status, region, framework, model and cost.
- Workers — current workers and their health.
- Logs — live worker logs (useful while a deployment is starting up).
- Usage — request and cost trends.
Managing a deployment
- Edit — name, idle timeout and wait time can be changed on a running deployment. To change the model, GPU or worker counts, create a new deployment.
- Delete — tears down all workers and the deployment's storage. This cannot be undone.
FAQ
Which frameworks are supported?
vLLM, SGLang and Ollama.
Can I use the OpenAI SDK?
Yes. Point the client's base_url at your deployment's /openai/{deployment_id}/v1 URL
and pass your API key — chat/completions, streaming included, works as usual.
Do I pay when there's no traffic?
If min_workers is 0 (scale-to-zero), you stop paying for workers once they're idle and
reaped — but you keep paying for storage. If min_workers is 1 or more, those workers
stay up and are billed continuously.
Why is the first request after a quiet period slow?
That's a cold start — a worker has to boot before it can serve. Once a worker is healthy, requests are fast. Boots are quick because the model is already on storage.
How do I serve a gated Hugging Face model?
Add your token as the HF_TOKEN environment variable on the deployment.
What happens to a request that arrives while a worker is still starting?
It waits for the worker. If the worker isn't ready within the request window, you get a "worker is still starting, please retry shortly" response — just retry in a moment.