Serverless

JarvisLabs Serverless lets you deploy an LLM as an autoscaling, OpenAI-compatible inference endpoint. You pick a model, framework and GPU; the platform provisions GPU workers, downloads the model to persistent storage, routes requests to healthy workers, and scales workers up and down with demand. You pay for workers only while they're running, plus storage for as long as the deployment exists.

Key features

  • OpenAI-compatible API: call chat/completions (and other v1 routes) with the OpenAI SDK or plain HTTP, including streaming responses.
  • Multiple frameworks: vLLM, SGLang and Ollama.
  • Autoscaling: workers scale between your min and max on demand, with optional scale-to-zero when idle.
  • Concurrent requests: set how many requests each worker handles at once, up to what the model and GPU can sustain.

How it works

When you create a deployment, the platform sets it up in a fixed order: storage first, then the model, and only then the GPU workers that serve it.

Because the model is downloaded to persistent storage before any worker starts, scaled-up and replacement workers boot quickly — they use the model that's already on storage instead of downloading it again.

Once a deployment is running, each request is routed to a worker that has free capacity. If every worker is busy, the platform adds another worker (up to max_workers). Idle workers are removed after the idle timeout, while min_workers stay always-on.
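
The routing rule above can be sketched as a toy model. This is not the platform's implementation: the `Worker` class, the `route` function and their field names are hypothetical, and the sketch assumes a newly added worker can serve immediately, whereas a real worker has to start up first.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    """Hypothetical stand-in for a GPU worker; not a JarvisLabs type."""
    capacity: int       # the deployment's "concurrent requests" setting
    in_flight: int = 0  # requests currently being processed

    def has_free_slot(self) -> bool:
        return self.in_flight < self.capacity


def route(workers: List[Worker], max_workers: int, capacity: int) -> Optional[Worker]:
    """Pick a worker for one request, adding a worker if every one is busy.

    Returns None when all workers are busy and max_workers is reached;
    in that case the request has to wait.
    """
    for worker in workers:
        if worker.has_free_slot():
            worker.in_flight += 1
            return worker
    if len(workers) < max_workers:
        # Toy assumption: the new worker is ready straight away.
        new_worker = Worker(capacity=capacity, in_flight=1)
        workers.append(new_worker)
        return new_worker
    return None
```

With one worker of capacity 2 and `max_workers=2`, the first two requests land on the existing worker, the third triggers a scale-up, and a fifth concurrent request gets `None` until a slot frees up.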

Deployment and worker states

The dashboard shows the current state of each deployment and of every worker under it.

Deployment states

| State | What it means |
| --- | --- |
| Starting | The deployment has been created and setup is beginning. |
| Filesystem created | The persistent file storage for the model has been created. |
| Downloading model | The model weights are being downloaded into the storage. |
| Model downloaded | The model is fully downloaded; workers are about to start. |
| Running | Setup is complete and the deployment is serving. Only running deployments accept inference requests. |
| Failed | Setup failed at one of the steps above; the reason is shown on the deployment. |
| Deleting | Deletion was requested; workers and storage are being torn down. |
| Cleaning | The system is tearing the deployment down (for example, after a setup failure). |
| Deleted | The deployment and its storage have been removed. |

Worker states

Each worker is a GPU container with its own state, shown on the deployment's Workers tab.

| State | What it means |
| --- | --- |
| Provisioning | The GPU worker is being created and its engine is starting up. It isn't serving yet. |
| Healthy | The worker passed its health check and is serving requests. |
| Failed | The worker couldn't start, or stopped responding to health checks. It's removed and, where needed, replaced. |

Workers are also removed — not as a failure — when they're idle-reaped after the idle timeout, or when the deployment is deleted. min_workers are never idle-reaped.
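
As a rough illustration of the idle-reaping rule, the sketch below removes workers that have been idle for at least the idle timeout but never drops below `min_workers`. The function name and its inputs are hypothetical, and which workers count as the always-on ones is an assumption (here, the most recently active).

```python
from typing import List


def reap_idle(idle_seconds: List[float], min_workers: int, idle_timeout: float) -> List[float]:
    """Return the idle times of the workers that survive one reaping pass.

    idle_seconds holds, per worker, how long it has been idle (0 = busy).
    """
    # Visit the least-idle workers first so the always-on floor is filled
    # by the workers that were active most recently (an assumption).
    survivors: List[float] = []
    for idle in sorted(idle_seconds):
        if len(survivors) < min_workers or idle < idle_timeout:
            survivors.append(idle)
    return survivors
```

With `min_workers=0`, every worker past the timeout is removed, which is the scale-to-zero case.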

Creating a deployment

Create a deployment from the dashboard:

  1. Open the Serverless dashboard and click New deployment.
  2. Choose a framework (vLLM, SGLang or Ollama).
  3. Configure the deployment (see configuration below).
  4. Click Create and wait for the deployment to reach Running.

Configuration

| Setting | Description |
| --- | --- |
| Model | The model to serve. For vLLM/SGLang this is a Hugging Face repo id (e.g. Qwen/Qwen2.5-7B-Instruct); for Ollama it's a model tag (e.g. qwen2.5:7b). |
| GPU type & count | The GPU to run on, and how many GPUs per worker (for multi-GPU / larger models). |
| Min workers | Always-on workers, kept running even when idle for instant response. Set to 0 for scale-to-zero. |
| Max workers | Upper bound the platform will scale to under load. |
| Concurrent requests | How many requests a single worker handles at once before another worker is needed. Set it as high as the model and GPU can sustain. |
| Idle timeout | How long an idle (non-min) worker stays up before it's reaped. |
| Wait time | How many seconds a request waits for a worker to pick it up before it times out. |
| Storage size | Size of the persistent file storage that holds the model. |
| Environment variables | Passed to the worker, e.g. HF_TOKEN for gated Hugging Face models. |
| Framework args | Extra arguments forwarded to the framework (e.g. enforce-eager, tensor-parallel-size). |

Gated models need a token

For a gated Hugging Face model, add your token as the HF_TOKEN environment variable on the deployment. Without it the model download fails with a Hugging Face authentication error. For a first deployment, prefer an open model such as Qwen/Qwen2.5-7B-Instruct (vLLM/SGLang) or qwen2.5:7b (Ollama).
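
The worker settings in the configuration table interact: the number of requests a deployment can process at once is the worker count multiplied by the concurrent-requests setting. The sketch below makes that arithmetic explicit. `DeploymentConfig` and its field names are illustrative only (deployments are configured in the dashboard), and the validation rules are assumptions rather than documented platform limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    """Illustrative mirror of three dashboard settings; names are made up."""
    min_workers: int
    max_workers: int
    concurrent_requests: int

    def validate(self) -> None:
        # Assumed sanity checks; the platform may enforce different rules.
        if self.min_workers < 0:
            raise ValueError("min_workers cannot be negative")
        if self.max_workers < max(self.min_workers, 1):
            raise ValueError("max_workers must be at least 1 and at least min_workers")
        if self.concurrent_requests < 1:
            raise ValueError("concurrent_requests must be at least 1")

    def always_on_capacity(self) -> int:
        """Requests that can be served without waiting for a scale-up."""
        return self.min_workers * self.concurrent_requests

    def peak_capacity(self) -> int:
        """Requests in flight once the deployment has scaled to max_workers."""
        return self.max_workers * self.concurrent_requests
```

For example, `DeploymentConfig(min_workers=0, max_workers=4, concurrent_requests=8)` has no always-on capacity, so the first request pays a cold start, and it can process 32 requests at once when fully scaled.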

Storage and billing

Storage

Each deployment gets its own persistent file storage, created when the deployment starts. The model is downloaded into this storage once and kept there for the life of the deployment. You choose the storage size when you create the deployment. Each worker also runs on a GPU with its own 50 GB disk.

Billing

Billing has two independent parts:

  • Workers: each worker is billed only while it is running. The charge covers both its GPU and the 50 GB disk that comes with it.
    • Minimum workers stay up continuously, so they are billed continuously.
    • Autoscaled workers are billed only for the time they are up. When traffic drops and an extra worker is idle-reaped, billing for that worker (GPU and disk) stops.
  • Persistent storage: the file storage that holds your model is billed continuously for as long as the deployment exists, regardless of how many workers are running. Even with min_workers = 0 (scale-to-zero), where worker charges stop when there is no traffic, you keep paying for this storage, because the model stays stored and ready to serve.

The dashboard shows the accrued cost for each deployment, so you can always see what a deployment is costing.
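
The two billing parts can be combined into a back-of-the-envelope estimate. This is a sketch only: the function, its parameters and the rates in the example are placeholders, not JarvisLabs prices, and it assumes charges scale linearly with time.

```python
def estimate_cost(
    worker_hours: float,
    worker_rate_per_hour: float,
    storage_gb: float,
    storage_rate_per_gb_hour: float,
    deployment_hours: float,
) -> float:
    """Workers bill only while running; storage bills for the deployment's whole life."""
    worker_cost = worker_hours * worker_rate_per_hour
    storage_cost = storage_gb * storage_rate_per_gb_hour * deployment_hours
    return worker_cost + storage_cost


# Placeholder numbers: a scale-to-zero deployment that existed for 720 hours
# but only had a worker running for 40 of them.
total = estimate_cost(
    worker_hours=40,
    worker_rate_per_hour=2.0,         # hypothetical rate
    storage_gb=100,
    storage_rate_per_gb_hour=0.0001,  # hypothetical rate
    deployment_hours=720,
)
print(f"Estimated cost: {total:.2f}")
```

Setting `worker_hours=0` shows the scale-to-zero floor: the storage term remains even when no worker ran.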

Authentication

API key required

Every request needs a JarvisLabs API key, sent as a bearer token. Generate one from your API settings and send it as Authorization: Bearer <your_api_key>.

Making API requests

The recommended way to call a deployment is the OpenAI-compatible endpoint, which responds synchronously and supports streaming.

The OpenAI base URL for a deployment is:

https://<region-base-url>/openai/{deployment_id}/v1
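
As a small convenience, the base URL and the bearer header can be built from their parts. The helper names below are hypothetical, and the deployment id in the usage note is a placeholder.

```python
def openai_base_url(region_host: str, deployment_id: str) -> str:
    """Build https://<region-base-url>/openai/{deployment_id}/v1."""
    return f"https://{region_host.strip('/')}/openai/{deployment_id}/v1"


def auth_headers(api_key: str) -> dict:
    """Headers for a JSON request authenticated with a JarvisLabs API key."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

For instance, `openai_base_url("serverlessn.jarvislabs.net", "abc123")` gives the value to pass as `base_url` to the OpenAI client for a deployment whose id is `abc123`.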

Using the OpenAI SDK

Point the OpenAI client's base_url at your deployment. The model you pass must match the model the deployment serves.

Non-streaming
from openai import OpenAI

# Replace {deployment_id} with your deployment's id.
client = OpenAI(
    base_url="https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # must match the model the deployment serves
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

Using HTTP directly

Chat completion
curl https://serverlessn.jarvislabs.net/openai/{deployment_id}/v1/chat/completions \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about the history of the moon."}
    ]
  }'

For a streaming response, add "stream": true to the body — the endpoint returns a text/event-stream of OpenAI-style chunks.
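
If you read the stream over plain HTTP, each chunk has to be pulled out of the event stream yourself. The sketch below is a minimal parser under two assumptions that this page does not confirm for every framework: each event's JSON arrives on a single `data:` line, and the stream ends with the `data: [DONE]` sentinel used by the OpenAI API. The function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel (assumed, see above)
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Feed it the decoded lines of the HTTP response body and join or print the fragments as they arrive. With the OpenAI SDK none of this is needed: pass `stream=True` to `client.chat.completions.create(...)` and read `chunk.choices[0].delta.content` from each chunk you iterate over.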

Async submit / fetch

If you'd rather not hold a connection open — or you're calling a non-chat route — use the asynchronous pattern. Submit returns a request id immediately; you then poll for the result. The payload to forward to the worker goes under an input key, and the URL path after the deployment id is the route hit on the worker.

Submit — returns a request id
curl -X POST \
  https://serverlessn.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "model": "Qwen/Qwen2.5-7B-Instruct",
      "messages": [{"role": "user", "content": "Hello"}]
    }
  }'

# Response
{ "id": "<request_id>" }
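
The same submit request can be assembled in Python. The helper below is a hypothetical sketch that only builds the URL and body following the two rules stated above (payload under `input`, worker route after the deployment id). It does not send the request, and it leaves out the fetch step because the fetch route is not shown in this section.

```python
import json
from typing import Tuple


def build_async_submit(host: str, deployment_id: str, route: str, payload: dict) -> Tuple[str, bytes]:
    """Return the URL and JSON body for an async submit.

    The worker payload is wrapped under "input"; the path after the
    deployment id is the route that will be hit on the worker.
    """
    url = f"https://{host}/deployment/{deployment_id}/{route.lstrip('/')}"
    body = json.dumps({"input": payload}).encode("utf-8")
    return url, body
```

POST the returned body to the returned URL with the `Authorization: Bearer <your_api_key>` and `Content-Type: application/json` headers, then keep the `id` from the response for polling.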

Request outcomes

A successful request returns the framework's response (an OpenAI chat-completion object for chat/completions). Failures come back as a real HTTP error with a short, safe-to-show message:

| HTTP | Meaning | Message |
| --- | --- | --- |
| 400 | Invalid request: the framework rejected your parameters. The framework's own error body is relayed verbatim. | (framework error) |
| 404 | Deployment not found, not running, or unknown request id. | Deployment not found or not running |
| 408 | No response within the request window. | Request timeout |
| 429 | The deployment's queue is full. | Server busy, please retry |
| 502 | A worker errored or was unreachable. | Worker returned an error, please retry |
| 503 | No GPUs were available to scale up. | No GPUs available for scale-up, please retry |
| 504 | A worker is still cold-starting, or inference timed out. | A worker is still starting, please retry shortly |
| 500 | Unexpected internal error. | Request failed, please retry |

Tip

Most failures are transient (a cold start, a momentarily full queue, or no free GPU at that instant). Retrying after a short delay is the right response to 408, 429, 502, 503 and 504. A 400 means the request body itself needs fixing.
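
One way to apply this advice is a small retry wrapper with exponential backoff. This is a sketch, not an official client: `call_with_retry` and the `send` callable are hypothetical, and the attempt count and delays are arbitrary defaults, since the page only asks for "a short delay".

```python
import time
from typing import Callable, Tuple

# Statuses the tip above describes as worth retrying.
RETRYABLE = {408, 429, 502, 503, 504}


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send() performs one request and returns (http_status, body).
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay
        status, body = send()
    return status, body
```

A 400 is returned to the caller on the first attempt, since resending the same body would fail the same way. The `sleep` parameter exists so the delays can be replaced in tests.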

Monitoring a deployment

Open a deployment from the dashboard to see:

  • Overview — status, region, framework, model and cost.
  • Workers — current workers and their health.
  • Logs — live worker logs (useful while a deployment is starting up).
  • Usage — request and cost trends.

Managing a deployment

  • Edit — name, idle timeout and wait time can be changed on a running deployment. To change the model, GPU or worker counts, create a new deployment.
  • Delete — tears down all workers and the deployment's storage. This cannot be undone.

FAQ

Which frameworks are supported?

vLLM, SGLang and Ollama.

Can I use the OpenAI SDK?

Yes. Point the client's base_url at your deployment's /openai/{deployment_id}/v1 URL and pass your API key — chat/completions, streaming included, works as usual.

Do I pay when there's no traffic?

If min_workers is 0 (scale-to-zero), you stop paying for workers once they're idle and reaped — but you keep paying for storage. If min_workers is 1 or more, those workers stay up and are billed continuously.

Why is the first request after a quiet period slow?

That's a cold start: a worker has to boot before it can serve. The boot is shortened because the model is already on storage, and once a worker is healthy, requests are fast.

How do I serve a gated Hugging Face model?

Add your token as the HF_TOKEN environment variable on the deployment.

What happens to a request that arrives while a worker is still starting?

It waits for the worker. If the worker isn't ready within the request window, you get a "worker is still starting, please retry shortly" response — just retry in a moment.