Serverless

JarvisLabs Serverless lets you deploy AI models behind an API and scale them automatically with demand. This documentation guides you through using the service effectively.

Key Features

  • Currently supports the vLLM framework; more frameworks are coming soon.
  • Concurrent request handling (up to 100 concurrent requests per deployment if the underlying framework supports it)
  • Automatic scaling based on demand
  • Secure API access with authentication

Creating a Deployment

You can create a serverless deployment by using the dashboard.

  1. Visit JarvisLabs Serverless Dashboard
  2. Click on the framework of your choice
  3. Configure your deployment settings:
    • Select your model
    • Configure scaling options
    • Set up environment variables
  4. Click "Create" to launch your deployment

Deployment Configuration

When creating a deployment, you can specify the following parameters (an illustrative example follows this list):

  • Worker Scaling:

    • Minimum Workers: The number of workers that will always be running, even during periods of low activity. This ensures immediate response when requests come in.
    • Maximum Workers: The upper limit of workers that can be created to handle high load. The system will automatically scale between min and max workers based on demand.
    • GPUs per worker: Number of GPUs allocated to each worker
    • Specific GPU selection: Choose which GPU types to use
  • Framework Settings:

    • Model configuration (e.g., model name, enforce-eager mode)
    • Environment variables (e.g., Hugging Face token)
  • Resource Management:

    • Concurrent request limits
    • Idle timeout settings
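
As a purely illustrative summary, the sketch below collects these settings in one place. The field names and values are hypothetical and exist only to show how the options relate to each other; the actual configuration is done through the dashboard form, which is authoritative.

# Illustrative only -- these settings are chosen in the dashboard, not via code,
# and the field names below are hypothetical.
example_deployment = {
    "framework": "vLLM",
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "enforce_eager": True,           # framework setting, e.g. vLLM enforce-eager mode
    "min_workers": 1,                # always-running workers for immediate response
    "max_workers": 3,                # upper limit for scaling under load
    "gpus_per_worker": 1,
    "gpu_type": "RTX A6000",         # specific GPU selection (example value)
    "environment": {
        "HF_TOKEN": "<your Hugging Face token>",
    },
    "max_concurrent_requests": 100,  # per-deployment concurrency limit
    "idle_timeout_seconds": 300,     # example idle timeout before scaling down
}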

Security

  • All endpoints require API token authentication
  • Deployments are served through Cloudflare.

Making API Requests

API Key Required

Before making any API requests, you'll need an API key. You can generate one from your JarvisLabs API Settings.

The API can be accessed in multiple ways:

  1. Direct HTTP requests (async pattern with submit/fetch)
  2. OpenAI SDK (compatible with streaming)

Direct API Access

For direct API access, you can use HTTP requests:

1. Submit Request

curl -X 'POST' \
  'https://serverless.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer {your_api_key}' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": {
      "model": "meta-llama/Llama-3.2-3B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a beautiful story about the moon."}
      ]
    }
  }'

# Response
{
  "id": "{request_id}"
}

2. Fetch Response

curl -X 'GET' \
  'https://serverless.jarvislabs.net/deployment/{deployment_id}/{request_id}' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer {your_api_key}'

# Response
{
  "id": "chatcmpl-c9754530d60c44f3ad731e8287b1af7d",
  "object": "chat.completion",
  "created": 1745229877,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a small village...",
        "reasoning_content": null,
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 49,
    "total_tokens": 619,
    "completion_tokens": 570,
    "prompt_tokens_details": null
  }
}
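
If you want to script this submit/fetch pattern, here is a minimal sketch using Python's requests library. It assumes the endpoints and response shapes shown in the curl examples above; the polling loop, the two-second interval, and the check on the "object" field are illustrative assumptions, since the shape of an in-progress response is not shown here.

import time

import requests

BASE_URL = "https://serverless.jarvislabs.net"
DEPLOYMENT_ID = "{deployment_id}"   # your deployment ID
API_KEY = "{your_api_key}"          # generated in JarvisLabs API Settings

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# 1. Submit the request; the response contains a request ID, as in the curl example.
submit = requests.post(
    f"{BASE_URL}/deployment/{DEPLOYMENT_ID}/v1/chat/completions",
    headers=headers,
    json={
        "input": {
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Tell me a beautiful story about the moon."},
            ],
        }
    },
)
request_id = submit.json()["id"]

# 2. Fetch the result, polling until the completed chat.completion object is returned.
while True:
    result = requests.get(
        f"{BASE_URL}/deployment/{DEPLOYMENT_ID}/{request_id}",
        headers=headers,
    )
    body = result.json()
    if body.get("object") == "chat.completion":  # completed response, as shown above
        print(body["choices"][0]["message"]["content"])
        break
    time.sleep(2)  # assumed polling interval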

Using OpenAI SDK

You can use the OpenAI SDK for a more familiar interface, supporting both streaming and non-streaming responses.

Streaming Example
from openai import OpenAI

# Initialize client with your deployment
client = OpenAI(
    base_url=f"https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",
    api_key="{your_api_key}"
)

# Create streaming completion
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."}
    ],
    stream=True,
    temperature=0.7,
    top_p=0.9
)

# Process streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

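Non-Streaming Example

A minimal sketch of a non-streaming call, reusing the client from the streaming example above. It simply omits stream=True and reads the full message from the response; this follows standard OpenAI SDK usage rather than anything specific to JarvisLabs.

# Create a standard (non-streaming) completion with the same client
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."}
    ],
    temperature=0.7,
    top_p=0.9
)

# The complete assistant message is available once the request finishes
print(response.choices[0].message.content)
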
OpenAI SDK Features
  • Streaming Support: Get real-time responses as they're generated
  • Familiar Interface: Uses the same client interface as the OpenAI API
  • Parameter Control: Adjust temperature, top_p, and other generation parameters
  • Error Handling: Built-in error handling and retry mechanisms

Understanding Request States

When you make a request to your deployment, it will go through these states:

  1. Queued: Your request is in line to be processed
  2. Fetched: Your request has been picked up and is waiting for a worker
  3. Processing: A worker is actively processing your request
  4. Completed: Your request has been successfully processed
  5. Failed: Your request encountered an error
Note

Failed requests will include error details to help you understand what went wrong.
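
If you track these states in a client, a small helper like the sketch below can decide whether to keep polling and surface errors from failed requests. The fields in which the state and error details are reported are not shown in the fetch response above, so the "status" and "error" keys here are assumptions made purely for illustration.

# Sketch only: "status" and "error" are hypothetical field names, not documented keys.
TERMINAL_STATES = {"Completed", "Failed"}

def should_keep_polling(body: dict) -> bool:
    """Return True while the request is still Queued, Fetched, or Processing."""
    state = body.get("status")
    if state == "Failed":
        # Failed requests include error details to explain what went wrong.
        raise RuntimeError(body.get("error", "request failed"))
    return state is not None and state not in TERMINAL_STATES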