Serverless

JarvisLabs Serverless lets you deploy AI models behind an API and scale them automatically with demand. This documentation guides you through using the service effectively.

Key Features

  • Currently supports the vLLM framework; more frameworks are coming soon.
  • Concurrent request handling (up to 100 concurrent requests per deployment if the underlying framework supports it)
  • Automatic scaling based on demand
  • Secure API access with authentication

Creating a Deployment

You can create a serverless deployment by using the dashboard.

  1. Visit JarvisLabs Serverless Dashboard
  2. Click on the framework of your choice
  3. Configure your deployment settings:
    • Select your model
    • Configure scaling options
    • Set up environment variables
  4. Click "Create" to launch your deployment

Deployment Configuration

When creating a deployment, you can specify the following parameters (an illustrative example follows this list):

  • Worker Scaling:

    • Minimum Workers: The number of workers that will always be running, even during periods of low activity. This ensures immediate response when requests come in.
    • Maximum Workers: The upper limit of workers that can be created to handle high load. The system will automatically scale between min and max workers based on demand.
    • GPUs per worker: Number of GPUs allocated to each worker
    • Specific GPU selection: Choose which GPU types to use
  • Framework Settings:

    • Model configuration (e.g., model name, enforce-eager mode)
    • Environment variables (e.g., Hugging Face token)
  • Resource Management:

    • Concurrent request limits
    • Idle timeout settings
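
As a purely illustrative summary, the sketch below collects these settings in one place. The field names and values are hypothetical and exist only to show how the options relate to each other; the actual configuration is done through the dashboard form, which is authoritative.

# Illustrative only -- these settings are chosen in the dashboard, not via code,
# and the field names below are hypothetical.
example_deployment = {
    "framework": "vLLM",
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "enforce_eager": True,           # framework setting, e.g. vLLM enforce-eager mode
    "min_workers": 1,                # always-running workers for immediate response
    "max_workers": 3,                # upper limit for scaling under load
    "gpus_per_worker": 1,
    "gpu_type": "RTX A6000",         # specific GPU selection (example value)
    "environment": {
        "HF_TOKEN": "<your Hugging Face token>",
    },
    "max_concurrent_requests": 100,  # per-deployment concurrency limit
    "idle_timeout_seconds": 300,     # example idle timeout before scaling down
}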

Security

  • All endpoints require API token authentication
  • Deployments are served through Cloudflare.

Making API Requests

API Key Required

Before making any API requests, you'll need an API key. You can generate one from your JarvisLabs API Settings.

The API can be accessed in multiple ways:

  1. Direct HTTP requests (async pattern with submit/fetch)
  2. OpenAI SDK (compatible with streaming)

Direct API Access

For direct API access, you can use HTTP requests:

1. Submit Request

curl -X 'POST' \
  'https://serverless.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer {your_api_key}' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": {
      "model": "meta-llama/Llama-3.2-3B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a beautiful story about the moon."}
      ]
    }
  }'

# Response
{
  "id": "{request_id}"
}

2. Fetch Response

curl -X 'GET' \
  'https://serverless.jarvislabs.net/deployment/{deployment_id}/{request_id}' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer {your_api_key}'

# Response
{
  "id": "chatcmpl-c9754530d60c44f3ad731e8287b1af7d",
  "object": "chat.completion",
  "created": 1745229877,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a small village...",
        "reasoning_content": null,
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 49,
    "total_tokens": 619,
    "completion_tokens": 570,
    "prompt_tokens_details": null
  }
}
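
If you want to script this submit/fetch pattern, here is a minimal sketch using Python's requests library. It assumes the endpoints and response shapes shown in the curl examples above; the polling loop, the two-second interval, and the check on the "object" field are illustrative assumptions, since the shape of an in-progress response is not shown here.

import time

import requests

BASE_URL = "https://serverless.jarvislabs.net"
DEPLOYMENT_ID = "{deployment_id}"   # your deployment ID
API_KEY = "{your_api_key}"          # generated in JarvisLabs API Settings

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# 1. Submit the request; the response contains a request ID, as in the curl example.
submit = requests.post(
    f"{BASE_URL}/deployment/{DEPLOYMENT_ID}/v1/chat/completions",
    headers=headers,
    json={
        "input": {
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Tell me a beautiful story about the moon."},
            ],
        }
    },
)
request_id = submit.json()["id"]

# 2. Fetch the result, polling until the completed chat.completion object is returned.
while True:
    result = requests.get(
        f"{BASE_URL}/deployment/{DEPLOYMENT_ID}/{request_id}",
        headers=headers,
    )
    body = result.json()
    if body.get("object") == "chat.completion":  # completed response, as shown above
        print(body["choices"][0]["message"]["content"])
        break
    time.sleep(2)  # assumed polling interval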

Using OpenAI SDK

You can use the OpenAI SDK for a more familiar interface, supporting both streaming and non-streaming responses.

Streaming Example
from openai import OpenAI

# Initialize client with your deployment
client = OpenAI(
    base_url=f"https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",
    api_key="{your_api_key}"
)

# Create streaming completion
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."}
    ],
    stream=True,
    temperature=0.7,
    top_p=0.9
)

# Process streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

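Non-Streaming Example

A minimal sketch of a non-streaming call, reusing the client from the streaming example above. It simply omits stream=True and reads the full message from the response; this follows standard OpenAI SDK usage rather than anything specific to JarvisLabs.

# Create a standard (non-streaming) completion with the same client
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."}
    ],
    temperature=0.7,
    top_p=0.9
)

# The complete assistant message is available once the request finishes
print(response.choices[0].message.content)
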
OpenAI SDK Features
  • Streaming Support: Get real-time responses as they're generated
  • Familiar Interface: Uses the same client interface as the OpenAI API
  • Parameter Control: Adjust temperature, top_p, and other generation parameters
  • Error Handling: Built-in error handling and retry mechanisms

Understanding Request States

When you make a request to your deployment, it will go through these states:

  1. Queued: Your request is in line to be processed
  2. Fetched: Your request has been picked up and is waiting for a worker
  3. Processing: A worker is actively processing your request
  4. Completed: Your request has been successfully processed
  5. Failed: Your request encountered an error
Note

Failed requests will include error details to help you understand what went wrong.
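
If you track these states in a client, a small helper like the sketch below can decide whether to keep polling and surface errors from failed requests. The fields in which the state and error details are reported are not shown in the fetch response above, so the "status" and "error" keys here are assumptions made purely for illustration.

# Sketch only: "status" and "error" are hypothetical field names, not documented keys.
TERMINAL_STATES = {"Completed", "Failed"}

def should_keep_polling(body: dict) -> bool:
    """Return True while the request is still Queued, Fetched, or Processing."""
    state = body.get("status")
    if state == "Failed":
        # Failed requests include error details to explain what went wrong.
        raise RuntimeError(body.get("error", "request failed"))
    return state is not None and state not in TERMINAL_STATES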