Serverless
JarvisLabs Serverless lets you deploy and scale AI models efficiently. This documentation guides you through using the service effectively.
Key Features
- Currently supports the vLLM framework; more frameworks are coming soon
- Concurrent request handling (up to 100 concurrent requests per deployment if the underlying framework supports it)
- Automatic scaling based on demand
- Secure API access with authentication
Creating a Deployment
You can create a serverless deployment from the dashboard:
- Visit the JarvisLabs Serverless Dashboard
- Click on the framework of your choice
- Configure your deployment settings:
  - Select your model
  - Configure scaling options
  - Set up environment variables
- Click "Create" to launch your deployment
Deployment Configuration
When creating a deployment, you can specify these parameters (an illustrative sketch follows the list):
- Worker Scaling:
  - Minimum Workers: The number of workers that will always be running, even during periods of low activity. This ensures an immediate response when requests come in.
  - Maximum Workers: The upper limit of workers that can be created to handle high load. The system automatically scales between the minimum and maximum based on demand.
  - GPUs per worker: The number of GPUs allocated to each worker.
  - Specific GPU selection: Choose which GPU types to use.
- Framework Settings:
  - Model configuration (e.g., model name, enforce-eager mode)
  - Environment variables (e.g., Hugging Face token)
- Resource Management:
  - Concurrent request limits
  - Idle timeout settings
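Deployments are configured through the dashboard, so there is no config file to write; still, it can help to picture the options as a single structure. The sketch below is purely illustrative: every field name is hypothetical and simply mirrors the dashboard options above.

# Illustrative only: a hypothetical way to picture a deployment's settings.
# These field names mirror the dashboard options, not a real API schema.
deployment_config = {
    "framework": "vllm",
    "worker_scaling": {
        "min_workers": 1,        # always running, so requests get an immediate response
        "max_workers": 5,        # upper bound the autoscaler can grow to under load
        "gpus_per_worker": 1,
        "gpu_types": ["A100"],   # hypothetical GPU type label
    },
    "framework_settings": {
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "enforce_eager": True,                     # example vLLM option mentioned above
        "env": {"HF_TOKEN": "{your_hf_token}"},    # e.g., Hugging Face token
    },
    "resource_management": {
        "max_concurrent_requests": 100,   # per-deployment cap from Key Features
        "idle_timeout_seconds": 300,      # hypothetical value
    },
}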
Security
- All endpoints require API token authentication
- Deployments are served through Cloudflare
Making API Requests
Before making any API requests, you'll need an API key. You can generate one from your JarvisLabs API Settings.
The API can be accessed in multiple ways:
- Direct HTTP requests (async pattern with submit/fetch)
- OpenAI SDK (compatible with streaming)
Direct API Access
For direct API access, you can use HTTP requests from cURL or Python.

cURL:
# Submit a request (returns a request ID immediately)
curl -X 'POST' \
  'https://serverless.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer {your_api_key}' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": {
      "model": "meta-llama/Llama-3.2-3B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a beautiful story about the moon."}
      ]
    }
  }'

# Response
{
  "id": "{request_id}"
}

# Fetch the result using the request ID
curl -X 'GET' \
  'https://serverless.jarvislabs.net/deployment/{deployment_id}/{request_id}' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer {your_api_key}'

# Response
{
  "id": "chatcmpl-c9754530d60c44f3ad731e8287b1af7d",
  "object": "chat.completion",
  "created": 1745229877,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a small village...",
        "reasoning_content": null,
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 49,
    "total_tokens": 619,
    "completion_tokens": 570,
    "prompt_tokens_details": null
  }
}
Python:

import aiohttp
import json
import asyncio

async def submit_request(deployment_id, api_key):
    url = f'https://serverless.jarvislabs.net/deployment/{deployment_id}/v1/chat/completions'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    payload = {
        "input": {
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Tell me a beautiful story about the moon."}
            ]
        }
    }
    async with aiohttp.ClientSession() as session:
        # Submit the request and read back the request ID
        async with session.post(url, headers=headers, json=payload) as response:
            result = await response.json()
            request_id = result['id']
        # Fetch the response using the request ID
        fetch_url = f'https://serverless.jarvislabs.net/deployment/{deployment_id}/{request_id}'
        async with session.get(fetch_url, headers=headers) as response:
            return await response.json()

# Usage
async def main():
    deployment_id = '{deployment_id}'  # Replace with your deployment ID
    api_key = '{your_api_key}'         # Replace with your API key
    response = await submit_request(deployment_id, api_key)
    print(json.dumps(response, indent=2))

asyncio.run(main())
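Note that this example fetches the result immediately after submitting. If the deployment is scaling up or the queue is busy, the request may not be finished yet; see Understanding Request States below, and the polling sketch at the end of this page, for a more robust pattern.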
Using OpenAI SDK
You can use the OpenAI SDK for a more familiar interface, supporting both streaming and non-streaming responses.
Streaming:
from openai import OpenAI

# Initialize the client with your deployment
deployment_id = "{deployment_id}"  # Replace with your deployment ID
client = OpenAI(
    base_url=f"https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",
    api_key="{your_api_key}"       # Replace with your API key
)

# Create a streaming completion
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."}
    ],
    stream=True,
    temperature=0.7,
    top_p=0.9
)

# Process the streaming response as chunks arrive
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Non-streaming:

from openai import OpenAI

# Initialize the client with your deployment
deployment_id = "{deployment_id}"  # Replace with your deployment ID
client = OpenAI(
    base_url=f"https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",
    api_key="{your_api_key}"       # Replace with your API key
)

# Create a completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the history of the moon."}
    ],
    temperature=0.7
)

# Print the response
print(response.choices[0].message.content)
The OpenAI SDK route gives you:
- Streaming Support: Get real-time responses as they're generated
- Familiar Interface: Uses the same API shape as OpenAI's own endpoints
- Parameter Control: Adjust temperature, top_p, and other generation parameters
- Error Handling: Built-in error handling and retry mechanisms (see the sketch below)
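The retry and timeout behavior comes from the OpenAI Python SDK itself rather than anything JarvisLabs-specific: the client constructor accepts max_retries and timeout options. A minimal sketch, with the same placeholders as the examples above:

from openai import OpenAI

# The OpenAI Python SDK retries transient failures automatically;
# these constructor options tune that behavior.
client = OpenAI(
    base_url="https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",  # replace placeholder
    api_key="{your_api_key}",                                                 # replace placeholder
    max_retries=3,    # the SDK's default is 2 automatic retries
    timeout=60.0,     # seconds to wait before a request is considered failed
)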
Understanding Request States
When you make a request to your deployment, it will go through these states:
- Queued: Your request is in line to be processed
- Fetched: Your request has been picked up and is waiting for a worker
- Processing: A worker is actively processing your request
- Completed: Your request has been successfully processed
- Failed: Your request encountered an error
Failed requests will include error details to help you understand what went wrong.
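Because a fetch made right after submission can arrive while the request is still Queued or Processing, a simple polling loop is often useful. The response shape for an unfinished request isn't shown in the examples above, so the sketch below assumes a hypothetical status field carrying one of the states listed here; adapt the check to what your deployment actually returns.

import time
import requests

def wait_for_result(deployment_id, request_id, api_key, interval=2.0, max_wait=120.0):
    """Poll the fetch endpoint until the request leaves the queue.

    Assumes (hypothetically) that an unfinished request reports its state
    in a "status" field, while a finished one returns the chat completion
    shown earlier. Adjust the condition to the real response shape.
    """
    url = f'https://serverless.jarvislabs.net/deployment/{deployment_id}/{request_id}'
    headers = {'accept': 'application/json', 'Authorization': f'Bearer {api_key}'}
    deadline = time.time() + max_wait
    while time.time() < deadline:
        result = requests.get(url, headers=headers).json()
        status = result.get('status')        # hypothetical field, see note above
        if status in (None, 'Completed'):    # completed responses carry the payload
            return result
        if status == 'Failed':
            raise RuntimeError(f"Request failed: {result}")
        time.sleep(interval)                 # still Queued / Fetched / Processing
    raise TimeoutError("Request did not complete in time")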