Deploying Whisper Large V3 on JarvisLabs Serverless

Want to transcribe audio files at scale without managing infrastructure? Let's deploy OpenAI's Whisper Large V3 model on JarvisLabs Serverless. This tutorial will show you how to set up a production-ready audio transcription service in minutes.

Why Whisper Large V3?

Whisper Large V3 is OpenAI's latest speech recognition model, offering:

  • Support for 99 languages
  • Improved accuracy over previous versions
  • Better handling of accents and background noise
  • Zero-shot translation capabilities

Setting Up Your Deployment

  1. Head to the JarvisLabs Serverless Dashboard
  2. Click on "VLLM Framework"
  3. Configure your deployment with these settings:
    • Model: openai/whisper-large-v3
    • Minimum Workers: 1
    • Maximum Workers: 3 (adjust based on your expected load)
    • GPUs per worker: 1
    • GPU Type: RTX 6000 Ada or A6000
    • Arguments: {"enforce-eager": true}
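
For reference, here is the same configuration captured as a small Python dict you could keep alongside your code. The key names simply mirror the dashboard fields and are illustrative, not an API payload; the Arguments entry appears to be passed through to the VLLM engine, where enforce-eager disables CUDA graph capture.

# Illustrative record of the dashboard settings above; key names are not an API schema.
deployment_config = {
    "model": "openai/whisper-large-v3",
    "min_workers": 1,
    "max_workers": 3,                       # adjust based on your expected load
    "gpus_per_worker": 1,
    "gpu_types": ["RTX 6000 Ada", "A6000"],
    "arguments": {"enforce-eager": True},   # forwarded to the VLLM engine
}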

Install the Required Python Packages

Before you run the examples below, install ffmpeg (pydub relies on it for decoding formats like MP3) and the following Python packages locally. Using uv is recommended for faster installs, but regular pip works just as well.


apt install ffmpeg

uv venv myenv --python=python3.12 --seed
source myenv/bin/activate

# Fast installer (recommended)
uv pip install openai librosa pydub

# If you already have the OpenAI package
uv pip install librosa pydub

# Classic pip alternative
pip install openai librosa pydub

If you want to test inside JupyterLab, run these additional steps.

uv pip install ipykernel
python -m ipykernel install --user --name=myenv --display-name "Python 3.12 (myenv)"

Testing Your Deployment

Before testing your deployment, you'll need an API key from JarvisLabs. This is different from an OpenAI API key. To create one:

  1. Visit JarvisLabs API Keys
  2. Click "Create API Key"
  3. Copy the generated key and keep it secure
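
As a quick sanity check, you can first list the models your deployment serves through the standard OpenAI SDK models endpoint. This assumes the VLLM-backed deployment exposes /v1/models (it normally does); replace the placeholders with your own values.

from openai import OpenAI

deployment_id = "YOUR_DEPLOYMENT_ID"  # Replace with your deployment ID
client = OpenAI(
    base_url=f"https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",
    api_key="YOUR_API_KEY",  # Replace with your JarvisLabs API key
)

# You should see openai/whisper-large-v3 in the output.
for model in client.models.list():
    print(model.id)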

Once you have your API key, you can test your deployment using the OpenAI SDK. Here's a complete example:

from openai import OpenAI

# Initialize the client
deployment_id = "YOUR_DEPLOYMENT_ID"  # Replace with your deployment ID
client = OpenAI(
    base_url=f"https://serverless.jarvislabs.net/openai/{deployment_id}/v1/",
    api_key="YOUR_API_KEY",  # Replace with your API key
)

# Transcribe an audio file
def transcribe_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="openai/whisper-large-v3",
            file=audio_file,
            response_format="text",
        )
    return response

# Example usage
audio_path = "path/to/your/audio.wav"
transcription = transcribe_audio(audio_path)
print(f"Transcription: {transcription}")

Important Note: Audio Length Limitation

Currently, VLLM has a 30-second limitation for audio transcription. This is a known limitation in the VLLM framework, as discussed in this GitHub issue. The VLLM team is working on implementing native chunking support, but for now, you'll need to implement a chunking strategy in your code. Here's a simple example of how to handle longer audio files:

import os

import librosa
import numpy as np
from pydub import AudioSegment

def chunk_audio(audio_path, chunk_length_ms=30000, overlap_ms=1000):
    """
    Split audio into overlapping chunks of 30 seconds.
    """
    audio = AudioSegment.from_file(audio_path)
    chunks = []

    for i in range(0, len(audio), chunk_length_ms - overlap_ms):
        chunk = audio[i:i + chunk_length_ms]
        chunks.append(chunk)

    return chunks

def transcribe_long_audio(audio_path):
    """
    Transcribe audio longer than 30 seconds.
    """
    chunks = chunk_audio(audio_path)
    transcriptions = []

    for i, chunk in enumerate(chunks):
        # Save chunk to a temporary file
        chunk_path = f"temp_chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")

        # Transcribe the chunk
        with open(chunk_path, "rb") as audio_file:
            response = client.audio.transcriptions.create(
                model="openai/whisper-large-v3",
                file=audio_file,
                response_format="text",
            )
        transcriptions.append(response)

        # Clean up the temporary file
        os.remove(chunk_path)

    # Combine transcriptions (you might want to implement smarter merging logic)
    return " ".join(transcriptions)

# Example usage
transcription = transcribe_long_audio("path/to/your/audio.mp3")
print(transcription)

Note: The chunking strategy above is a basic implementation. For production use, you might want to:

  • Implement silence detection for more natural chunk boundaries (see the sketch after this note)
  • Add overlap handling to avoid cutting words
  • Implement smarter merging logic for the final transcription
  • Add error handling and retry logic

Note: The VLLM team is actively working on implementing native chunking support. Once available, this will provide a more robust solution for handling longer audio files. You can track the progress in the GitHub issue.
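
As a rough sketch of the silence-detection idea mentioned above, you could split on quiet gaps with pydub's split_on_silence and then re-pack the pieces so each chunk stays under 30 seconds. The threshold and duration values below are assumptions to tune for your recordings, and a single silence-free stretch longer than 30 seconds would still need the fixed-size chunking shown earlier.

from pydub import AudioSegment
from pydub.silence import split_on_silence

def chunk_on_silence(audio_path, max_chunk_ms=30000):
    """Split audio at silent gaps, keeping each chunk under the 30-second limit."""
    audio = AudioSegment.from_file(audio_path)

    # Split wherever at least 500 ms of audio sits ~14 dB below the average level.
    # These values are illustrative; tune them for your audio.
    pieces = split_on_silence(
        audio,
        min_silence_len=500,
        silence_thresh=audio.dBFS - 14,
        keep_silence=200,
    )

    # Re-pack the pieces into chunks no longer than max_chunk_ms.
    chunks, current = [], AudioSegment.empty()
    for piece in pieces:
        if len(current) > 0 and len(current) + len(piece) > max_chunk_ms:
            chunks.append(current)
            current = AudioSegment.empty()
        current += piece
    if len(current) > 0:
        chunks.append(current)
    return chunks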

Best Practices

  1. File Format: While Whisper supports various formats, WAV files typically give the best results (a conversion sketch follows below)
  2. Audio Quality: Ensure your audio is clear and has minimal background noise
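
If your source files are in a compressed format such as MP3, one way to follow the file-format recommendation is to convert them to mono 16 kHz WAV with pydub before uploading. The sample rate and channel settings below are common conventions for speech models, not JarvisLabs requirements.

from pydub import AudioSegment

def convert_to_wav(input_path, output_path="converted.wav"):
    """Convert any ffmpeg-readable audio file to 16 kHz mono WAV."""
    audio = AudioSegment.from_file(input_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(output_path, format="wav")
    return output_path

# Example usage with the transcribe_audio helper defined earlier
wav_path = convert_to_wav("path/to/your/audio.mp3")
transcription = transcribe_audio(wav_path)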