
How to Run PrismAudio on JarvisLabs

· 8 min read
Vishnu Subramanian
Founder @JarvisLabs.ai

PrismAudio just dropped. It's a 518M parameter Video-to-Audio model accepted at ICLR 2026 that generates synchronized audio from silent video. Give it a clip of someone drumming on water bottles, and it produces the sound of tapping and splashing. Benchmark inference is 0.63 seconds, faster than both MMAudio (1.30s) and ThinkSound (1.07s).

We ran it on a JarvisLabs A100. Here's how we got it working, the gotchas we hit along the way, and a clean recipe you can follow.

What is PrismAudio?

PrismAudio is the first framework to integrate Reinforcement Learning into Video-to-Audio generation. It decomposes reasoning into four specialized Chain-of-Thought modules:

  • Semantic - what sounds should exist
  • Temporal - timing and rhythm
  • Aesthetic - audio quality and clarity
  • Spatial - where sounds come from in the stereo field

Each module has its own reward function, trained with a technique called Fast-GRPO that uses hybrid ODE-SDE sampling to keep RL training overhead low.

It tops all baselines on VGGSound (CLAP, DeSync, PQ, and subjective MOS scores) and their new AudioCanvas benchmark. PrismAudio builds on the ThinkSound framework (NeurIPS 2025) but is smaller (518M vs 1.3B params) and faster (0.63s vs 1.07s benchmark inference).

Run your ML workloads on Jarvislabs

A100s, H100s, and H200s with per-minute billing. Pre-configured environments, 90-second startup, and no long-term commitments.

Get Started

Running PrismAudio on an A100

PrismAudio's feature extraction pipeline loads three large models simultaneously (T5-Gemma, VideoPrism, and Synchformer), so it needs a GPU with enough VRAM and system RAM. We went with an A100 in the IN2 region: 40GB VRAM, 112GB system RAM, $1.29/hr.

jl create --gpu A100 --region IN2 --name prismaudio

The instance was up in seconds. We cloned the repo, set up a clean environment with uv, and installed everything:

cd /home
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

# Create a virtual environment with uv
uv venv .venv --python 3.10
source .venv/bin/activate

# Install VideoPrism (Google's video encoder)
git clone https://github.com/google-deepmind/videoprism.git
cd videoprism && uv pip install . && cd ..

# Install all dependencies
uv pip install -r scripts/PrismAudio/setup/requirements.txt
uv pip install tensorflow-cpu==2.15.0
uv pip install facenet_pytorch==2.6.0 --no-deps

# Install FFmpeg system libraries (needed by torio)
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev

# Download model weights (5.8GB)
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts

Gotcha 1: HuggingFace Gated Model

Feature extraction died immediately:

GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/google/t5gemma-l-l-ul2-it
Access to model google/t5gemma-l-l-ul2-it is restricted.

PrismAudio uses Google's T5-Gemma as its text encoder for CoT descriptions. It's a gated model. You need to:

  1. Visit huggingface.co/google/t5gemma-l-l-ul2-it and accept the license
  2. Run huggingface-cli login with your token:

huggingface-cli login --token <your-hf-token>
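Before the first run, you can confirm credentials are actually in place without touching the network. By default `huggingface-cli login` writes the token to `~/.cache/huggingface/token`; an `HF_TOKEN` environment variable also works. A quick check (the messages here are our own, not PrismAudio output):

```shell
# Check for HuggingFace credentials: env var first, then the default token file.
if [ -n "${HF_TOKEN:-}" ] || [ -f "$HOME/.cache/huggingface/token" ]; then
  echo "HF credentials found"
else
  echo "no HF credentials yet - run: huggingface-cli login"
fi
```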

Gotcha 2: FFmpeg Library Path

After fixing auth, we hit another error:

ERROR - Error loading demo: Failed to initialize FFmpeg extension.
Tried versions: ['6', '5', '4', ''].

PrismAudio's requirements install PyAV 15, which bundles FFmpeg 7 internally. But torio (torchaudio's streaming decoder, used for video loading) resolves FFmpeg libraries separately via system paths. It needs FFmpeg 4-6 system libraries. Two different FFmpeg resolution mechanisms in the same project.

The fix is straightforward:

apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev
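To confirm the install gave torio something it can actually load, check what the dynamic linker sees. The shared-object versions map cleanly to FFmpeg releases — libavcodec.so.58, .59, and .60 correspond to FFmpeg 4, 5, and 6, the versions torio probes (the fallback message below is ours):

```shell
# List FFmpeg 4-6 libavcodec builds visible to the dynamic linker
# (so-versions torio tries: 58 -> FFmpeg 4, 59 -> FFmpeg 5, 60 -> FFmpeg 6)
ldconfig -p 2>/dev/null | grep -E 'libavcodec\.so\.(58|59|60)' \
  || echo "no FFmpeg 4-6 libavcodec found - torio will fail to initialize"
```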

The Result

With both gotchas resolved, everything ran cleanly:

  • Feature extraction: 74 seconds (loading T5-Gemma + VideoPrism + Synchformer, encoding all features)
  • Inference: 1.24 seconds wall-clock (24 diffusion steps, 518M param model)
  • Output: 1.4MB WAV file for an 8.3-second video

Predicting 1 samples with length 179 for ids: ['demo']
24it [00:01, 19.32it/s]
Execution time: 1.24 seconds
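The two numbers in that log are consistent with each other: 24 diffusion steps in 1.24 seconds of wall-clock is about 19.4 iterations per second, right where the tqdm rate lands:

```shell
# 24 diffusion steps over 1.24 s wall-clock ~= the ~19.3 it/s tqdm reports
awk 'BEGIN { printf "%.1f it/s\n", 24 / 1.24 }'
```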

Total compute cost for the entire experiment: under $1.
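That figure is easy to sanity-check against the per-minute billing. For example, a hypothetical 20-minute session at the A100 rate of $1.29/hr comes to:

```shell
# Per-minute billing: minutes * hourly rate / 60 (20 minutes is illustrative)
awk 'BEGIN { printf "$%.2f\n", 20 * 1.29 / 60 }'
```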

What We Learned

  1. Multi-model pipelines need system RAM. PrismAudio loads T5-Gemma, VideoPrism, and Synchformer simultaneously during feature extraction. The A100's 112GB of system RAM handles this comfortably. When choosing a GPU for multi-model workloads, check both VRAM and system RAM specs.

  2. FFmpeg version compatibility is real. PyAV and torio resolve FFmpeg libraries through different paths. Install the system FFmpeg libraries via apt-get and you're good.

  3. Gated models are a silent dependency. PrismAudio's README mentions downloading its own weights, but the T5-Gemma encoder is fetched at runtime from HuggingFace. You need to accept the license AND authenticate before your first run.

  4. GPU switching should be trivial. Need to try a different GPU? On JarvisLabs, just pause your instance and resume it with a different GPU. Same data, same setup, no re-configuring. jl pause, then resume from the dashboard or CLI.
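The first point is easy to verify up front on any Linux box. A minimal capacity check might look like this — the RAM threshold is up to your workload, and the GPU query is skipped gracefully on a machine without nvidia-smi:

```shell
# Report system RAM from /proc/meminfo (KiB -> GiB), then GPU VRAM if available.
ram_gb=$(awk '/^MemTotal/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
echo "System RAM: ${ram_gb} GB"
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found (not a GPU instance?)"
fi
```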


Quick Start Recipe

If you just want PrismAudio running, here's the clean path. Tested on JarvisLabs A100 (IN2 region).

Prerequisites

To install the CLI:

uv tool install jarvislabs
jl setup

Step 1: Create an A100 Instance

jl create --gpu A100 --region IN2 --name prismaudio

Tip: An A100 (40GB VRAM, 112GB RAM, $1.29/hr) is the sweet spot for PrismAudio. Need faster inference? You can always pause and switch to an H100 or H200 in the same region, keeping your data intact.

Step 2: Install Everything

SSH into the instance or use jl run:

# Clone PrismAudio
cd /home
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

# Create isolated environment with uv
uv venv .venv --python 3.10
source .venv/bin/activate

# Install VideoPrism
git clone https://github.com/google-deepmind/videoprism.git
cd videoprism && uv pip install . && cd ..

# Install dependencies
uv pip install -r scripts/PrismAudio/setup/requirements.txt
uv pip install tensorflow-cpu==2.15.0
uv pip install facenet_pytorch==2.6.0 --no-deps

# Install FFmpeg system libraries (needed by torio)
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev

# Login to HuggingFace (required for T5-Gemma)
uv pip install huggingface-hub[cli]
huggingface-cli login --token <your-hf-token>

# Download model weights (5.8GB)
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts

Step 3: Run Inference

source .venv/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2

# Prepare your video
mkdir -p videos cot_coarse results
cp /path/to/your/video.mp4 videos/demo.mp4

# Create CoT description
echo "id,caption_cot" > cot_coarse/cot.csv
echo 'demo,"Semantic: describe the sounds. Temporal: describe the rhythm. Aesthetic: describe the audio quality. Spatial: describe where sounds come from."' >> cot_coarse/cot.csv

# Extract features (~74 seconds)
torchrun --nproc_per_node=1 data_utils/prismaudio_data_process.py --inference_mode True

# Get video duration
DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 videos/demo.mp4)

# Run inference (~1.24 seconds)
python predict.py \
--model-config "PrismAudio/configs/model_configs/prismaudio.json" \
--duration-sec "$DURATION" \
--ckpt-dir "ckpts/prismaudio.ckpt" \
--results-dir "results"

Your generated audio is at results/MMDD_batch_size1/demo.wav.
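If feature extraction rejects your CoT file, the usual culprit is CSV quoting: caption_cot is a single quoted field that contains commas. A quick well-formedness check before running extraction — the path and caption text here are illustrative, not part of PrismAudio:

```shell
# Write a sample CoT CSV and verify header and row count before extraction.
mkdir -p /tmp/cot_check
cat > /tmp/cot_check/cot.csv <<'EOF'
id,caption_cot
demo,"Semantic: tapping and splashing. Temporal: steady rhythm. Aesthetic: clean, dry recording. Spatial: centered."
EOF
head -n1 /tmp/cot_check/cot.csv | grep -qx 'id,caption_cot' && echo "header OK"
[ "$(wc -l < /tmp/cot_check/cot.csv)" -eq 2 ] && echo "row count OK"
```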

Step 4: Launch the Gradio Web UI

For interactive use with a browser-based interface:

source .venv/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2
export GRADIO_TEMP_DIR=/tmp/gradio_temp
mkdir -p /tmp/gradio_temp

python app.py --server_name 0.0.0.0 --server_port 6006

The Gradio app loads all models at startup (~2 minutes), then serves on port 6006. On JarvisLabs, port 6006 is automatically exposed as an API endpoint through Cloudflare. Click the API button on your instance in the dashboard, then click API 1 to open the Gradio interface in your browser.

Upload any video, write a CoT description covering the four dimensions (Semantic, Temporal, Aesthetic, Spatial), and hit generate. Feature extraction takes ~60-70 seconds, inference takes ~1-2 seconds.

Step 5: Clean Up

When you're done, pause the instance to stop billing:

jl pause <instance-id>

Or destroy it entirely:

jl destroy <instance-id>


Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| GatedRepoError: 401 | T5-Gemma is a gated model | Accept the license at HuggingFace, then huggingface-cli login |
| Failed to initialize FFmpeg extension | torio resolves FFmpeg separately from PyAV; needs system FFmpeg 4-6 | apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev |
| Gradio KeyError: 'GRADIO_TEMP_DIR' | Missing env var | export GRADIO_TEMP_DIR=/tmp/gradio_temp && mkdir -p /tmp/gradio_temp |

Cost Summary

| GPU | VRAM | RAM | $/hr | Notes |
| --- | --- | --- | --- | --- |
| A100 | 40GB | 112GB | $1.29 | Recommended. Tested; 1.24s inference. |
| H100 | 80GB | 200GB | $2.69 | Faster inference if needed. |
| H200 | 141GB | 200GB | $3.80 | Maximum headroom. |

Total compute cost for this experiment on A100: under $1.


PrismAudio is one of the more impressive V2A models we've seen. 518M parameters, faster than anything else, and the four-dimensional CoT reasoning produces audio that actually matches the video. The model is open source under Apache 2.0 for research use.

Try it on JarvisLabs. Spin up an A100, follow the recipe above, and you'll have PrismAudio generating audio from your videos in under 15 minutes. Get started at jarvislabs.ai.
