
AI Videos, Music, and 3D Models from Your Terminal — ComfyUI on Cloud GPUs with Claude Code

9 min read
Vishnu Subramanian
Founder @JarvisLabs.ai
A step-by-step guide to running ComfyUI workflows on cloud GPUs using the JarvisLabs CLI and Claude Code. Generate videos from photos, music from text, and 3D models from images — across multiple GPUs in parallel — all without leaving your terminal.

ComfyUI on JarvisLabs with Claude Code

ComfyUI is one of the most powerful tools for AI content generation. It supports hundreds of workflows — image-to-video, text-to-music, photo-to-3D, image editing, and more — through a visual node-based interface.

But what if you could skip the UI entirely? What if you could tell an AI agent "generate a video from this photo on a cloud GPU" and have it handle the infrastructure, model downloads, workflow execution, and result delivery — all from your terminal?

That's exactly what we built. Using the JarvisLabs CLI, Claude Code, and a custom ComfyUI skill, you can:

  • Generate videos from photographs using Wan 2.2 (14B parameter model)
  • Create music with vocals from a text description using ACE-Step 1.5
  • Turn a single photo into a 3D model using Hunyuan3D 2.1
  • Run workflows across multiple GPUs in parallel with shared model storage
  • Download results to your laptop — videos, audio files, 3D models

All without opening a browser.

This guide walks you through the full setup and shows you how to do it yourself.

What you need

  • A JarvisLabs account (sign up here, then add funds to your wallet)
  • The JarvisLabs CLI (pip install jarvislabs)
  • Claude Code (install it from Anthropic)
  • The ComfyUI skill — a single file that teaches Claude how to run ComfyUI on JarvisLabs

Step 1: Install the JarvisLabs CLI

pip install jarvislabs
jl setup

During setup, authenticate with your API token and install the agent skill files. The base JarvisLabs skill teaches Claude Code how to create instances, run jobs, and manage GPUs. We'll add the ComfyUI skill on top of that.

Check that everything works:

jl status   # Shows your balance and running instances
jl gpus     # Shows available GPUs and pricing

Step 2: Install the ComfyUI skill

The ComfyUI skill is a single markdown file that teaches Claude Code how to set up and run any ComfyUI workflow on JarvisLabs. Download it into your Claude Code skills directory:

mkdir -p ~/.claude/skills/comfyui
curl -o ~/.claude/skills/comfyui/SKILL.md \
  https://raw.githubusercontent.com/jarvislabs-ai/jl-comfyui-skill/main/SKILL.md

That's it. Claude Code will automatically pick up the skill. You can now ask it to run ComfyUI workflows and it knows exactly what to do — which GPU to pick, how to install ComfyUI, where to download models, and how to submit workflows via the API.

Step 3: Generate your first video

Open Claude Code and ask:

Set up ComfyUI with Image to Video (Wan 2.2) on JarvisLabs

Claude will:

  1. Ask you a few questions — single instance or multiple? Which GPU?
  2. Create a GPU instance — picks the right GPU and storage based on the workflow
  3. Install ComfyUI in a clean virtual environment under /home/ComfyUI
  4. Read the workflow blueprint to find exactly which models are needed
  5. Download all models in parallel — for Wan 2.2 I2V, that's 6 models totaling ~37GB
  6. Start the ComfyUI server and give you the URL
Clean environment setup

The skill sets up ComfyUI in its own virtual environment, keeping it isolated from the system Python. This avoids version conflicts and makes the setup reproducible across different templates and GPU types.

Once ComfyUI is running, you can either use the web UI at the provided URL, or let Claude submit workflows via the API. To generate a video from a photo:

Generate a video from this cat photo — the cat should blink and turn its head

Claude will upload your image, build the Wan 2.2 I2V workflow, submit it to the ComfyUI API, wait for it to finish, and download the video to your laptop. On an A100-80GB, each 5-second video takes about 45 seconds.

How it works under the hood

The skill reads ComfyUI's built-in blueprint files to discover workflows. Each blueprint contains the full node graph and the model download URLs embedded in the node metadata. Claude parses this, downloads what's needed, and converts the blueprint into an API-compatible workflow:

# Claude reads the blueprint
jl exec <id> -- sh -lc 'cat "/home/ComfyUI/blueprints/Image to Video (Wan 2.2).json"'

# Extracts models and downloads them in parallel
jl exec <id> -- sh -lc 'cd /home/ComfyUI/models && \
wget -O diffusion_models/wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors "https://huggingface.co/..." & \
wget -O diffusion_models/wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors "https://huggingface.co/..." & \
wget -O text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors "https://huggingface.co/..." & \
wait'

# Submits the workflow via REST API
# POST http://localhost:6006/prompt with the node graph

This pattern — read blueprint, download models, submit via API — works for any ComfyUI workflow. The skill doesn't hardcode workflows; it discovers them.
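The final submission step maps onto ComfyUI's REST API: POST the node graph to /prompt and read back the prompt_id it assigns. Here is a minimal sketch (the submit_workflow and build_prompt_payload helpers are invented for illustration; /prompt, the {"prompt": ..., "client_id": ...} body, and the prompt_id response field are ComfyUI's own API):

```python
import json
import urllib.request

def build_prompt_payload(graph: dict, client_id: str = "jl-cli") -> bytes:
    """Wrap a ComfyUI node graph in the JSON body that POST /prompt expects."""
    return json.dumps({"prompt": graph, "client_id": client_id}).encode()

def submit_workflow(base_url: str, graph: dict) -> str:
    """POST the graph to a running ComfyUI server and return its prompt_id."""
    req = urllib.request.Request(
        f"{base_url}/prompt",
        data=build_prompt_payload(graph),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]
```

With the server from the guide running, `submit_workflow("http://localhost:6006", graph)` would queue the workflow and hand back an id you can use to look up results.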

Step 4: Try different workflows

The same approach works for any ComfyUI workflow. Here are some we tested:

Text to Music (ACE-Step 1.5)

Generate a 30-second lo-fi hip hop track with jazzy piano and chill beats

ACE-Step generates music with optional vocals. You can specify genre, key, BPM, and even provide lyrics with [verse] and [chorus] structure tags. Five 30-second tracks generate in under 60 seconds on an H100. The results download as FLAC files.

Photo to 3D Model (Hunyuan3D 2.1)

Turn this photo into a 3D model

Hunyuan3D takes a single photograph and generates a full 3D .glb model. Claude had never seen this workflow before — it read the blueprint, found the model (a single 6.9GB checkpoint), downloaded it, and generated a 3D model. First try, no errors, 60 seconds.

You can view .glb files with Quick Look on Mac (select the file, press Space) or drop them into any online 3D viewer.

The pattern generalizes

Every ComfyUI workflow follows the same pattern:

  1. Read the blueprint — extracts model URLs and node graph
  2. Download models — in parallel, verified by file size
  3. Build the API workflow — convert blueprint nodes to API format
  4. Submit and monitor — POST to /prompt, poll /queue until done
  5. Download results — videos, audio, images, 3D models

If ComfyUI has a blueprint for it, Claude can run it.

Step 5: Scale across multiple GPUs

This is where it gets powerful. Say you want to generate a batch of videos, or test a workflow on different GPUs. Instead of downloading 37GB of models to each instance, you use a JarvisLabs Filesystem — shared storage that any instance in the same region can access.

Create a filesystem and download models once

jl filesystem create --name comfyui-models --storage 100

Ask Claude to download the models to the filesystem instead of instance storage:

Download the Wan 2.2 models to the filesystem at /home/jl_fs/comfyui-models/

Spin up multiple instances with shared models

# Each instance gets the filesystem attached — models are already there
jl create --gpu H100 --storage 50 --template pytorch --fs-id <fs_id> --name "comfyui-1"
jl create --gpu H100 --storage 50 --template pytorch --fs-id <fs_id> --name "comfyui-2"
jl create --gpu H100 --storage 50 --template pytorch --fs-id <fs_id> --name "comfyui-3"

On each instance, ComfyUI symlinks to the models on the shared filesystem. No re-downloading:

# Instant — just symlinks, no data transfer
ln -sf /home/jl_fs/comfyui-models/diffusion_models/*.safetensors /home/ComfyUI/models/diffusion_models/

Each new instance goes from zero to running ComfyUI in about 2 minutes. Without the filesystem, model downloads alone would take 10-15 minutes per instance.

Performance

We benchmarked the filesystem against instance storage with a 4.5GB model on an H100:

|                               | Instance Storage | Shared Filesystem |
|-------------------------------|------------------|-------------------|
| First model load              | 1.71s            | 1.73s             |
| Shared across instances       | No               | Yes               |
| Survives instance destruction | No               | Yes               |

First-load performance is identical. The filesystem has no speed penalty — and the time you save by not re-downloading models across instances is significant.

Distribute work across instances

With five instances running, Claude distributes workflows across all of them:

# 20 prompts per instance, running in parallel
for i, instance_id in enumerate(instances):
    submit_batch(instance_id, prompts[i * 20:(i + 1) * 20])

We generated 100 images across five H100s in about 2 minutes. The same approach works for batching video generation, trying different prompts at scale, or testing workflows on different GPU types.
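The distribution loop can be fleshed out into runnable Python. Here, submit is a stand-in for whatever per-instance submission call you use (a hypothetical callable, not part of the CLI):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(prompts: list, size: int) -> list:
    """Split prompts into consecutive batches of at most `size` items."""
    return [prompts[i:i + size] for i in range(0, len(prompts), size)]

def distribute(instances: list, prompts: list, submit, per_instance: int = 20) -> None:
    """Pair each instance with one batch and fire the submissions concurrently."""
    with ThreadPoolExecutor(max_workers=max(len(instances), 1)) as pool:
        for instance_id, batch in zip(instances, chunk(prompts, per_instance)):
            pool.submit(submit, instance_id, batch)
```

For 100 prompts across five instances, each instance receives one batch of 20 and the submissions run in parallel threads.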

When you're done:

jl pause <id>     # Stop billing, keep data
jl destroy <id>   # Remove everything (filesystem persists)

How the skill works

The ComfyUI skill is a single file at ~/.claude/skills/comfyui/SKILL.md. It teaches Claude Code:

  • What to ask before setup — single instance vs. multi-instance, which workflow, GPU preferences
  • How to install ComfyUI — in a clean virtual environment, with the right dependencies
  • How to discover workflows — reading blueprint files for model URLs and node graphs
  • How to use the filesystem — when to recommend it, how to set up symlinks
  • Known workflows — pre-mapped model URLs and sizes for popular workflows like Wan 2.2, ACE-Step, and more

The skill is open source. You can extend it with new workflows, add your own model presets, or customize the setup for your team's needs.

Claude Code skills are just markdown files — they're instructions that Claude reads and follows. There's no code to install or dependencies to manage. Drop the file in the right directory and Claude gains the capability.
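As an illustration only (the frontmatter fields follow Anthropic's published skill format, but the body here is invented, not the actual skill's content), a minimal SKILL.md skeleton looks like:

```markdown
---
name: comfyui
description: Set up and run ComfyUI workflows on JarvisLabs cloud GPUs
---

# ComfyUI on JarvisLabs

When the user asks to run a ComfyUI workflow:
1. Ask which workflow they want, and whether they need one instance or several.
2. Create a GPU instance with jl create and install ComfyUI in a virtualenv.
3. Read the workflow blueprint, download the models it lists, start the server.
```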

Getting started

# 1. Install the CLI
pip install jarvislabs
jl setup

# 2. Install the ComfyUI skill
mkdir -p ~/.claude/skills/comfyui
curl -o ~/.claude/skills/comfyui/SKILL.md \
  https://raw.githubusercontent.com/jarvislabs-ai/jl-comfyui-skill/main/SKILL.md

# 3. Open Claude Code and go
# "Set up ComfyUI with Image to Video on JarvisLabs"
# "Generate a video from this photo of a sunset"
# "Create a 30-second jazz track"
# "Turn this product photo into a 3D model"

From your terminal to cloud GPUs to generated videos, music, and 3D models — without switching context.


Workflows we tested:

| Workflow          | Model              | GPU       | Time per output     |
|-------------------|--------------------|-----------|---------------------|
| Image to Video    | Wan 2.2 14B (fp8)  | A100-80GB | ~45s per 5s video   |
| Text to Music     | ACE-Step 1.5 Turbo | H100      | ~10s per 30s track  |
| Photo to 3D Model | Hunyuan3D 2.1      | H100      | ~60s per .glb model |
| Text to Image     | Z-Image-Turbo      | H100      | ~1.2s per 1024x1024 |

The ComfyUI skill is available at github.com/jarvislabs-ai/jl-comfyui-skill.