How We Made GPU Instance Launch 4x Faster
When we launched our new GPU region in Noida, India, spinning up a GPU instance took about 8 seconds. For a researcher clicking "Launch" and waiting for a notebook, that's annoying. But that's not the real reason we decided to fix it.
The way people use GPU instances is changing. Andrej Karpathy recently open-sourced autoresearch — a project where an AI agent autonomously runs dozens of training experiments, each one tweaking hyperparameters, architecture choices, and optimizer settings. The human writes the prompt. The agent iterates on the code. Every experiment is a separate GPU run.
This is the future we're building for. When an agent needs to spin up hundreds of experiments — launching instances, running training, tearing them down, and launching the next batch — 8 seconds per instance isn't just slow. It's a bottleneck that defeats the purpose of automation. A hundred experiments means over 13 minutes spent just waiting for instances to boot.
In about three days of focused work, we tore apart every layer of the instance creation pipeline — networking, storage, container runtime, database, even logging — and brought that number down to 1.8 seconds. A 4x improvement. At that speed, the same hundred experiments lose about 3 minutes to startup.
This post is the story of how we did it: what we measured, what surprised us, and the specific optimizations that got us from 8 seconds to under 2.
These optimizations are live in our new Noida region today. Our older regions still run the previous architecture, but we'll be deprecating those over time. Every future region we launch will ship with these improvements from day one.
And we're not done — our goal is to push launch times under a second in the coming weeks.
What happens when you click "Launch"
Before diving into the optimizations, it helps to understand what actually happens when you launch a GPU instance on JarvisLabs. Behind that single button click, the backend runs through a pipeline:
- Allocate resources — find an available GPU server and assign an IP address
- Create storage — provision a block storage volume, format it, and mount it on the server
- Launch the container — start a container with GPU access and your environment pre-configured
- Set up networking — configure virtual networking for the container's interfaces and update routing tables
- Configure the reverse proxy — set up the proxy so you can reach your instance via HTTPS
- Start billing — record the session start and begin usage tracking
Each of these steps had its own overhead, and they were all running sequentially. Our job was to figure out which steps were actually slow — and whether they needed to be.
Step zero: making it measurable
The first thing we did wasn't an optimization at all — it was fixing our ability to see where time was being spent.
Our logs had second-level timestamps. That meant two operations that took 200ms and 1,800ms both showed up with the same timestamp. We were flying blind.
We made three changes before touching any real code:
Millisecond log timestamps. We upgraded every log formatter to include millisecond precision. This sounds trivial, but it was the single most important enabler for everything that followed. Suddenly we could see that one networking step took 3,180ms while another took 12ms.
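As an illustration, with Python's stdlib `logging` the upgrade is essentially a one-line formatter change (our production formatter may differ in the details):

```python
import logging

# Hypothetical formatter: appending ".%(msecs)03d" means two sub-second
# steps no longer collapse into the same timestamp.
formatter = logging.Formatter(
    fmt="%(asctime)s.%(msecs)03d %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
log = logging.getLogger("launch-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("networking step finished")  # e.g. "2025-01-15 12:00:00.318 INFO ..."
```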
An API response timer. We added a middleware that measures the total time for every API request and returns it in a response header. This gave us a single, reliable number to track: how long does the create-instance API call actually take, end to end?
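A framework-agnostic sketch of that middleware — the handler shape and the header name here are illustrative choices, not our real API; the production version hooks into the web framework's request cycle:

```python
import time
from functools import wraps

# Sketch: wrap a handler, measure wall-clock time, attach it as a
# response header. Handler signature and header name are assumptions.
def timed(handler):
    @wraps(handler)
    def wrapper(request):
        start = time.perf_counter()
        body, headers = handler(request)
        headers["X-Response-Time-Ms"] = f"{(time.perf_counter() - start) * 1000:.1f}"
        return body, headers
    return wrapper

@timed
def create_instance(request):
    # stand-in for the real launch pipeline
    return "created", {}
```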
Step-by-step instrumentation. We added timing markers throughout the instance creation pipeline — after each major step — so we could see exactly how many milliseconds each phase consumed. This turned a single "8 seconds" number into a detailed breakdown we could act on.
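The timing markers can be as simple as a context manager around each phase — a minimal sketch, with names of our choosing:

```python
import time
from contextlib import contextmanager

# Sketch of per-step instrumentation: wrap each pipeline phase and
# record how many milliseconds it consumed.
timings = {}

@contextmanager
def step(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

with step("networking"):
    time.sleep(0.01)  # stand-in for real work
```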
With these in place, the picture became clear. Here's a rough breakdown of where the time was going (these numbers are approximate, reconstructed from our commit history and PR benchmarks — we didn't save the original profiling snapshot):
| Step | Time |
|---|---|
| Networking setup (including ARP) | ~3,500ms |
| Storage volume creation | ~1,500ms |
| Container launch | ~670ms |
| SSH connection overhead | ~930ms |
| Database operations | ~500ms |
| Proxy config + other | ~400ms |
| Total | ~7,500ms |
Some of these numbers were shocking. Nearly half the total time was spent in a single networking step. That's where we started.
The biggest win: a ping that was eating half our time
This was the most humbling discovery of the entire sprint.
When a new container comes online, it gets a fresh IP address. The network gateway doesn't know about this IP yet — its ARP cache has no mapping from this IP to the container's MAC address. So we send a ping to the gateway to trigger an ARP exchange, which teaches the gateway where to route traffic for this container.
The problem? We were running `ping -c 2 -W 2` — send two pings, wait up to two seconds for each reply. This single command was taking 3.18 seconds — roughly 40% of the entire instance creation time, spent waiting for a ping reply we didn't even need.
Here's the thing: we only need the ARP request to be sent. The gateway learns the MAC-to-IP mapping from the request itself. We don't need to wait for the reply. We don't even need to know if the reply came back.
The fix was two changes:
- Fire and forget. We made the network announcement non-blocking — the request itself is enough to update the gateway's routing table, so we don't need to wait for confirmation before moving on.
- Use the right tool. We switched from a general-purpose ping to a targeted network announcement. No unnecessary overhead, completes in under 10 milliseconds on a local network.
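In Python terms, the fire-and-forget pattern looks roughly like this. The `arping -U` invocation (unsolicited ARP, iputils flavor) is one way to send such an announcement on Linux — the actual tool and flags on our hosts may differ, and the `tool` parameter exists only so the sketch is testable:

```python
import subprocess

# Fire-and-forget sketch: Popen returns as soon as the process starts,
# so the launch pipeline never blocks waiting for a reply. The arping
# flags are an assumption, not necessarily our production command.
def announce_ip(ip: str, iface: str, tool: str = "arping") -> subprocess.Popen:
    return subprocess.Popen(
        [tool, "-c", "1", "-U", "-I", iface, ip],  # -U: unsolicited ARP
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```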
The result: instance creation dropped from ~8 seconds to ~5 seconds in a single commit. A 40% improvement from fixing one line of networking code.
The lesson here stuck with us for the rest of the week: the biggest bottleneck was the dumbest one. It wasn't a complex architectural problem. It was a blocking call that didn't need to block.
Death by a thousand round-trips
With the ARP fix behind us, the profiling data pointed to a different pattern: we were making too many remote calls to the GPU server, and each one carried overhead.
Every time the backend needed to do something on a GPU server — create a storage volume, attach a network interface, start a container — it opened an SSH connection, ran a command, and waited for the result. A single instance launch involved 12 to 15 separate SSH calls, each with its own connection overhead of around 150 to 200 milliseconds.
We attacked this in two phases.
Phase one: batching
The first move was straightforward — instead of making five separate calls for networking setup and five more for storage, we combined each group into a single call. One script with all the networking commands chained together. One script for all the storage operations.
This alone cut networking setup time from around 3,500ms down to 900ms, and saved another 600ms on the storage side. Fewer round-trips, same work.
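Conceptually, the batching change looks like this — the specific `ip` commands below are placeholders, not our actual networking setup:

```python
# Sketch of batching: one script per remote call instead of one call
# per command. The veth commands are illustrative placeholders.
NETWORK_SETUP = [
    "ip link add veth-{id} type veth peer name ceth-{id}",
    "ip addr add {ip}/24 dev veth-{id}",
    "ip link set veth-{id} up",
]

def build_batch_script(instance_id: str, ip: str) -> str:
    # "set -e" aborts at the first failure, preserving the per-command
    # error checking we had when each command was its own SSH call.
    commands = [c.format(id=instance_id, ip=ip) for c in NETWORK_SETUP]
    return "set -e\n" + "\n".join(commands)
```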
We also made a small but meaningful change to how we format new storage volumes. The formatting tool was running a cleanup operation that's useful for recycled disks but completely unnecessary for freshly created volumes — they're already zeroed out. Skipping that step shaved off more time.
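If the formatter in question is `mkfs.ext4`, the cleanup pass would be its default discard (TRIM) run, which can be skipped with `-E nodiscard` — a sketch under that assumption:

```python
# Sketch assuming ext4: mkfs.ext4 issues a discard (TRIM) pass by
# default, useful for recycled disks but pointless on freshly
# allocated, already-zeroed volumes.
def format_command(device: str, fresh: bool) -> list[str]:
    cmd = ["mkfs.ext4", "-q"]
    if fresh:
        cmd += ["-E", "nodiscard"]  # skip the cleanup pass on new volumes
    return cmd + [device]
```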
Phase two: eliminating SSH entirely
Batching helped, but there was still a fundamental problem: the SSH handshake itself. Before a single command could run, the backend spent around 930 milliseconds just establishing the encrypted connection.
So we built something new. We wrote a service that runs on each GPU server, handling storage, networking, and container lifecycle operations — much faster than establishing a new SSH connection for every operation.
We migrated every server-side operation — storage provisioning, network configuration, container lifecycle — from SSH to the worker API. Then we deleted all the SSH code from the instance creation path.
The SSH handshake dropped from 930 milliseconds to zero. The per-command overhead dropped from 150 milliseconds to nearly nothing. And because the worker is a proper service rather than ad-hoc shell scripts, we got better error handling for free — it can clean up partial failures automatically, something that was fragile over SSH.
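The shape of the worker is simple: a local service that maps operation names to handlers. A toy sketch of the dispatch layer — the operation names and payloads are ours, not the real API surface:

```python
import json

# Toy sketch of the on-server worker's dispatch layer. Operation names
# and payloads are illustrative.
HANDLERS = {}

def operation(name):
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@operation("volume.mount")
def mount_volume(params):
    # would run the local mount here — no SSH handshake involved
    return {"status": "ok", "volume": params["name"]}

def handle_request(body: str) -> str:
    req = json.loads(body)
    handler = HANDLERS[req["op"]]
    return json.dumps(handler(req.get("params", {})))
```

Because every handler runs locally on the GPU server, the control plane pays only the cost of one cheap HTTP call per operation.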
By the end of this phase, there wasn't a single SSH connection in the entire create, pause, or destroy path.
A faster container runtime
With networking and SSH out of the way, the container launch itself became the next obvious target. Starting a GPU container was taking around 670 milliseconds — not huge in absolute terms, but at this point we were chasing every hundred milliseconds.
The bottleneck was how GPU access gets wired into containers. The standard approach uses a runtime hook that probes the system's GPU devices every time a container starts. It figures out which GPUs are available, what drivers are loaded, and how to expose them — all at launch time, every single time.
But our GPU servers don't change between launches. The same GPUs are always there, with the same drivers. Probing them on every container start is redundant work.
We made two changes:
- Switched the container runtime from a Go-based runtime to a C-based one. The C implementation has faster process initialization — it does less work at startup.
- Switched to static device specifications. Instead of probing GPUs at launch time, we pre-generate a device spec once per server that describes exactly which GPUs are available and how to access them. The container runtime reads this spec directly — no probing, no discovery, just "here are your GPUs."
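A static spec in the spirit of CDI (the Container Device Interface) might be generated once per server like this — the field names follow the CDI convention, but our real spec and device paths may differ:

```python
import json

# Generates a CDI-style static device spec, run once per server rather
# than probed at every container launch. Field names and paths are
# illustrative assumptions.
def build_gpu_spec(gpu_count: int) -> dict:
    return {
        "cdiVersion": "0.6.0",
        "kind": "nvidia.com/gpu",
        "devices": [
            {
                "name": str(i),
                "containerEdits": {
                    "deviceNodes": [{"path": f"/dev/nvidia{i}"}],
                },
            }
            for i in range(gpu_count)
        ],
    }

spec_json = json.dumps(build_gpu_spec(8), indent=2)  # written once, read at every launch
```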
Container launch dropped from 670ms to 340ms — a 49% improvement. Not the flashiest optimization, but at this stage of the sprint, shaving 330 milliseconds felt like a big deal.
Pre-cooking storage volumes
Creating a storage volume from scratch — allocate the block device, format it with a filesystem, then mount it — was still taking 1 to 2 seconds on the critical path. We'd already optimized the commands themselves, but the work is fundamentally slow. You're creating a new disk and laying down a filesystem. There's only so fast that can go.
So we asked a different question: why do it at launch time at all?
The answer was a volume pool. A background job runs on a schedule, pre-creating and pre-formatting storage volumes in advance. It maintains a pool of ready-to-use volumes — already allocated, already formatted, just sitting there waiting.
When a user launches an instance, we grab a volume from the pool. If the pool ever runs dry, we fall back to creating one on the spot — but under normal load, there's always a pre-made volume ready.
This turned a 1 to 2 second operation into a simple mount — around 200 to 300 milliseconds. The expensive work still happens, it just happens in the background, minutes or hours before anyone clicks "Launch."
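The pool logic itself is small — a minimal sketch with hypothetical names, using a queue for the pool:

```python
import queue

# Minimal sketch of the volume pool: a background job keeps it topped
# up; launch takes the fast path, with on-demand creation as fallback.
pool: queue.Queue = queue.Queue()

def replenish(create_volume, target: int):
    # background job: pre-create volumes until the pool hits its target
    while pool.qsize() < target:
        pool.put(create_volume())

def acquire(create_volume):
    try:
        return pool.get_nowait()   # fast path: pre-made, pre-formatted volume
    except queue.Empty:
        return create_volume()     # fallback: create one on the spot
```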
Sweating the small stuff
By this point, the big wins were behind us. But we were down to around 2.5 seconds and the goal was under 2. The remaining time was spread across a lot of small things — none of them dramatic on their own, but they added up.
Too many database commits. Every instance launch was committing to the database 4 to 5 times — after allocating the server, after creating the record, after setting up the container, after configuring the proxy. Each commit is a round-trip to the database. We restructured the code so that all the writes happen in a single transaction, committed once at the end. If anything fails, the whole thing rolls back cleanly. Fewer round-trips, and better reliability as a side effect.
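The single-transaction pattern, sketched with sqlite3 (our production database and schema are different — the tables here are stand-ins):

```python
import sqlite3

# Single-transaction sketch: all writes for one launch commit together,
# or roll back together. Schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instances (id TEXT);
    CREATE TABLE proxy_routes (instance_id TEXT);
""")

def record_launch(conn, instance_id: str):
    with conn:  # one commit on success, full rollback on any exception
        conn.execute("INSERT INTO instances VALUES (?)", (instance_id,))
        conn.execute("INSERT INTO proxy_routes VALUES (?)", (instance_id,))
```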
Billing on the critical path. Creating invoice records, charging the wallet, and starting usage tracking were all happening before the user got a response. None of that needs to block the launch. We moved billing to a background task — the instance starts immediately, and billing catches up a moment later.
Blocking log writes. Every log line during instance creation was a synchronous disk write. Under the high log volume of a launch sequence, this added measurable latency. We switched to an in-memory queue — log records get buffered and a background thread writes them to disk. The calling code never waits for I/O.
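Python's stdlib already ships this pattern as `QueueHandler` plus `QueueListener` — a sketch of the idea, with a list-capturing handler standing in for the real disk handler:

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

# Non-blocking logging sketch: callers enqueue records in memory and a
# background thread drains them to the real handler.
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

log_queue: queue.Queue = queue.Queue(-1)
listener = QueueListener(log_queue, ListHandler())
listener.start()

log = logging.getLogger("launch-async-demo")
log.addHandler(QueueHandler(log_queue))
log.setLevel(logging.INFO)
log.info("instance ready")  # returns immediately; no I/O on this thread
listener.stop()             # drains remaining records on shutdown
```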
Redundant database queries. A few utility functions were hitting the database every time they were called, even when the result hadn't changed. Simple caching — check if we already have the answer before querying — eliminated unnecessary round-trips.
A redundant validation step. Every reload ran a validation check whose result never changed. Eliminating it saved another 170 milliseconds.
None of these changes made for a great headline. But together, they shaved off the last 500 to 700 milliseconds we needed.
The results
Here's where we ended up:
| Metric | Before | After |
|---|---|---|
| Instance launch time | ~8 seconds | ~1.8 seconds |
| Remote calls to GPU server | 12–15 SSH connections | 3–4 internal API calls |
| Database commits per launch | 4–5 | 1 |
| Concurrent launches | Serialized (blocking locks) | Parallel |
The architecture changed too. SSH is completely gone from the instance lifecycle. Storage volumes are pre-created in the background. Billing runs asynchronously. The entire launch is a single database transaction. And every future region we build will start with this foundation.
What we learned
Measure first, always. We almost started optimizing the container runtime because it "felt slow." The profiling data showed us that a simple network ping was the real problem — something we would never have guessed. Upgrading our logs to millisecond precision was a ten-minute change that shaped everything that followed.
The biggest bottlenecks are often the simplest. A blocking ping that didn't need to block. A disk formatting step that could happen ahead of time. An SSH connection to a server that already had an HTTP API. None of these required clever algorithms or complex engineering. They just required looking at the data and asking "does this actually need to happen here?"
Batch first, then eliminate. When we saw 15 SSH round-trips, the instinct was to rip out SSH immediately. Instead, we batched first — combining multiple commands into single calls. This gave us an immediate win and bought us time to design the HTTP migration properly. The batching work also helped us understand exactly what operations the HTTP API needed to support.
Move work off the critical path. Billing, storage formatting, log writes — all real work that needs to happen, but none of it needs to happen between the user clicking "Launch" and seeing their instance ready. Anything that isn't strictly required for the instance to function can happen in the background.
What's next
We're at 1.8 seconds, but we're not done. Our goal is to push launch times under a second.
Two directions we're actively exploring: in-memory queuing to reduce database round-trips on the critical path, and alternatives to containers that avoid the overhead of distributing large images across servers. Either could be the change that gets us under a second.
If you want to see the difference, sign up and try it out. You can find me on X — I'd love to hear what you think.