We replaced H.264 streaming with JPEG screenshots (and it worked better)
Source: Hacker News
Part 2 of our video streaming saga.
Read Part 1: How we replaced WebRTC with WebSockets →
Let me tell you about the time we spent three months building a gorgeous, hardware‑accelerated, WebCodecs‑powered, 60 fps H.264 streaming pipeline over WebSockets…
…and then replaced it with grim | curl when the Wi‑Fi got a bit sketchy.
I wish I was joking.
We’re building Helix, an AI platform where autonomous coding agents work in cloud sandboxes. Users need to watch their AI assistants work. Think “screen share, but the thing being shared is a robot writing code.”
Last week we explained how we replaced WebRTC with a custom WebSocket streaming pipeline. This week: why that wasn’t enough.
The constraint that ruined everything
It has to work on enterprise networks.
You know what enterprise networks love?
- HTTP / HTTPS – Port 443. That’s it.
You know what enterprise networks hate?
- UDP – Blocked, deprioritized, dropped. “Security risk.”
- WebRTC – Requires TURN servers, which require UDP, which is blocked.
- Custom ports – Firewall says no.
- STUN/ICE – NAT traversal? In my corporate network? Absolutely not.
- Literally anything fun – Denied by policy.
We tried WebRTC first. It worked great in dev, great in our cloud, and even at an enterprise customer… until:
“The video doesn’t connect.”
checks network — Outbound UDP blocked. TURN server unreachable. ICE negotiation failing.
We could have fought this (set up TURN servers, configure proxies, work with IT), or we could accept reality: Everything must go through HTTPS on port 443.
Our pure‑WebSocket video pipeline
- H.264 encoding via GStreamer + VA‑API (hardware acceleration)
- Binary frames over WebSocket (L7 only, works through any proxy)
- WebCodecs API for hardware decoding in the browser
- 60 fps at 40 Mbps with sub‑100 ms latency
We were proud. We wrote Rust, TypeScript, our own binary protocol, and measured everything in microseconds.
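For the curious, the browser half of that pipeline is roughly this shape. (A simplified sketch, not our production code: the codec string, the URL, and the one-byte keyframe flag are stand-ins for our real binary protocol.)

// Sketch: hardware-accelerated H.264 decode in the browser with WebCodecs.
// The codec string, framing byte, and URL are illustrative, not our real protocol.
const canvas = document.querySelector('canvas')!
const ctx = canvas.getContext('2d')!

const decoder = new VideoDecoder({
  output: (frame) => {
    ctx.drawImage(frame, 0, 0, canvas.width, canvas.height)
    frame.close() // release the frame promptly so the decoder doesn't stall
  },
  error: (e) => console.error('decode error', e),
})

decoder.configure({
  codec: 'avc1.64002a',                    // H.264 High profile (illustrative)
  hardwareAcceleration: 'prefer-hardware', // use the GPU decoder when available
})

const ws = new WebSocket('wss://example.invalid/stream')
ws.binaryType = 'arraybuffer'
ws.onmessage = (ev) => {
  const buf = new Uint8Array(ev.data as ArrayBuffer)
  const isKey = buf[0] === 1               // hypothetical framing: first byte marks keyframes
  decoder.decode(new EncodedVideoChunk({
    type: isKey ? 'key' : 'delta',
    timestamp: performance.now() * 1000,   // microseconds; the real pipeline uses encoder timestamps
    data: buf.subarray(1),
  }))
}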
The coffee‑shop nightmare
“The video is frozen.”
“Your Wi‑Fi is bad.”
“No, the video is definitely frozen. And now my keyboard isn’t working.”
checks the video – it shows what the AI was doing 30 seconds ago, and the delay keeps growing.
Turns out, a 40 Mbps stream doesn’t tolerate 200 ms+ latency. Who knew.
When the network gets congested:
- Frames buffer up in the TCP/WebSocket layer.
- They arrive in order (thanks TCP!) but increasingly delayed.
- Video falls further behind real‑time.
- You’re watching the AI type code from 45 seconds ago.
- By the time you see a bug, the AI has already committed it to main.
- Everything is terrible forever.
“Just lower the bitrate,” you say.
Great idea. Now it’s 10 Mbps of blocky garbage that’s still 30 seconds behind.
“What if we only send keyframes?”
Our big brain moment: H.264 keyframes (IDR frames) are self‑contained. Drop all P‑frames, send only keyframes → ~1 fps of clean video, perfect for low‑bandwidth fallback.
We added a keyframes_only flag, modified the decoder to check FrameType::Idr, set GOP = 60 (one keyframe per second at 60 fps), and tested.
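Conceptually, keyframes-only mode is just a filter in front of the decode call. A sketch, reusing the decoder and the hypothetical framing from the earlier snippet (in our stack the real check is the Rust-side FrameType::Idr test described above):

// Sketch: keyframes-only mode drops every delta frame before it reaches the decoder.
let keyframesOnly = false   // the low-bandwidth fallback flips this on

function handleChunk(chunk: EncodedVideoChunk) {
  if (keyframesOnly && chunk.type !== 'key') {
    return   // skip P-frames; with GOP = 60 that should leave ~1 fps of IDR frames
  }
  decoder.decode(chunk)     // decoder from the earlier WebCodecs sketch
}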
Result: exactly ONE frame.
[WebSocket] Keyframe received (frame 121), sending
[WebSocket] ...
[WebSocket] ...
[WebSocket] It's been 14 seconds why is nothing else coming
[WebSocket] Failed to send audio frame: Closed
checks Wolf logs — encoder still running
checks GStreamer pipeline — frames being produced
checks Moonlight protocol layer — nothing coming through
We’re using Wolf, an excellent open‑source game‑streaming server. Our WebSocket layer sits on top of the Moonlight protocol (reverse‑engineered from NVIDIA GameStream). Somewhere in that stack, something decides that if you’re not consuming P‑frames, you’re not ready for more frames. Period.
We poked around for a couple of hours, but without diving deep into Moonlight internals we couldn’t fix it. The protocol wanted all its frames, or none.
“What if we implement proper congestion control?”
looks at TCP congestion‑control literature
closes tab
“What if we just… don’t have bad Wi‑Fi?”
stares at enterprise firewall that’s throttling everything
The screenshot epiphany
One late night, while debugging a frozen stream, I opened our screenshot endpoint:
GET /api/v1/external-agents/abc123/screenshot?format=jpeg&quality=70
The image loaded instantly – a pristine 150 KB JPEG of the remote desktop, crystal clear, no artifacts, no waiting for keyframes, no decoder state. Just pixels.
I refreshed. Another instant image. I mashed F5 like a degenerate. 5 fps of perfect screenshots.
I looked at my beautiful WebCodecs pipeline. I looked at the JPEGs. I looked at the pipeline again.
No. No, we are not doing this.
We are professionals. We implement proper video codecs. We don’t spam HTTP requests for individual frames like it’s 2009.
// Poll screenshots as fast as possible (capped at 10 FPS max)
const fetchScreenshot = async () => {
  try {
    const response = await fetch(
      `/api/v1/external-agents/${sessionId}/screenshot`
    )
    const blob = await response.blob()
    // Revoke the previous object URL so we don't leak one blob per frame
    if (screenshotImg.src.startsWith('blob:')) {
      URL.revokeObjectURL(screenshotImg.src)
    }
    screenshotImg.src = URL.createObjectURL(blob)
  } catch {
    // Network hiccup: skip this frame, the next poll will try again
  }
  setTimeout(fetchScreenshot, 100) // yolo
}
We did it. We’re sending JPEGs.
And you know what? It works perfectly.
Quick comparison
| Property | H.264 Stream | JPEG Spam |
|---|---|---|
| Bandwidth (constant) | ~40 Mbps | 100‑500 Kbps |
So, while our fancy H.264 pipeline demanded a massive, latency‑sensitive pipe, the humble “just send screenshots” approach gave us a low‑bandwidth, low‑latency, enterprise‑friendly solution. 🎉
Video Streaming Trade‑offs
| Aspect | H.264 stream (stateful: corrupt frame = dead stream) | JPEG screenshots (stateless: each frame independent) |
|---|---|---|
| Latency sensitivity | Very high | Doesn't care |
| Recovery from packet loss | Wait for the next keyframe (seconds) | Next frame (≈ 100 ms) |
| Implementation complexity | 3 months of Rust | fetch() in a loop |
JPEG screenshots are self‑contained
- A JPEG either arrives complete or it doesn’t.
- No “partial decode”, no “waiting for the next keyframe”, no “decoder state corruption”.
When the network is bad you simply get fewer JPEGs – the ones that do arrive are perfect.
Size comparison
- 70 % quality JPEG of a 1080p desktop: 100‑150 KB
- Single H.264 keyframe: 200‑500 KB
So we send less data per frame and get better reliability.
Adaptive switching strategy
We didn’t discard the H.264 pipeline; we just added a fallback.
- Good connection (RTT < 150 ms) → use H.264 video.
- Bad connection (RTT ≥ 150 ms) → switch to JPEG screenshots.
Key insight: we still need the WebSocket for input.
Keyboard and mouse events are tiny (≈ 10 bytes each) and travel flawlessly even on a poor connection. We only needed to stop sending massive video frames.
Control message
{"set_video_enabled": false}
The server receives this, stops sending video frames, and the client begins polling screenshots while input continues to flow.
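On the client side this is one JSON message over the WebSocket we already have, plus kicking off the screenshot poller. A sketch (assumes the ws connection and the fetchScreenshot loop from earlier; the helper name is mine):

// Sketch: flip between H.264 video and JPEG polling over the existing WebSocket.
let screenshotMode = false

function setScreenshotMode(enabled: boolean) {
  if (screenshotMode === enabled) return
  screenshotMode = enabled

  // Tell the server to stop (or resume) pushing H.264 frames.
  // The control message is the JSON shown above.
  ws.send(JSON.stringify({ set_video_enabled: !enabled }))

  // Start polling JPEGs; keyboard/mouse input keeps flowing over the
  // WebSocket in both modes. (Stopping the poll loop is omitted here.)
  if (enabled) fetchScreenshot()
}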
Rust snippet
if !video_enabled.load(Ordering::Relaxed) {
continue; // skip frame, it's screenshot time
}
The oscillation bug
When video frames stop, the WebSocket becomes almost empty (only tiny input events and occasional pings).
Latency drops dramatically, so the adaptive logic thinks the connection has recovered and switches back to video.
Result:
- Video resumes → 40 Mbps flood → latency spikes → switch to screenshots.
- Screenshots → latency drops → switch back to video.
This loop repeats every ~2 seconds.
Fix
Lock the mode to screenshots until the user explicitly clicks Retry.
setAdaptiveLockedToScreenshots(true); // no more oscillation
We display an amber icon with the message:
“Video paused to save bandwidth. Click to retry.”
Now the user is in control and the infinite loop is gone.
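With the lock, the adaptive check becomes one-way: it can drop down to screenshots on its own, but only the user can bring video back. A sketch of that logic (RTT measurement and the 150 ms threshold as above; showBanner stands in for the amber icon UI):

// Sketch: adaptive switching with the oscillation fix.
// Once we fall back to screenshots, stay there until the user clicks Retry.
const RTT_THRESHOLD_MS = 150
let lockedToScreenshots = false

function setAdaptiveLockedToScreenshots(locked: boolean) {
  lockedToScreenshots = locked
}

function onRttSample(rttMs: number) {
  if (lockedToScreenshots) return            // ignore "recovered" RTT readings

  if (rttMs >= RTT_THRESHOLD_MS) {
    setScreenshotMode(true)                  // from the previous sketch
    setAdaptiveLockedToScreenshots(true)     // no more oscillation
    showBanner('Video paused to save bandwidth. Click to retry.') // hypothetical UI helper
  }
}

function onRetryClick() {
  setAdaptiveLockedToScreenshots(false)
  setScreenshotMode(false)                   // back to H.264 until RTT degrades again
}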
Ubuntu Doesn’t Ship JPEG Support in grim (Because Of Course It Doesn’t)
Oh, you thought we were done? Cute.
grim is a Wayland screenshot tool—perfect for our needs. It supports JPEG output for smaller files.
The Problem
Ubuntu compiles grim without libjpeg support:
$ grim -t jpeg screenshot.jpg
error: jpeg support disabled
Incredible.
The Solution
Add a build stage to the Dockerfile that compiles grim from source with JPEG support enabled.
# Dockerfile
FROM ubuntu:25.04 AS grim-build

# Install build dependencies (grim also needs wayland, pixman and libpng headers)
RUN apt-get update && \
    apt-get install -y \
        meson \
        ninja-build \
        build-essential \
        pkg-config \
        git \
        libjpeg-turbo8-dev \
        libpng-dev \
        libpixman-1-dev \
        libwayland-dev \
        wayland-protocols

# Clone and build grim with JPEG support
RUN git clone https://git.sr.ht/~emersion/grim /opt/grim && \
    cd /opt/grim && \
    meson setup build -Djpeg=enabled && \
    ninja -C build
So here we are in 2025, compiling a screenshot tool from source so we can send JPEGs. And it works perfectly.
The Final Architecture
┌─────────────────────────────────────────────────────────────┐
│ User's Browser │
├─────────────────────────────────────────────────────────────┤
│ WebSocket (always connected) │
│ ├─ Video frames (H.264) ─────── when RTT < 150 ms │
│ └─ GET /screenshot?quality=70 │
└─────────────────────────────────────────────────────────────┘
Connection quality
| Condition | Video mode | Frame rate | Remarks |
|---|---|---|---|
| Good connection (RTT < 150 ms) | H.264 | 60 fps | Low latency |
| Bad connection (RTT ≥ 150 ms) | JPEG screenshots | 5‑10 fps | Bandwidth‑friendly |
- When switching to screenshots, we cap the fetch rate at 10 FPS to avoid overload.
- If a frame takes longer than 100 ms to fetch, we skip it.
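Putting those two rules together, the screenshot poller ends up slightly less yolo than the first version. A sketch (interpreting "skip it" as aborting the request after 100 ms; screenshotMode is the flag from the earlier sketch):

// Sketch: screenshot polling with a 10 FPS cap and a 100 ms per-frame budget.
const POLL_INTERVAL_MS = 100 // 10 FPS cap

async function pollScreenshots(sessionId: string, img: HTMLImageElement) {
  while (screenshotMode) {
    const started = performance.now()
    try {
      const response = await fetch(
        `/api/v1/external-agents/${sessionId}/screenshot?format=jpeg&quality=70`,
        { signal: AbortSignal.timeout(100) } // give up on frames slower than 100 ms
      )
      const blob = await response.blob()
      if (img.src.startsWith('blob:')) URL.revokeObjectURL(img.src)
      img.src = URL.createObjectURL(blob)
    } catch {
      // Timed out or failed: skip this frame and try again on the next tick
    }
    // Never poll faster than 10 FPS, even when fetches are instant
    const elapsed = performance.now() - started
    await new Promise((resolve) => setTimeout(resolve, Math.max(0, POLL_INTERVAL_MS - elapsed)))
  }
}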
Star us if you find it useful!