Building a 1,056-Test Rust CLI Without Writing Rust — Claude Code Did It
Source: Dev.to
Introduction
I don’t write Rust. I can read it well enough to catch obvious bugs, but I’ve never typed `impl` or `fn main()` from scratch. Yet I shipped a 40‑module Rust CLI with 1,056 tests in 3 weeks. Claude Code wrote every line of Rust. I wrote prompts, reviewed diffs, and made architecture decisions. The tool — ContextZip — compresses Claude Code’s own context window, so the AI built a tool to make itself work better.
Process Overview
I never gave Claude Code a vague instruction like “build a context compressor.” Every task was a sub‑agent dispatch — a scoped prompt with clear inputs, expected outputs, and test requirements.
Example Dispatch
Implement an error stacktrace filter for Node.js.
Input: raw stderr with Express middleware frames.
Output: error message + user code frames only.
Write 20+ test cases covering nested errors, empty traces, and mixed stdout/stderr.
Put the filter in `src/filters/error_stacktrace.rs`.
The sub‑agent implements the feature, writes tests, and runs them. Then I dispatch a second sub‑agent to review:
Review the error_stacktrace filter. Check edge cases: what happens with zero frames? Frames with no file path? Stack traces inside JSON output?
This implement‑then‑review cycle caught about 80 % of bugs before I even looked at the code.
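To make the dispatch concrete, here is a rough sketch of what such a filter could look like. This is not ContextZip’s actual code; the frame markers (`node_modules`, `node:internal`) are my assumptions about what counts as a framework frame in Node.js output:

```rust
/// Hypothetical simplification of an error-stacktrace filter:
/// keep the error message and user-code frames, drop frames that
/// point into node_modules or Node.js internals.
fn filter_stacktrace(raw: &str) -> String {
    raw.lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            if !trimmed.starts_with("at ") {
                return true; // error message / non-frame lines survive
            }
            // Drop framework and runtime frames.
            !(trimmed.contains("node_modules") || trimmed.contains("node:internal"))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "TypeError: x is undefined\n    \
               at handler (/app/src/routes.js:12:5)\n    \
               at Layer.handle (/app/node_modules/express/lib/router/layer.js:95:5)\n    \
               at processTicksAndRejections (node:internal/process/task_queues:96:5)";
    println!("{}", filter_stacktrace(raw));
}
```

The review sub‑agent’s edge cases map directly onto this shape: zero frames means every line passes the first check, and frames with no file path still match the `at ` prefix test.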
Forking RTK
The foundation was RTK (Rust Token Killer), an open‑source CLI with 34 command modules, 60+ TOML filters, and 950 tests. I forked it and dispatched a sub‑agent to rename every reference from `rtk` to `contextzip` across 70 files:
- 1,544 insertions, 1,182 deletions
- All 950 tests still passing
Then three agents worked in parallel on:
- The install script
- GitHub Actions CI/CD for 5 platforms
- Extending the SQLite tracking system
By Friday, `curl | bash` installed the binary on Linux or macOS, and `contextzip gain --by-feature` showed per‑filter savings.
New Compression Filters
ContextZip evolved from a rename into a product with six new filters, each built via the sub‑agent cycle:
| Filter | Description |
|---|---|
| Error stacktraces | Strips framework frames from Node.js, Python, Rust, Go, Java |
| ANSI preprocessor | Removes escape codes, spinners, progress bars |
| Web page extraction | Strips navigation, footer, ads; keeps article content |
| Build error grouping | Collapses 40 identical TypeScript errors into one group |
| Package install compression | Removes deprecated warnings, keeps security alerts |
| Docker build compression | Success → 1 line; failure → full context |
Each filter received 15–20 dedicated test cases; the error‑stacktrace filter alone has 20 tests covering five languages.
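For a sense of what one of these filters involves mechanically, here is a minimal sketch of ANSI stripping — again my own illustration, not the shipped filter, using a small state machine over CSI escape sequences rather than any particular crate:

```rust
/// Minimal sketch of ANSI preprocessing: strip CSI escape sequences
/// such as "\x1b[32m" (color) and "\x1b[2K" (erase line), which is how
/// spinners and progress bars pollute captured terminal output.
fn strip_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\x1b' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // A CSI sequence ends on a final byte in '@'..='~'.
            while let Some(&n) = chars.peek() {
                chars.next();
                if ('@'..='~').contains(&n) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let colored = "\x1b[32mPASS\x1b[0m build finished";
    println!("{}", strip_ansi(colored)); // the escapes are gone
}
```

Spinner-heavy logs repeat erase-and-redraw sequences hundreds of times, which is why this category benchmarks so well.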
Benchmark Results
I ran 102 benchmark tests with production‑scale inputs. The results varied:
| Category | Cases | Avg Savings | Best | Worst |
|---|---|---|---|---|
| Docker build | 10 | 88.2 % | 97 % | 77 % |
| ANSI/spinners | 15 | 82.5 % | 98 % | 41 % |
| Error stacktraces | 20 | 58.7 % | 97 % | 2 % |
| Build errors | 15 | 55.6 % | 90 % | -10 % |
Notable Adjustments
- Rust panic compression started at 2 % (only stripped the backtrace header). After refining the prompt with explicit Rust panic examples, it reached 80 %.
- Java stacktrace compression initially went negative (-12 %) on short traces. Adding a threshold—if compression ratio < 10 %, pass through the original output—yielded 20 % savings with no negatives.
- Build error grouping hit -10 % on single‑error inputs; the same threshold fix resolved it.
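The threshold fix described above is simple to express. This is my sketch of the idea, with `compress` standing in for any of the real filters:

```rust
/// Sketch of the pass-through threshold: if a filter saves less than
/// 10% of the input, emit the original untouched, so compression can
/// never go negative on short inputs.
fn apply_with_threshold(input: &str, compress: impl Fn(&str) -> String) -> String {
    let compressed = compress(input);
    let saved = input.len().saturating_sub(compressed.len());
    // Integer math: saved/len < 10%  <=>  saved * 10 < len.
    if saved * 10 < input.len() {
        input.to_string() // below threshold: pass through unchanged
    } else {
        compressed
    }
}

fn main() {
    // A "filter" that actually grows short inputs, like the -12% Java case.
    let grouping = |s: &str| format!("[1 group] {}", s);
    let short_trace = "Exception in thread \"main\"";
    // The threshold guard returns the original instead of the longer output.
    println!("{}", apply_with_threshold(short_trace, grouping));
}
```

The guard costs one length comparison per filter invocation and turns the worst case from a regression into a no-op.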
The README shows every result, including the weak spots, because “lying about benchmarks is worse than imperfect numbers.”
Roles and Split
| Role | Contributions |
|---|---|
| Me | Architecture decisions, prompt design, review, quality gates, benchmark analysis, bug triage |
| Claude Code | All Rust implementation, test writing, CI/CD configuration, README generation, install script |
The split was roughly 20 % me (thinking, reviewing, deciding) and 80 % Claude (typing, testing, building). That 20 % was the difference between shipping and not shipping; without review cycles, the Rust panic filter would still be at 2 %.
Statistics
- 1,056 tests, 0 failures
- 102 benchmark cases
- 40+ command modules (34 inherited + 6 new)
- 5‑platform CI/CD (Linux x86/musl, macOS arm64/x86, Windows)
- 3 install methods (curl, Homebrew, cargo)
- README in 4 languages
The tool works; I use it daily. My Claude Code sessions last 40–60 % longer before hitting context limits. The AI built a tool to extend its own memory, and the human review cycles are why it actually works.