Building a 1,056-Test Rust CLI Without Writing Rust — Claude Code Did It

Published: (March 19, 2026 at 08:07 AM EDT)
Source: Dev.to

Introduction

I don’t write Rust. I can read it well enough to catch obvious bugs, but I’ve never typed impl or fn main() from scratch. Yet I shipped a 40‑module Rust CLI with 1,056 tests in 3 weeks. Claude Code wrote every line of Rust. I wrote prompts, reviewed diffs, and made architecture decisions. The tool — ContextZip — compresses Claude Code’s own context window, so the AI built a tool to make itself work better.

Process Overview

I never gave Claude Code a vague instruction like “build a context compressor.” Every task was a sub‑agent dispatch — a scoped prompt with clear inputs, expected outputs, and test requirements.

Example Dispatch

Implement an error stacktrace filter for Node.js. 
Input: raw stderr with Express middleware frames. 
Output: error message + user code frames only. 
Write 20+ test cases covering nested errors, empty traces, and mixed stdout/stderr. 
Put the filter in src/filters/error_stacktrace.rs.

The sub‑agent implements the feature, writes tests, and runs them. Then I dispatch a second sub‑agent to review:

Review the error_stacktrace filter. Check edge cases: what happens with zero frames? Frames with no file path? Stack traces inside JSON output?

This implement‑then‑review cycle caught about 80 % of bugs before I even looked at the code.
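To make the example dispatch concrete, here is a minimal sketch of what such a Node.js stacktrace filter might look like. It is not ContextZip's actual code: the function names and the list of framework markers are assumptions chosen for illustration, and a real filter would need to handle more formats.

```rust
/// Returns true when a stack frame points into framework or runtime code.
/// The marker list is an assumption made for this sketch.
fn is_framework_frame(line: &str) -> bool {
    const MARKERS: [&str; 3] = ["node_modules/", "node:internal", "(internal/"];
    MARKERS.iter().any(|marker| line.contains(marker))
}

/// Keeps the error message and user-code frames from a Node.js stack trace
/// and replaces the dropped frames with a one-line summary.
fn filter_node_stacktrace(stderr: &str) -> String {
    let mut kept: Vec<&str> = Vec::new();
    let mut dropped = 0usize;

    for line in stderr.lines() {
        // V8 prints frames as indented lines starting with "at ".
        let is_frame = line.trim_start().starts_with("at ");
        if is_frame && is_framework_frame(line) {
            dropped += 1;
        } else {
            // Error messages, user frames, and frames without a file path
            // (for example "<anonymous>") are all kept.
            kept.push(line);
        }
    }

    let mut out = kept.join("\n");
    if dropped > 0 {
        out.push_str(&format!("\n    ... {dropped} framework frames omitted"));
    }
    out
}
```

In this sketch the edge cases from the review prompt fall out of the structure: a trace with zero frames is returned unchanged, and frames with no file path are treated as user code rather than dropped.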

Forking RTK

The foundation was RTK (Rust Token Killer), an open‑source CLI with 34 command modules, 60+ TOML filters, and 950 tests. I forked it and dispatched a sub‑agent to rename every reference from rtk to contextzip across 70 files:

  • 1,544 insertions, 1,182 deletions
  • All 950 tests still passing

Then three agents worked in parallel on:

  1. The install script
  2. GitHub Actions CI/CD for 5 platforms
  3. Extending the SQLite tracking system

By Friday, curl | bash installed the binary on Linux or macOS, and contextzip gain --by-feature showed per‑filter savings.

New Compression Filters

ContextZip evolved from a rename into a product with six new filters, each built via the sub‑agent cycle:

| Filter | Description |
| --- | --- |
| Error stacktraces | Strips framework frames from Node.js, Python, Rust, Go, Java |
| ANSI preprocessor | Removes escape codes, spinners, progress bars |
| Web page extraction | Strips navigation, footer, ads; keeps article content |
| Build error grouping | Collapses 40 identical TypeScript errors into one group |
| Package install compression | Removes deprecated warnings, keeps security alerts |
| Docker build compression | Success → 1 line; failure → full context |

Each filter received 15–20 dedicated test cases; the error‑stacktrace filter alone has 20 tests covering five languages.
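As an illustration of the build error grouping idea, the sketch below collapses compiler lines that share the same message. It assumes plain `tsc`-style lines of the form `path(line,col): error TSxxxx: message`; the function name and the output layout are hypothetical and not taken from the ContextZip source.

```rust
use std::collections::HashMap;

/// Groups compiler error lines that share the same message.
/// Assumes lines shaped like `path(line,col): error TSxxxx: message`.
fn group_build_errors(output: &str) -> String {
    let mut order: Vec<&str> = Vec::new(); // messages in first-seen order
    let mut groups: HashMap<&str, Vec<&str>> = HashMap::new();
    let mut other: Vec<&str> = Vec::new(); // lines that are not errors

    for line in output.lines() {
        match line.split_once(": error ") {
            Some((location, message)) => {
                if !groups.contains_key(message) {
                    order.push(message);
                }
                groups.entry(message).or_default().push(location);
            }
            None => other.push(line),
        }
    }

    let mut out: Vec<String> = Vec::new();
    for message in order {
        let locations = &groups[message];
        if locations.len() == 1 {
            // A single occurrence is emitted in its original form.
            out.push(format!("{}: error {}", locations[0], message));
        } else {
            out.push(format!("error {} ({} occurrences)", message, locations.len()));
            out.push(format!("  at {}", locations.join(", ")));
        }
    }
    // Summary lines and other non-error output follow the groups.
    out.extend(other.into_iter().map(String::from));
    out.join("\n")
}
```

The saving comes from printing a repeated message once and listing only the locations, so the benefit grows with the number of duplicates.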

Benchmark Results

I ran 102 benchmark tests with production‑scale inputs. The results varied:

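| Category | Cases | Avg Savings | Best | Worst |
| --- | --- | --- | --- | --- |
| Docker build | 10 | 88.2 % | 97 % | 77 % |
| ANSI/spinners | 15 | 82.5 % | 98 % | 41 % |
| Error stacktraces | 20 | 58.7 % | 97 % | 2 % |
| Build errors | 15 | 55.6 % | 90 % | -10 % |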
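For readers who want to reproduce figures like these, here is one way the Avg Savings, Best, and Worst columns could be computed. This is a sketch under two assumptions that the post does not confirm: that size is measured as a simple length (bytes here, though tokens would work the same way), and that the average is the mean of per-case percentages. All names are hypothetical.

```rust
/// Percentage of the input removed by compression.
/// Negative when the "compressed" output is larger than the input.
fn savings_pct(original_len: usize, compressed_len: usize) -> f64 {
    if original_len == 0 {
        return 0.0;
    }
    (1.0 - compressed_len as f64 / original_len as f64) * 100.0
}

/// Per-category summary matching the table columns.
struct Summary {
    avg: f64,
    best: f64,
    worst: f64,
}

/// Summarizes a list of (original_len, compressed_len) benchmark cases.
/// Returns None for an empty category.
fn summarize(cases: &[(usize, usize)]) -> Option<Summary> {
    if cases.is_empty() {
        return None;
    }
    let pcts: Vec<f64> = cases
        .iter()
        .map(|&(original, compressed)| savings_pct(original, compressed))
        .collect();
    let avg = pcts.iter().sum::<f64>() / pcts.len() as f64;
    let best = pcts.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let worst = pcts.iter().copied().fold(f64::INFINITY, f64::min);
    Some(Summary { avg, best, worst })
}
```

Under this definition a negative worst case, such as the -10 % for build errors, simply means the filter's output was longer than its input for at least one case.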

Notable Adjustments

  • Rust panic compression started at 2 % (only stripped the backtrace header). After refining the prompt with explicit Rust panic examples, it reached 80 %.
  • Java stacktrace compression initially went negative (-12 %) on short traces. Adding a threshold (if the compression ratio is below 10 %, pass the original output through unchanged) yielded 20 % savings with no negatives.
  • Build error grouping hit -10 % on single‑error inputs; the same threshold fix resolved it.
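The threshold fix described above can be sketched as a small wrapper around any filter. This is an illustration, not the project's implementation: it reads "compression ratio < 10 %" as "savings below 10 %", measures size in bytes, and uses hypothetical names throughout.

```rust
/// Minimum savings (in percent) a filter must reach before its output is used.
/// The 10 % value comes from the post; the constant name is hypothetical.
const MIN_SAVINGS_PCT: f64 = 10.0;

/// Runs `filter` on `input` and falls back to the original text when the
/// filter saves less than `MIN_SAVINGS_PCT` percent (or makes things longer).
fn apply_with_threshold<F>(input: &str, filter: F) -> String
where
    F: Fn(&str) -> String,
{
    if input.is_empty() {
        return String::new();
    }
    let candidate = filter(input);
    // Savings measured in bytes; this is an assumption of the sketch.
    let savings = (1.0 - candidate.len() as f64 / input.len() as f64) * 100.0;
    if savings < MIN_SAVINGS_PCT {
        input.to_string()
    } else {
        candidate
    }
}
```

Because the check runs after the filter, a filter whose added headers or summaries outweigh what it removes on short inputs can never make the output worse than the original.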

The README shows every result, including the weak spots, because “lying about benchmarks is worse than imperfect numbers.”

Roles and Split

| Role | Contributions |
| --- | --- |
| Me | Architecture decisions, prompt design, review, quality gates, benchmark analysis, bug triage |
| Claude Code | All Rust implementation, test writing, CI/CD configuration, README generation, install script |

The split was roughly 20 % me (thinking, reviewing, deciding) and 80 % Claude (typing, testing, building). That 20 % was the difference between shipping and not shipping; without review cycles, the Rust panic filter would still be at 2 %.

Statistics

  • 1,056 tests, 0 failures
  • 102 benchmark cases
  • 40+ command modules (34 inherited + 6 new)
  • 5‑platform CI/CD (Linux x86/musl, macOS arm64/x86, Windows)
  • 3 install methods (curl, Homebrew, cargo)
  • README in 4 languages

The tool works; I use it daily. My Claude Code sessions last 40–60 % longer before hitting context limits. The AI built a tool to extend its own memory, and the human review cycles are why it actually works.

Repository

GitHub: jee599/contextzip