Building a 1,056-Test Rust CLI Without Writing Rust — Claude Code Did It
Source: Dev.to
Introduction
I don’t write Rust. I can read it well enough to catch obvious bugs, but I’ve never typed `impl` or `fn main()` from scratch. Yet I shipped a 40‑module Rust CLI with 1,056 tests in 3 weeks. Claude Code wrote every line of Rust. I wrote prompts, reviewed diffs, and made architecture decisions. The tool — ContextZip — compresses Claude Code’s own context window, so the AI built a tool to make itself work better.
Process Overview
I never gave Claude Code a vague instruction like “build a context compressor.” Every task was a sub‑agent dispatch — a scoped prompt with clear inputs, expected outputs, and test requirements.
Example Dispatch
Implement an error stacktrace filter for Node.js.
Input: raw stderr with Express middleware frames.
Output: error message + user code frames only.
Write 20+ test cases covering nested errors, empty traces, and mixed stdout/stderr.
Put the filter in `src/filters/error_stacktrace.rs`.
The sub‑agent implements the feature, writes tests, and runs them. Then I dispatch a second sub‑agent to review:
Review the error_stacktrace filter. Check edge cases: what happens with zero frames? Frames with no file path? Stack traces inside JSON output?
This implement‑then‑review cycle caught about 80 % of bugs before I even looked at the code.
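To make the dispatch concrete, here is a rough sketch of what such a filter could look like. This is not ContextZip’s actual code; the frame markers (`node_modules`, `node:internal`) are my assumptions about what counts as a framework frame in Node.js output:

```rust
/// Hypothetical simplification of an error-stacktrace filter:
/// keep the error message and user-code frames, drop frames that
/// point into node_modules or Node.js internals.
fn filter_stacktrace(raw: &str) -> String {
    raw.lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            if !trimmed.starts_with("at ") {
                return true; // error message / non-frame lines survive
            }
            // Drop framework and runtime frames.
            !(trimmed.contains("node_modules") || trimmed.contains("node:internal"))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "TypeError: x is undefined\n    \
               at handler (/app/src/routes.js:12:5)\n    \
               at Layer.handle (/app/node_modules/express/lib/router/layer.js:95:5)\n    \
               at processTicksAndRejections (node:internal/process/task_queues:96:5)";
    println!("{}", filter_stacktrace(raw));
}
```

The review sub‑agent’s edge cases map directly onto this shape: zero frames means every line passes the first check, and frames with no file path still match the `at ` prefix test.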
Forking RTK
The foundation was RTK (Rust Token Killer), an open‑source CLI with 34 command modules, 60+ TOML filters, and 950 tests. I forked it and dispatched a sub‑agent to rename every reference from `rtk` to `contextzip` across 70 files:
- 1,544 insertions, 1,182 deletions
- All 950 tests still passing
Then three agents worked in parallel on:
- The install script
- GitHub Actions CI/CD for 5 platforms
- Extending the SQLite tracking system
By Friday, `curl | bash` installed the binary on Linux or macOS, and `contextzip gain --by-feature` showed per‑filter savings.
New Compression Filters
ContextZip evolved from a rename into a product with six new filters, each built via the sub‑agent cycle:
| Filter | Description |
|---|---|
| Error stacktraces | Strips framework frames from Node.js, Python, Rust, Go, Java |
| ANSI preprocessor | Removes escape codes, spinners, progress bars |
| Web page extraction | Strips navigation, footer, ads; keeps article content |
| Build error grouping | Collapses 40 identical TypeScript errors into one group |
| Package install compression | Removes deprecated warnings, keeps security alerts |
| Docker build compression | Success → 1 line; failure → full context |
Each filter received 15–20 dedicated test cases; the error‑stacktrace filter alone has 20 tests covering five languages.
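For a sense of what one of these filters involves mechanically, here is a minimal sketch of ANSI stripping — again my own illustration, not the shipped filter, using a small state machine over CSI escape sequences rather than any particular crate:

```rust
/// Minimal sketch of ANSI preprocessing: strip CSI escape sequences
/// such as "\x1b[32m" (color) and "\x1b[2K" (erase line), which is how
/// spinners and progress bars pollute captured terminal output.
fn strip_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\x1b' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // A CSI sequence ends on a final byte in '@'..='~'.
            while let Some(&n) = chars.peek() {
                chars.next();
                if ('@'..='~').contains(&n) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let colored = "\x1b[32mPASS\x1b[0m build finished";
    println!("{}", strip_ansi(colored)); // the escapes are gone
}
```

Spinner-heavy logs repeat erase-and-redraw sequences hundreds of times, which is why this category benchmarks so well.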
Benchmark Results
I ran 102 benchmark tests with production‑scale inputs. The results varied:
| Category | Cases | Avg Savings | Best | Worst |
|---|---|---|---|---|
| Docker build | 10 | 88.2 % | 97 % | 77 % |
| ANSI/spinners | 15 | 82.5 % | 98 % | 41 % |
| Error stacktraces | 20 | 58.7 % | 97 % | 2 % |
| Build errors | 15 | 55.6 % | 90 % | -10 % |
Notable Adjustments
- Rust panic compression started at 2 % (only stripped the backtrace header). After refining the prompt with explicit Rust panic examples, it reached 80 %.
- Java stacktrace compression initially went negative (-12 %) on short traces. Adding a threshold—if compression ratio < 10 %, pass through the original output—yielded 20 % savings with no negatives.
- Build error grouping hit -10 % on single‑error inputs; the same threshold fix resolved it.
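The threshold fix described above is simple to express. This is my sketch of the idea, with `compress` standing in for any of the real filters:

```rust
/// Sketch of the pass-through threshold: if a filter saves less than
/// 10% of the input, emit the original untouched, so compression can
/// never go negative on short inputs.
fn apply_with_threshold(input: &str, compress: impl Fn(&str) -> String) -> String {
    let compressed = compress(input);
    let saved = input.len().saturating_sub(compressed.len());
    // Integer math: saved/len < 10%  <=>  saved * 10 < len.
    if saved * 10 < input.len() {
        input.to_string() // below threshold: pass through unchanged
    } else {
        compressed
    }
}

fn main() {
    // A "filter" that actually grows short inputs, like the -12% Java case.
    let grouping = |s: &str| format!("[1 group] {}", s);
    let short_trace = "Exception in thread \"main\"";
    // The threshold guard returns the original instead of the longer output.
    println!("{}", apply_with_threshold(short_trace, grouping));
}
```

The guard costs one length comparison per filter invocation and turns the worst case from a regression into a no-op.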
The README shows every result, including the weak spots, because “lying about benchmarks is worse than imperfect numbers.”
Roles and Split
| Role | Contributions |
|---|---|
| Me | Architecture decisions, prompt design, review, quality gates, benchmark analysis, bug triage |
| Claude Code | All Rust implementation, test writing, CI/CD configuration, README generation, install script |
The split was roughly 20 % me (thinking, reviewing, deciding) and 80 % Claude (typing, testing, building). That 20 % was the difference between shipping and not shipping; without review cycles, the Rust panic filter would still be at 2 %.
Statistics
- 1,056 tests, 0 failures
- 102 benchmark cases
- 40+ command modules (34 inherited + 6 new)
- 5‑platform CI/CD (Linux x86/musl, macOS arm64/x86, Windows)
- 3 install methods (curl, Homebrew, cargo)
- README in 4 languages
The tool works; I use it daily. My Claude Code sessions last 40–60 % longer before hitting context limits. The AI built a tool to extend its own memory, and the human review cycles are why it actually works.