Switching to Secondary Is Faster
Source: Dev.to
Introduction
Switching to a secondary (smaller) model is often faster than using a flagship model for every step of an LLM workflow. Just as you would switch to a pistol for a quick shot rather than reloading a rifle, you can use a smaller model for boilerplate, spec drafts, and initial plans, then hand the result to a larger model for review.
Why a Smaller Model Can Be Faster
- Prefill cost: Prefill is usually a single forward pass (ignoring advanced techniques like chunking or sequence parallelism). Each token after that is simply one more model.forward() call, so decode speed dominates the total time.
- Speed comparison:
- Large model generation speed: ~50 tokens / second.
- Small model generation speed: ~200 tokens / second.
- Example:
- Prompt: 16 k tokens (typical for a Claude Code session).
- Desired output: another 16 k tokens (including tool calls, reads, edits).
- Large model: 16 k / 50 ≈ 320 seconds.
- Small model: 16 k / 200 ≈ 80 seconds.
Thus, the small model can complete the same work in a quarter of the time.
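The arithmetic above can be checked with a short sketch. It assumes a steady decode rate, reads "16 k" as 16,000 tokens, and ignores prefill time; the function name decode_seconds is made up for illustration.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to generate output_tokens at a steady decode rate."""
    return output_tokens / tokens_per_second


OUTPUT_TOKENS = 16_000  # "16 k" from the example, read as 16,000 (assumption)
LARGE_TPS = 50.0        # large model rate from the comparison above
SMALL_TPS = 200.0       # small model rate from the comparison above

large = decode_seconds(OUTPUT_TOKENS, LARGE_TPS)
small = decode_seconds(OUTPUT_TOKENS, SMALL_TPS)
speedup = large / small

print(f"large: {large:.0f}s, small: {small:.0f}s, speedup: {speedup:.0f}x")
# prints "large: 320s, small: 80s, speedup: 4x"
```

The ratio depends only on the two decode rates, so the 4x figure holds for any output length as long as those rates stay constant.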
Speculative Decoding Analogy
Speculative decoding pairs a small draft model, which proposes several tokens at a time, with a large model that verifies them in parallel. Using a secondary model for the first pass applies the same idea at the scale of a whole task: the small model drafts a long output (e.g., 16 k tokens) and the large model checks it.
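To make the analogy concrete, here is a toy sketch of the greedy variant of speculative decoding. It is not how any particular inference engine implements it: the draft and target "models" are hypothetical stand-in functions over integer tokens, and verification runs sequentially here where a real system would score all proposed positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical greedy next-token function


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the output matches target-only decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each proposed token in order.
        for tok in proposal:
            expected = target(out)  # out already holds the accepted prefix
            if expected != tok:
                out.append(expected)  # keep the target's token, drop the rest
                break
            out.append(tok)
    return out


# Toy stand-ins (assumptions): the target counts up modulo 10,
# and the draft agrees except after a token ending in 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


result = speculative_decode(toy_draft, toy_target, prompt=[0], max_new=12)
print(result)
```

The key property is that a wrong draft never changes the final output; it only costs a rejected proposal. The workflow in this post relaxes that guarantee, since the large model reviews the draft rather than re-deriving every token.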
Practical Workflow
- Plan
  - Use a small model for speed or a large model for precision.
  - Large models are more accurate but consume more tokens during planning.
- Review
  - Pass the plan to a large model and fix any issues.
- Generate Code
  - Let the small model implement the refined specification.
- Review Again
  - Use the large model to catch mistakes the small model missed.
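The four steps above could be wired together roughly as follows. This is a minimal sketch, not a real client: Model is a hypothetical "prompt in, text out" callable, the prompt wording is invented, and the stubs only record which model was called so the order is visible.

```python
from typing import Callable, List

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def draft_then_review(task: str, small: Model, large: Model) -> str:
    """Plan and implement with the small model; review each stage with the large one."""
    plan = small(f"Write a plan for: {task}")
    reviewed_plan = large(f"Review this plan and fix any issues:\n{plan}")
    code = small(f"Implement this specification:\n{reviewed_plan}")
    return large(f"Review this code and fix mistakes:\n{code}")


# Stub models (assumptions) that log the call order instead of calling an API.
calls: List[str] = []


def small_stub(prompt: str) -> str:
    calls.append("small")
    return "small-model draft"


def large_stub(prompt: str) -> str:
    calls.append("large")
    return "large-model revision"


final = draft_then_review("add a retry wrapper", small_stub, large_stub)
print(calls)  # ['small', 'large', 'small', 'large']
```

Swapping the first small(...) call for large(...) gives the "large model for precision" planning option from the Plan step.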
Model Choices
- Small model: Qwen 3.6 35B MoE – fast enough to run locally and produces reasonable boilerplate.
- Large model: Acts primarily as a reviewer rather than a first‑pass generator.
Limitations
- This approach hasn’t been extensively tested on novel codebases.
- For truly new problems, writing the initial code yourself and then using the small model for repetitive tasks (e.g., generating tests and boilerplate) works best.