Switching to Secondary Is Faster

Published: May 2, 2026 at 02:57 AM EDT

Source: Dev.to

Introduction

Switching to a secondary (smaller) model is often faster than using a flagship model for every step of an LLM workflow. Just as you would switch to a pistol for a quick shot rather than reloading a rifle, you can use a smaller model for boilerplate, spec drafts, and initial plans, then hand the result to a larger model for review.

Why a Smaller Model Can Be Faster

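Prefill cost: Prefill is usually a single forward pass over the whole prompt (ignoring advanced techniques like chunking or sequence parallelism). Every generated token after that needs its own model.forward() call, so for long outputs the total time is dominated by generation speed.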
Speed comparison:
- Large model generation speed: ~50 tokens / second.
- Small model generation speed: ~200 tokens / second.
Example:
- Prompt: 16 k tokens (typical for a Claude Code session).
- Desired output: another 16 k tokens (including tool calls, reads, edits).
- Large model: 16 k / 50 ≈ 320 seconds.
- Small model: 16 k / 200 ≈ 80 seconds.

Thus, ignoring prefill time, the small model can generate the same output in a quarter of the time.
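
The arithmetic above can be reproduced with a small sketch. It assumes "16 k" means 16,000 tokens, that generation runs at a steady rate, and that prefill time is ignored; the function name decode_seconds is made up for illustration.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate wall-clock seconds to generate output_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


OUTPUT_TOKENS = 16_000  # assumption: "16 k" taken as 16,000 tokens

large = decode_seconds(OUTPUT_TOKENS, 50)   # large model: ~50 tokens / second
small = decode_seconds(OUTPUT_TOKENS, 200)  # small model: ~200 tokens / second

print(f"large: {large:.0f} s, small: {small:.0f} s, speedup: {large / small:.1f}x")
# prints "large: 320 s, small: 80 s, speedup: 4.0x"
```

Swapping in measured throughput for your own models gives a rough estimate of how much time a first pass on the smaller model would save.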

Speculative Decoding Analogy

Speculative decoding uses a small draft model to propose several tokens, which a large model then verifies in parallel. Using a secondary model for the first pass is loosely the same idea applied to whole drafts (e.g., 16 k tokens) instead of a handful of tokens: the cheap model writes, the expensive model checks.
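
To make the draft-then-verify idea concrete, here is a toy sketch of one round of greedy speculative decoding. It is a simplification under stated assumptions: the "models" are plain Python functions (draft_model and target_model are hypothetical stand-ins), verification is a loop instead of one batched forward pass, and acceptance is exact-match rather than the probabilistic rule used in real implementations.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context tokens in, next token out


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round (greedy variant)."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position. A real system would
    #    score all k positions in a single batched forward pass.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    accepted.append(target(ctx))  # every draft token matched: one bonus token
    return accepted


def target_model(ctx: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    """Toy draft: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_model, target_model, k=4))
# prints "[2, 3, 4, 5]"
```

Three draft tokens are accepted and the fourth is corrected by the target, so the round yields four tokens for one verification pass. The workflow in this post follows the same pattern at a coarser grain: the large model reviews a whole draft instead of a few tokens.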

Practical Workflow

Plan
- Use a small model for speed or a large model for precision.
- Large models are more accurate but consume more tokens during planning.
Review
- Pass the plan to a large model and fix any issues.
Generate Code
- Let the small model implement the refined specification.
Review Again
- Use the large model to catch mistakes the small model missed.
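
The four steps above can be sketched as a small pipeline. This is a minimal illustration, not a real client: the Model type, run_workflow, and the fake_small and fake_large stubs are all hypothetical, and in practice each callable would wrap whatever API or local runtime you use.

```python
from typing import Callable, List, Optional

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_workflow(task: str, small: Model, large: Model,
                 planner: Optional[Model] = None) -> str:
    """Plan, review, generate, review again; returns the final reviewed code."""
    planner = planner or small  # plan with the small model unless told otherwise
    plan = planner(f"Draft a plan for: {task}")
    reviewed_plan = large(f"Review this plan and fix any issues:\n{plan}")
    code = small(f"Implement this specification:\n{reviewed_plan}")
    return large(f"Review this code and fix mistakes:\n{code}")


# Stub models that only record which one was called, for demonstration.
calls: List[str] = []


def fake_small(prompt: str) -> str:
    calls.append("small")
    return "small-output"


def fake_large(prompt: str) -> str:
    calls.append("large")
    return "large-output"


result = run_workflow("add a retry wrapper", fake_small, fake_large)
print(calls)
# prints "['small', 'large', 'small', 'large']"
```

Passing planner=fake_large would trade planning speed for precision, matching the choice described in the Plan step.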

Model Choices

Small model: Qwen 3.6 35B MoE – fast enough to run locally and produces reasonable boilerplate.
Large model: Acts primarily as a reviewer rather than a first‑pass generator.

Limitations

This approach hasn’t been extensively tested on novel codebases.
For truly new problems, writing the initial code yourself and then using the small model for repetitive tasks (e.g., generating tests and boilerplate) works best.
