Claude Code Experiment: More Tokens Doesn't Mean Better Code
Source: Dev.to
Introduction
As we kick off the new year, many companies are accelerating their adoption of AI tooling for productivity, often treating token usage as a metric of AI adoption. After completing Anthropic’s “Claude Code in Action” course, I set out to test a simple hypothesis:
Hypothesis: Claude Code features follow a diminishing‑returns curve—beyond a certain point, more tokens do not produce better code.
Experiment Overview
I built a CLI Tic‑Tac‑Toe game four times, each using a different Claude Code technique. For each run I logged token usage, ran a QA agent to find bugs, and prompted a senior‑engineer role to assess code quality. From these data I derived a Quality‑per‑Token (QPT) metric to compare the techniques.
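The article does not spell out the QPT formula, but the reported numbers are consistent with quality score divided by tokens in thousands. A small sketch under that assumption:

```javascript
// Assumed formula (inferred from the results table, not stated
// explicitly in the article): QPT = quality / (tokens / 1000).
function qpt(quality, tokens) {
  return quality / (tokens / 1000);
}

// The four runs as reported in the Results section.
const runs = [
  { approach: "CLAUDE.md", tokens: 25767, quality: 4.9 },
  { approach: "CLAUDE.md + --plan", tokens: 32191, quality: 4.6 },
  { approach: "Zero-shot", tokens: 42737, quality: 4.8 },
  { approach: "--plan", tokens: 52910, quality: 4.8 },
];

for (const r of runs) {
  console.log(`${r.approach}: QPT = ${qpt(r.quality, r.tokens).toFixed(3)}`);
}
```

With this reading, the computed values match the table (0.190, 0.143, 0.112, 0.091).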
Techniques Tested
| Technique | Description |
|---|---|
| Zero‑shot | Raw prompt, no context |
| Plan Mode | Explicit planning step before execution (--plan) |
| CLAUDE.md | Project‑level context file, no planning |
| CLAUDE.md + Plan Mode | Combined context file and planning step |
All runs used claude --verbose to capture token counts. A fresh Claude session was started for each experiment.
Experiment Details
1. Zero‑shot Prompt
Can you create a CLI tic tac toe game using vanilla javascript with minimal dependencies? I would like it to be tested using jest and I would like it to be good quality code using classes.
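The generated code itself is not reproduced in the article; as an illustration only, the kind of class-based, Jest-testable implementation the prompt asks for might look like this (names and structure are hypothetical):

```javascript
// Illustrative sketch only -- not the code Claude actually generated.
// A minimal class-based tic-tac-toe core, testable with Jest.
class TicTacToe {
  constructor() {
    this.board = Array(9).fill(null); // cells 0-8
    this.current = "X";
  }

  // Place the current player's mark; returns false on illegal moves.
  move(index) {
    if (this.board[index] || this.winner()) return false;
    this.board[index] = this.current;
    this.current = this.current === "X" ? "O" : "X";
    return true;
  }

  // Returns "X", "O", or null if no winning line is complete.
  winner() {
    const lines = [
      [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
      [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
      [0, 4, 8], [2, 4, 6],            // diagonals
    ];
    for (const [a, b, c] of lines) {
      if (this.board[a] && this.board[a] === this.board[b] && this.board[a] === this.board[c]) {
        return this.board[a];
      }
    }
    return null;
  }

  isDraw() {
    return !this.winner() && this.board.every(Boolean);
  }
}

module.exports = TicTacToe; // exported so Jest tests can require it
```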
2. Plan Mode
--plan
3. CLAUDE.md Context File
A CLAUDE.md file was placed in the project directory (example shown below):
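The actual file is not reproduced here; a plausible context file for this project, based on the requirements in the zero-shot prompt, might look like:

```markdown
# Project: CLI Tic-Tac-Toe

- Vanilla JavaScript (Node.js), minimal dependencies
- Class-based design
- Tests written with Jest
- Prefer small, readable methods and clear naming
```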

Command run:
create the game
4. CLAUDE.md + Plan Mode
--plan create the game
Assessing Quality
A role‑based prompt was used to score the generated code:
You are a senior engineer, can you assess this code for quality and give it a score 1‑5 on correctness, clarity, structure, maintainability, and extendability? Return the average of the scores.
Manual QA Agent
A test.js file (see gist) provided context for an agent to play 10 games and report bugs. All four iterations reported zero bugs.
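The real test.js lives in the author's gist; a self-contained sketch in the same spirit (hypothetical code, not the original) plays random games against a minimal board and checks a basic invariant:

```javascript
// Hypothetical QA harness sketch -- not the article's test.js.
// Plays one random game to completion and reports the outcome.
function playRandomGame() {
  const board = Array(9).fill(null);
  let player = "X";
  const lines = [
    [0, 1, 2], [3, 4, 5], [6, 7, 8],
    [0, 3, 6], [1, 4, 7], [2, 5, 8],
    [0, 4, 8], [2, 4, 6],
  ];
  const winner = () => {
    for (const [a, b, c] of lines) {
      if (board[a] && board[a] === board[b] && board[a] === board[c]) return board[a];
    }
    return null;
  };
  while (!winner() && board.includes(null)) {
    const empty = board.flatMap((v, i) => (v === null ? [i] : []));
    const pick = empty[Math.floor(Math.random() * empty.length)];
    if (board[pick] !== null) throw new Error("bug: cell overwritten");
    board[pick] = player;
    player = player === "X" ? "O" : "X";
  }
  return winner() ?? "draw";
}

// Mirror the article's setup: play 10 games and report results.
for (let i = 0; i < 10; i++) {
  console.log(`Game ${i + 1}: ${playRandomGame()}`);
}
```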
Results
| Approach | Tokens | Quality (avg.) | QPT |
|---|---|---|---|
| CLAUDE.md | 25,767 | 4.9 | 0.190 |
| CLAUDE.md + --plan | 32,191 | 4.6 | 0.143 |
| Zero‑shot | 42,737 | 4.8 | 0.112 |
| --plan | 52,910 | 4.8 | 0.091 |
Conclusion
The hypothesis was partially confirmed: the least token‑expensive approach (CLAUDE.md only) delivered the highest quality. As token usage increased, quality either plateaued or declined. This suggests that context quality matters more than sheer token volume.
“Context is Currency… If you don’t give the model the right background and constraints, it will confidently give you the wrong answer.” — Rani Zilpelwar
The next wave after AI adoption will likely focus on optimization and return on AI investment, with a strong emphasis on context engineering. Well‑structured context appears to yield better results than simply feeding more tokens.

Cover Art: “Urania” depicted by Giacinto Gimignani, 1852