Claude Code Experiment: More Tokens Doesn't Mean Better Code

Published: January 13, 2026 at 08:05 PM EST
2 min read
Source: Dev.to


Introduction

As we kick off the new year, many companies are accelerating AI tooling for productivity, often using token usage as a metric for AI adoption. After completing Anthropic’s “Claude Code in Action” course, I set out to test a simple hypothesis:

Hypothesis: Claude Code features follow a diminishing‑returns curve—beyond a certain point, more tokens do not produce better code.

Experiment Overview

I built a CLI Tic‑Tac‑Toe game four times, each using a different Claude Code technique. For each run I logged token usage, ran a QA agent to find bugs, and prompted a senior‑engineer role to assess code quality. From these data I derived a Quality‑per‑Token (QPT) metric to compare the techniques.
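The QPT metric itself is simple arithmetic. As a sketch (the per-1,000-token scaling is my assumption, chosen because it reproduces the figures reported in the results table; the run data below is copied from the article):

```javascript
// Quality-per-Token (QPT): average quality score divided by token usage,
// expressed per 1,000 tokens so the values stay readable.
function qualityPerToken(qualityScore, tokensUsed) {
  return qualityScore / (tokensUsed / 1000);
}

// Token counts and quality scores from the four experiment runs.
const runs = [
  { approach: "CLAUDE.md",          tokens: 25767, quality: 4.9 },
  { approach: "CLAUDE.md + --plan", tokens: 32191, quality: 4.6 },
  { approach: "Zero-shot",          tokens: 42737, quality: 4.8 },
  { approach: "--plan",             tokens: 52910, quality: 4.8 },
];

for (const run of runs) {
  console.log(run.approach, qualityPerToken(run.quality, run.tokens).toFixed(3));
}
```

Dividing by tokens-in-thousands keeps the metric independent of run length, so a cheap run with a slightly lower score can still beat an expensive run with a slightly higher one.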

Techniques Tested

| Technique | Description |
| --- | --- |
| Zero-shot | Raw prompt, no context |
| Plan Mode | Explicit planning step before execution (`--plan`) |
| CLAUDE.md | Project-level context file, no planning |
| CLAUDE.md + Plan Mode | Combined context file and planning step |

All runs used claude --verbose to capture token counts. A fresh Claude session was started for each experiment.

Experiment Details

1. Zero‑shot Prompt

Can you create a CLI tic tac toe game using vanilla javascript with minimal dependencies? I would like it to be tested using jest and I would like it to be good quality code using classes.
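For readers unfamiliar with what this prompt asks for, here is a minimal, illustrative sketch of the class-based structure it describes. This is my own example, not the code any of the runs actually generated:

```javascript
// Illustrative sketch of a class-based Tic-Tac-Toe core, the kind of
// structure the zero-shot prompt requests. Not the generated code.
class TicTacToe {
  constructor() {
    this.board = Array(9).fill(null); // cells 0-8, row-major
    this.currentPlayer = "X";
  }

  // Place the current player's mark; returns false on an illegal move.
  move(index) {
    if (this.board[index] || this.winner()) return false;
    this.board[index] = this.currentPlayer;
    this.currentPlayer = this.currentPlayer === "X" ? "O" : "X";
    return true;
  }

  // Returns "X", "O", or null if no winning line exists yet.
  winner() {
    const lines = [
      [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
      [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
      [0, 4, 8], [2, 4, 6],            // diagonals
    ];
    for (const [a, b, c] of lines) {
      if (this.board[a] && this.board[a] === this.board[b] && this.board[a] === this.board[c]) {
        return this.board[a];
      }
    }
    return null;
  }
}
```

A class like this is straightforward to unit-test with Jest, which is presumably why the prompt pairs "classes" with "tested using jest".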

2. Plan Mode

--plan

3. CLAUDE.md Context File

A CLAUDE.md file was placed in the project directory (example shown below):

(Image: the CLAUDE.md file open in a vim window)

Command run:

create the game

4. CLAUDE.md + Plan Mode

--plan create the game

Assessing Quality

A role‑based prompt was used to score the generated code:

You are a senior engineer, can you assess this code for quality and give it a score 1‑5 on correctness, clarity, structure, maintainability, and extendability? Return the average of the scores.

Manual QA Agent

A test.js file (see gist) provided context for an agent to play 10 games and report bugs. All four iterations reported zero bugs.

Results

| Approach | Tokens | Quality (avg.) | QPT |
| --- | --- | --- | --- |
| CLAUDE.md | 25,767 | 4.9 | 0.190 |
| CLAUDE.md + `--plan` | 32,191 | 4.6 | 0.143 |
| Zero-shot | 42,737 | 4.8 | 0.112 |
| `--plan` | 52,910 | 4.8 | 0.091 |

Conclusion

The hypothesis was partially confirmed: the least token‑expensive approach (CLAUDE.md only) delivered the highest quality. As token usage increased, quality either plateaued or declined. This suggests that context quality matters more than sheer token volume.

“Context is Currency… If you don’t give the model the right background and constraints, it will confidently give you the wrong answer.” — Rani Zilpelwar

The next wave after AI adoption will likely focus on optimization and return on AI investment, with a strong emphasis on context engineering. Well‑structured context appears to yield better results than simply feeding more tokens.

(Image: Shopify CEO quote)

Cover Art: “Urania” depicted by Giacinto Gimignani, 1852
