Claude Code Experiment: More Tokens Doesn't Mean Better Code

Published: January 13, 2026 at 08:05 PM EST
2 min read
Source: Dev.to


Introduction

As we kick off the new year, many companies are accelerating AI tooling for productivity, often using token usage as a metric for AI adoption. After completing Anthropic’s “Claude Code in Action” course, I set out to test a simple hypothesis:

Hypothesis: Claude Code features follow a diminishing‑returns curve—beyond a certain point, more tokens do not produce better code.

Experiment Overview

I built a CLI Tic‑Tac‑Toe game four times, each using a different Claude Code technique. For each run I logged token usage, ran a QA agent to find bugs, and prompted a senior‑engineer role to assess code quality. From these data I derived a Quality‑per‑Token (QPT) metric to compare the techniques.
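The QPT metric itself is simple arithmetic. As a sketch (the per-1,000-token scaling is my assumption, chosen because it reproduces the figures reported in the results table; the run data below is copied from the article):

```javascript
// Quality-per-Token (QPT): average quality score divided by token usage,
// expressed per 1,000 tokens so the values stay readable.
function qualityPerToken(qualityScore, tokensUsed) {
  return qualityScore / (tokensUsed / 1000);
}

// Token counts and quality scores from the four experiment runs.
const runs = [
  { approach: "CLAUDE.md",          tokens: 25767, quality: 4.9 },
  { approach: "CLAUDE.md + --plan", tokens: 32191, quality: 4.6 },
  { approach: "Zero-shot",          tokens: 42737, quality: 4.8 },
  { approach: "--plan",             tokens: 52910, quality: 4.8 },
];

for (const run of runs) {
  console.log(run.approach, qualityPerToken(run.quality, run.tokens).toFixed(3));
}
```

Dividing by tokens-in-thousands keeps the metric independent of run length, so a cheap run with a slightly lower score can still beat an expensive run with a slightly higher one.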

Techniques Tested

| Technique | Description |
| --- | --- |
| Zero-shot | Raw prompt, no context |
| Plan Mode | Explicit planning step before execution (`--plan`) |
| CLAUDE.md | Project-level context file, no planning |
| CLAUDE.md + Plan Mode | Combined context file and planning step |

All runs used claude --verbose to capture token counts. A fresh Claude session was started for each experiment.

Experiment Details

1. Zero‑shot Prompt

Can you create a CLI tic tac toe game using vanilla javascript with minimal dependencies? I would like it to be tested using jest and I would like it to be good quality code using classes.
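For readers unfamiliar with what this prompt asks for, here is a minimal, illustrative sketch of the class-based structure it describes. This is my own example, not the code any of the runs actually generated:

```javascript
// Illustrative sketch of a class-based Tic-Tac-Toe core, the kind of
// structure the zero-shot prompt requests. Not the generated code.
class TicTacToe {
  constructor() {
    this.board = Array(9).fill(null); // cells 0-8, row-major
    this.currentPlayer = "X";
  }

  // Place the current player's mark; returns false on an illegal move.
  move(index) {
    if (this.board[index] || this.winner()) return false;
    this.board[index] = this.currentPlayer;
    this.currentPlayer = this.currentPlayer === "X" ? "O" : "X";
    return true;
  }

  // Returns "X", "O", or null if no winning line exists yet.
  winner() {
    const lines = [
      [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
      [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
      [0, 4, 8], [2, 4, 6],            // diagonals
    ];
    for (const [a, b, c] of lines) {
      if (this.board[a] && this.board[a] === this.board[b] && this.board[a] === this.board[c]) {
        return this.board[a];
      }
    }
    return null;
  }
}
```

A class like this is straightforward to unit-test with Jest, which is presumably why the prompt pairs "classes" with "tested using jest".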

2. Plan Mode

--plan

3. CLAUDE.md Context File

A CLAUDE.md file was placed in the project directory (example shown below):

(Image: the CLAUDE.md file open in a vim window)

Command run:

create the game

4. CLAUDE.md + Plan Mode

--plan create the game

Assessing Quality

A role‑based prompt was used to score the generated code:

You are a senior engineer, can you assess this code for quality and give it a score 1‑5 on correctness, clarity, structure, maintainability, and extendability? Return the average of the scores.

Manual QA Agent

A test.js file (see gist) provided context for an agent to play 10 games and report bugs. All four iterations reported zero bugs.

Results

| Approach | Tokens | Quality (avg.) | QPT |
| --- | --- | --- | --- |
| CLAUDE.md | 25,767 | 4.9 | 0.190 |
| CLAUDE.md + `--plan` | 32,191 | 4.6 | 0.143 |
| Zero-shot | 42,737 | 4.8 | 0.112 |
| `--plan` | 52,910 | 4.8 | 0.091 |

Conclusion

The hypothesis was partially confirmed: the least token‑expensive approach (CLAUDE.md only) delivered the highest quality. As token usage increased, quality either plateaued or declined. This suggests that context quality matters more than sheer token volume.

“Context is Currency… If you don’t give the model the right background and constraints, it will confidently give you the wrong answer.” — Rani Zilpelwar

The next wave after AI adoption will likely focus on optimization and return on AI investment, with a strong emphasis on context engineering. Well‑structured context appears to yield better results than simply feeding more tokens.

(Image: Shopify CEO quote)

Cover Art: “Urania” depicted by Giacinto Gimignani, 1852
