7 Magic Words That Make Your LLM 10 Smarter at Math

Published: 3 days ago (June 7, 2026 at 05:56 PM EDT)

5 min read

Source: Dev.to

🌐 Live demo (LOOK · UNDERSTAND · BUILD): https://dev48v.infy.uk/prompt/day2-chain-of-thought.html

Day 2 of my PromptFromZero series — 50 LLM techniques in 50 days, each visualized with LOOK / UNDERSTAND / BUILD.

Today: Chain of Thought (CoT). The single highest-impact prompt change you can make. Costs nothing. Adds 7 words. Often turns wrong answers into right ones.

The setup

Same problem. Same model. Two prompts.

Roger has 5 tennis balls. He buys 2 cans of 3 balls each.
How many balls does he have now?

Enter fullscreen mode


Exit fullscreen mode

Prompt A — “just answer”

…question… Just answer with the number, nothing else.

Enter fullscreen mode


Exit fullscreen mode

Small / older models often answer: 8. Wrong.

Prompt B — Chain of Thought

…same question… Let's think step by step.

Enter fullscreen mode


Exit fullscreen mode

Model writes:

Roger starts with 5 balls.
He buys 2 cans, each holding 3 balls.
2 × 3 = 6 new balls.
5 + 6 = 11.

Final answer: 11.

Enter fullscreen mode


Exit fullscreen mode

Right.

Same model. Same problem. Seven extra words on the prompt. The accuracy boost on multi-step math problems is consistently massive.

Why it works

LLMs generate one token at a time, each token conditioned on every token that came before. If you ask for the answer with no working, the model has to compress the whole computation into a single number prediction. There’s nowhere to “scratch paper”.

Chain of Thought forces the model to write the scratch paper out. Each step becomes additional context for the next step. By the time it gets to “Final answer:”, the arithmetic is already on the page — anchored to real numbers, not vibes.

More tokens spent = more compute per problem = more reasoning capacity. CoT is literally trading latency for accuracy.

When to use it

Use CoT Skip CoT

Math word problems Factual lookups (“What’s the capital of France?”)

Multi-step logical reasoning Creative writing

Cause-and-effect chains Short summaries

Subtle classifications Code completion

Heuristic: if you would write scratch-paper math yourself, the model will benefit from CoT.

Build it in 10 minutes

mkdir cot-from-zero && cd cot-from-zero
npm init -y
npm install ai @ai-sdk/google
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_key_here" > .env

Enter fullscreen mode


Exit fullscreen mode

Get a free Gemini key at https://aistudio.google.com/apikey (no credit card).

// cot.mjs
import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const model = google("gemini-2.5-flash");
const problem = "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?";

const bad = await generateText({
  model,
  prompt: problem + "\n\nJust answer with the number, nothing else."
});

const good = await generateText({
  model,
  prompt: problem + "\n\nLet's think step by step."
});

console.log("=== Without CoT ===\n" + bad.text);
console.log("\n=== With CoT ===\n" + good.text);

Enter fullscreen mode


Exit fullscreen mode

node --env-file=.env cot.mjs

Enter fullscreen mode


Exit fullscreen mode

Two runs of the same model on the same problem, side by side. The difference is visible immediately.

Levels of CoT

Zero-shot CoT (above)

Just add “Let’s think step by step.” Works on most modern models.

Few-shot CoT

Prepend 2-3 worked examples before the question:

Q: Sara had 4 apples and got 2 more. How many?
A: Sara had 4. She got 2 more. 4 + 2 = 6. Answer: 6.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 each. How many balls?
A: [model continues in same format]

Enter fullscreen mode


Exit fullscreen mode

Better on harder problems — the model has explicit examples of the reasoning depth you want.

Structured CoT

Force a format:

"Solve this. Number your steps 1, 2, 3. Final answer on a new line starting 'Answer:'."

Enter fullscreen mode


Exit fullscreen mode

Easier to parse programmatically.

Hidden CoT

Generate the chain, then strip it before showing the user:

const reply = result.text;
const clean = reply.replace(/[\s\S]*?/g, '').trim();

Enter fullscreen mode


Exit fullscreen mode

User sees just the answer; the model gets the accuracy benefit.

What about reasoning models?

GPT-5, Claude 4 Sonnet, o1, o3, Gemini 2.5 — modern flagship models train with reasoning baked in. They don’t need “let’s think step by step.” They do it automatically.

But:

They cost 10× more per token
They’re slower (visible “thinking…” UI)
They’re overkill for simple tasks

Cheap model + CoT prompt ≈ reasoning model output, at ~10% of the cost. CoT is still the highest-leverage technique you can use on small models.

What this unlocks

CoT is the foundation. Every fancier reasoning technique builds on top:

Self-consistency — sample N CoT runs, take majority vote

ReAct — CoT + tool calls interleaved (Day 1)

Tree of Thoughts — branch CoT into multiple paths, evaluate

Reflection — generate, criticize own output, regenerate

Master CoT first. Everything else is variations.

Try it now

Three tabs on one page:

https://dev48v.infy.uk/prompt/day2-chain-of-thought.html

LOOK — animated side-by-side trace of both prompts

UNDERSTAND — 8 click-through steps on why CoT works

BUILD — copy the code, run it on your machine

What’s next in PromptFromZero

Day 3: Self-consistency. Sample 5 CoT runs, take majority vote. Same model, even higher accuracy.

Series: 50 LLM techniques · 50 days · Vercel AI SDK throughout.

🌐 All techniques: https://dev48v.infy.uk/promptfromzero.php

7 Magic Words That Make Your LLM 10 Smarter at Math

Related posts

Cats Disappear at Sunset. Can You Find Them in Time?

I'm so tired to code. Not even Vibe Coding... D:

I Built a Multi-Platform Publishing CLI for AI Agents

I built an embedded scheduler in Rust because I was tired of adding Redis just to run a background job