TOON for LLMs: A Benchmark Performance Analysis

Published: December 27, 2025 at 10:36 AM EST
8 min read
Source: Dev.to

Every API call you make with JSON is costing you more than you think

I ran real‑world extractions using Gemini 2.5 Flash, and the results were startling: JSON consistently used 30–40 % more output tokens than TOON format. In one test, JSON consumed 471 output tokens while TOON used just 227 – a 51 % reduction.

But here’s where it gets interesting: TOON initially failed 70 % of the time.

After optimization I achieved 100 % parsing success and discovered something counter‑intuitive: for one‑off extractions, JSON actually ends up cheaper overall, because TOON needs heavier prompts before it parses reliably. When I tested structured outputs with Pydantic models, SDK‑managed JSON needed just 389 output tokens, undercutting even my optimized TOON encoding.

The hidden goldmine? Tool/function calling

That’s where TOON’s compact format shines brightest, slashing token costs in agentic workflows where responses become the next prompt.

This isn’t theoretical. Below are the actual prompts, parsing errors, token counts, and code that took TOON from a 70 % failure rate to production‑ready. Whether TOON beats JSON depends on your use case — and I have the data to prove exactly when.


Let’s break down the numbers

Experiment #1 – The Initial TOON Failure (70 % Success Rate)

I started with a straightforward test: extracting structured job‑description data using TOON instead of JSON.

The Setup

My prompt was simple — ask Gemini 2.5 Flash to extract role, skills, experience, location, and responsibilities from a job posting. For the output format I did what seemed logical: I showed the structure exactly as the TOON encoder emits it, with empty placeholder values (essentially a drop‑in replacement for a JSON template).

Prompt

Extract Role, Primary Skills, Secondary Skills,
Minimum Experience, Maximum Experience,
Location, Employment Type, Summary, and Responsibilities

Job Description:

Output in TOON format:

Role: ""
"Primary Skills"[2]: Python,JavaScript
"Secondary Skills"[2]: Responsibility,Communication
"Minimum Experience": ""
"Maximum Experience": ""
Location: ""
"Employment Type": ""
Summary: ""
Responsibilities[2]: Task A,Task B

What I expected: By showing the encoded format with empty strings and generic placeholders, the model would understand the structure.

Reality check – 70 % failure rate

The errors were telling:

  • Error parsing TOON format for JD#2: Expected 10 values, but got 16
  • Error parsing TOON format for JD#5: Missing colon after key

The model was confused about arrays. Sometimes it output Skills: Python, JavaScript, React as a flat string; other times it attempted brackets but malformed the syntax.

Hypothesis: Showing only empty examples was the problem. The model needed to see real data patterns, especially for arrays.

Token Usage (TOON, Initial Attempts: 30 % → 70 % Success Rate)

| Metric | Value |
| --- | --- |
| Prompt tokens | 729 |
| Output tokens | 227 |
| Success rate | ~30 % initially, improved to 70 % after adding two real examples with populated arrays |

JSON Token Usage (same test)

| Metric | Value |
| --- | --- |
| Prompt tokens | 723 |
| Output tokens | 471 |

Key Insight

TOON’s compact syntax is unforgiving. JSON’s redundancy ({"key":"value"}) helps models self‑correct. TOON’s Key: value format offers no such safety net, so the model needed concrete examples, not abstract templates.
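To make the difference concrete, here's a rough sketch comparing the same record in both encodings (the flat encoder below is hand‑rolled to mirror the prompt's format, not the actual TOON library):

import json

record = {
    "Role": "Senior Data Scientist",
    "Primary Skills": ["Python", "JavaScript"],
}

# JSON: every key and value is quoted and delimited, so a dropped quote or
# comma stands out, and the model has plenty of structural cues to imitate.
print(json.dumps(record))
# {"Role": "Senior Data Scientist", "Primary Skills": ["Python", "JavaScript"]}

# TOON-style flat encoding (hand-rolled sketch, not the TOON library):
# one missing colon or a stray comma silently changes the value count.
for key, value in record.items():
    if isinstance(value, list):
        print(f'"{key}"[{len(value)}]: {",".join(value)}')
    else:
        print(f"{key}: {value}")
# Role: Senior Data Scientist
# "Primary Skills"[2]: Python,JavaScript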

But 70 % wasn’t good enough for production. Time to fix this properly.


Experiment #2 – Achieving 100 % Parsing Success (And the Token Trade‑off)

I needed to fix the 70 % success rate. The solution? Stop being minimalist with examples.

Revised Prompt

Extract Role, Primary Skills, Secondary Skills,
Minimum Experience, Maximum Experience,
Location, Employment Type, Summary, and Responsibilities

Job Description:

Output in TOON format. Example structure:

Role: "Senior Data Scientist"
Primary_Skills:
  [1]: "Machine Learning"
  [2]: "Statistical Analysis"
Secondary_Skills:
  [0]: "Big Data"
  [1]: "Cloud Platforms"
Minimum_Experience: "5 years"
Maximum_Experience: "10 years"
Location: "New York, NY or Remote"
Employment_Type: "Full-time"
Summary: "Lead data science initiatives"
Responsibilities:
  [0]: "Design ML models"
  [1]: "Analyze datasets"

Now provide the extraction in TOON format. Keep the format exactly as shown above.

Result: 100 % parsing. No more malformed arrays. No more missing colons.
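For reference, a minimal parser for the example structure above looks roughly like this (a sketch of the idea, not the exact parser used in the experiments):

def parse_toon(text: str) -> dict:
    """Minimal parser for the example structure above (sketch only)."""
    result: dict = {}
    current_list = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("["):                   # indented item, e.g. [0]: "Big Data"
            if current_list is None:
                raise ValueError(f"List item without a list key: {line!r}")
            _, _, value = line.partition(":")
            current_list.append(value.strip().strip('"'))
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Missing colon after key: {line!r}")
        key, value = key.strip(), value.strip().strip('"')
        if value:                                   # scalar field, e.g. Role: "..."
            result[key] = value
            current_list = None
        else:                                       # list field; items follow below
            current_list = []
            result[key] = current_list
    return result

# parse_toon(model_output)["Responsibilities"] -> ["Design ML models", "Analyze datasets"]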

Catch: The prompt got heavier.

Token Comparison – TOON vs JSON (same 10 job descriptions)

| Approach | Prompt Tokens | Output Tokens | Total Tokens | Success Rate |
| --- | --- | --- | --- | --- |
| JSON | 723 | 471 | 1,194 | 100 % |
| TOON – Initial | 729 | 227 ✅ | 956 | 70 % ❌ |
| TOON – Optimized | 802 ❌ (+11 % vs JSON) | 455 ✅ (3.4 % reduction vs JSON) | 1,257 | 100 % ✅ |

The Uncomfortable Truth

For basic extraction tasks, optimized TOON costs MORE than JSON.

  • Output is slightly more compact (455 vs 471 tokens).
  • The verbose prompting needed for 100 % reliability erases any savings.
  • In fact, you’re paying ~5 % more per request.

Why keep testing TOON?

Because the baseline comparison is misleading. Real‑world LLM applications don’t just extract data once — they use structured outputs for:

  • Pydantic model validation (native SDK support)
  • Tool/function calling (where output becomes input)
  • Multi‑turn agentic workflows

In those scenarios, the token savings from a compact output format can be significant, especially when the same structured data is passed around repeatedly.


Takeaway so far

  • JSON is forgiving and easier to get right with minimal prompting, but it burns more output tokens.
  • TOON can dramatically reduce output tokens, but you must invest in richer prompts (real examples, explicit array syntax) to achieve reliable parsing.
  • When structured data is re‑used (e.g., tool calls, agent loops), the token savings from TOON’s compact format can outweigh the extra prompt cost.

That's the manual‑prompting picture. Next, let's see what happens when the SDK handles the schema for you.

Experiment #3: Pydantic Models — Where the SDK Does the Heavy Lifting

Here’s where things get interesting. Modern LLM SDKs have first‑class support for structured outputs using Pydantic models. Instead of prompt engineering, you define a schema and let the SDK handle formatting.

Key difference: You don’t need to explain the output format in your prompt — the SDK extracts it from your Pydantic model automatically.

The Setup: Google’s GenAI SDK

I used the same job‑extraction task, but this time with a Pydantic model:

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
    config={
        "response_mime_type": "application/json",
        "response_schema": JobModel,
    },
)

Notice what’s missing: no output‑format instructions, no examples, no “Output as JSON with these exact keys.”

The SDK injects the schema behind the scenes.
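For context, the schema is just an ordinary Pydantic model covering the same fields as the prompt; a minimal version looks roughly like this (field names here are illustrative):

from pydantic import BaseModel

class JobModel(BaseModel):
    # Fields mirror the extraction prompt; the exact names are illustrative.
    role: str
    primary_skills: list[str]
    secondary_skills: list[str]
    minimum_experience: str
    maximum_experience: str
    location: str
    employment_type: str
    summary: str
    responsibilities: list[str]

# The SDK then hands back a typed object instead of raw text:
# job = response.parsed  # a JobModel instance, no manual parsing needed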

Token Comparison: Pydantic JSON vs. Manual TOON

| Metric | Pydantic + JSON (SDK‑Managed) | Manual TOON (Experiment #2) |
| --- | --- | --- |
| Prompt tokens | 647 ✅ (19.3 % less than optimized TOON) | 802 ❌ |
| Output tokens | 389 ✅ (14.5 % less than optimized TOON) | 455 ❌ |
| Success rate | 100 % ✅ | 100 % ✅ |
| Parsing | Native (SDK returns typed Python objects) | Custom (you write the parser) |

The Brutal Takeaway

For structured extraction with strong SDK support, Pydantic shines. Native Pydantic integration delivers:

  • ✅ Cleaner prompts (~155 fewer prompt tokens)
  • ✅ Smaller outputs (~66 fewer output tokens)
  • ✅ No custom parsing logic
  • ✅ Built‑in type validation
  • ✅ Parsed objects returned directly, ready to use
  • ✅ A much smoother developer experience

Because of this, I’ll increasingly rely on Pydantic and native parsing support for structured extraction. It’s simply more reliable and maintainable than handling parsing and validation manually.

Note: There is one scenario where JSON’s verbosity becomes a genuine liability: tool calling in agentic workflows. That’s where TOON finally proves its worth.


Experiment #4: Tool Calling — Where TOON Finally Wins

In agentic workflows, the LLM doesn’t just extract data once — it calls tools, receives results, and uses those results to reason further. The tool’s response becomes part of the next prompt. If that response is bloated with JSON syntax, you’re paying for it twice: once as output, once as input.

Insight: every token in a tool result gets paid for again as input. The model doesn't need the {"key": "value"} ceremony; it needs the data, efficiently encoded.

The Setup: Weather Agent with Function Calling

I built a simple agent that calls a get_current_weather function. The user asks for weather, the model calls the tool, the function returns data, and the model synthesizes a response.

Version A: JSON Tool Response

data = {
    "location": location,
    "current": {
        "temperature": "72 F",
        "condition": "sunny",
    },
    "forecast": forecast,
}

return json.dumps(data)   # Returns JSON string

Version B: TOON Tool Response

data = {
    "location": location,
    "current": {
        "temperature": "72 F",
        "condition": "sunny",
    },
    "forecast": forecast,
}

return encode(data)        # Returns TOON‑encoded string

Main code

from google import genai
from google.genai import types

client = genai.Client()  # assumes the Gemini API key is set in the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What is the weather like in New York? Share next 15 days forecast as well.",
    config=types.GenerateContentConfig(
        tools=[get_current_weather],  # the SDK runs the function-calling loop automatically
    ),
)
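Token counts like the ones below can be read straight off the response's usage metadata (attribute names as exposed by the google‑genai SDK):

print(response.text)  # the model's synthesized answer after the tool call

usage = response.usage_metadata
print("prompt tokens:", usage.prompt_token_count)
print("output tokens:", usage.candidates_token_count)
print("total tokens:", usage.total_token_count)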

Result Token Usage (TOON version; reductions are relative to the JSON run)

  • Initial prompt tokens: 152 (user message + tool definition)
  • Tool‑response tokens (fed back as input): 480 ✅ (24 % reduction)
  • Model’s final output: 384 tokens (slightly longer than the JSON run’s, but reasonable)
  • Total tokens: 1,016 ✅ (11.5 % reduction overall)

Why TOON Wins in Agentic Workflows

Single Tool Call

| Approach | Tokens for tool result |
| --- | --- |
| JSON | 632 |
| TOON | 480 |
| Savings | 152 tokens (24 %) |

Multi‑Turn Agent (5 tool calls)

  • JSON: 632 × 5 = 3,160 tokens
  • TOON: 480 × 5 = 2,400 tokens

Savings: 760 tokens (24 %)
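The per‑session arithmetic is simple enough to sketch directly:

json_tokens_per_tool_result = 632
toon_tokens_per_tool_result = 480
savings_per_call = json_tokens_per_tool_result - toon_tokens_per_tool_result  # 152 tokens

for calls in (5, 20):
    print(f"{calls} tool calls -> {savings_per_call * calls} tokens saved per session")
# 5 tool calls -> 760 tokens saved per session
# 20 tool calls -> 3040 tokens saved per session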

The Compounding Effect

  • Tool results are pure input tokens — you pay for them every time.
  • Verbosity multiplies — JSON’s braces, quotes, colons, and commas add 20‑30 % overhead for nested data.
  • No parsing penalty — the model consumes TOON just as easily (verified in follow‑up tests).
  • Scales with agent complexity — more tools = more savings.

The Bottom Line

After testing four different scenarios, the data tells us:

  • TOON loses at single extractions. Whether you’re doing manual prompting or using Pydantic models, JSON with SDK support is cleaner, cheaper, and more reliable. Native schema integration uses 17.6 % fewer total tokens than TOON’s manual approach (1,036 vs 1,257), every time.

  • TOON wins where it counts for agents: tool‑calling workflows. When an LLM’s output becomes the next prompt—when data cycles between model and functions repeatedly—TOON’s ~24 % reduction per tool call transforms from a curiosity into a tangible cost‑saving advantage.

In short: use Pydantic/JSON for straightforward structured extraction, and switch to TOON for any agentic, tool‑calling pipeline where the model repeatedly consumes its own tool outputs.

**Scaling up:** 20 tool calls save 3,040 tokens per session.

**The decision matrix is simple:**

- **Building a chatbot that extracts structured data?**  
  Use **JSON + Pydantic**.

- **Building an agent that calls tools 10+ times per session?**  
  Test **TOON**.

- **Building anything else?**  
  Profile first, optimize later.

Try It Yourself

I’ve open‑sourced all the experiments, prompts, and token measurements:
View complete code and results on GitHub Gist

The repository includes:

  • ✅ All four experiment setups with actual prompts
  • ✅ Token usage logs for every test case
  • ✅ Side‑by‑side comparison scripts
  • ✅ The job descriptions I used for testing

TOON isn’t magic — it’s math. The math only works when token efficiency genuinely matters. For most applications, JSON’s ecosystem advantages outweigh the savings, but for token‑heavy agentic workflows, TOON might just pay for itself.
