Two Efficient Technologies to Reduce AI Token Costs: TOON and Microsoft's LLMLingua-2
Source: Dev.to
Why Token Costs Matter
Building AI applications has never been more accessible. OpenAI’s GPT‑4, Anthropic’s Claude, and Google’s Gemini have turned what felt like science fiction a few years ago into everyday reality.
Enterprises are now creating:
- Intelligent agents
- Retrieval‑augmented generation (RAG) systems
- GenAI applications that solve complex business challenges at scale
But once you move from prototype to production, token costs hit hard:
- Every API call to a large language model (LLM) is billed per token.
- A token ≈ a word or part of a word.
- You pay for both the input (data + instructions) and the output (the model’s response).
Result: Many developers discover—often too late—that their data formatting and prompt design are inflating token usage by 40‑60 %.
Two Technologies That Cut Token Waste
| Technology | What It Does | Token Savings |
|---|---|---|
| TOON (Token‑Oriented Object Notation) | A data‑serialization format built for LLMs. | 30‑60 % fewer tokens for structured data |
| LLMLingua‑2 (Microsoft) | A prompt‑compression engine that removes 50‑80 % of a prompt while preserving meaning. | 50‑80 % fewer tokens for prompts |
The two solve different problems but share the same goal: dramatically lower AI costs.
TOON – Token‑Oriented Object Notation
What Is TOON?
TOON is a serialization format designed specifically for large language models. It blends:
- YAML‑style indentation for nested objects
- CSV‑style tabular layout for uniform arrays
Instead of repeating field names for every array element (as JSON does), TOON declares the field names once and then lists only the values.
JSON vs. TOON (Employee Example)
JSON (traditional)
{
  "team": [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"}
  ]
}
TOON (efficient)
team[3]{id,name,role}:
  1,Tej B,engineer
  2,Praveen V,designer
  3,Partha G,manager
Same data, far fewer tokens.
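You can verify the claim with any tokenizer. Here is a quick sketch using OpenAI’s tiktoken (exact counts vary by model and tokenizer, so treat the numbers as indicative):
import json
import tiktoken  # pip install tiktoken

team = [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"},
]
json_text = json.dumps({"team": team}, indent=2)
toon_text = (
    "team[3]{id,name,role}:\n"
    "  1,Tej B,engineer\n"
    "  2,Praveen V,designer\n"
    "  3,Partha G,manager"
)

# Same tokenizer GPT‑4‑class models use; counts differ slightly per model
enc = tiktoken.get_encoding("cl100k_base")
print("JSON tokens:", len(enc.encode(json_text)))
print("TOON tokens:", len(enc.encode(toon_text)))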
Performance Highlights
- In benchmarks published by the TOON project, TOON reached 73.9 % accuracy with 39.6 % fewer tokens, versus 69.7 % accuracy for JSON.
- On those benchmarks, LLMs understood TOON at least as well as JSON while reading far fewer tokens.
Real‑World Cost Example
| Data | Format | Approx. Tokens | Token Reduction |
|---|---|---|---|
| 100 products × 8 fields | JSON | ~12,000 | — |
| 100 products × 8 fields | TOON | ~6,000 | ≈ 50 % |
If you run thousands of such calls daily, you can save hundreds to thousands of dollars each month.
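That estimate is easy to sanity-check. A back-of-the-envelope sketch (the workload and per-token price below are placeholders; use your provider’s actual price sheet):
calls_per_day = 5_000               # hypothetical workload
tokens_saved_per_call = 6_000       # from the table above
price_per_1k_input_tokens = 0.0025  # placeholder $ value; check your provider

monthly_savings = (
    calls_per_day * 30 * tokens_saved_per_call / 1_000 * price_per_1k_input_tokens
)
print(f"≈ ${monthly_savings:,.0f} saved per month")  # ≈ $2,250 with these numbers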
Ideal Use‑Cases
- Uniform arrays of objects (e.g., customer records, product catalogs, transaction logs)
- Database query results sent to AI agents
- Analytics dashboards, sales reports, inventory data
- Any tabular or semi‑tabular data that an LLM must process
Note: For deeply nested or highly non‑uniform structures, JSON may still be more efficient. TOON is a specialized tool, not a universal replacement.
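Before reaching for TOON, a practical pre-check is whether your records are flat and share identical keys. The `is_uniform` helper below is a hypothetical sketch, not part of any TOON library:
def is_uniform(records: list) -> bool:
    """True if every record is a flat dict with the same keys, in the same order."""
    if not records or not all(isinstance(r, dict) for r in records):
        return False
    keys = list(records[0])
    return all(
        list(r) == keys
        and not any(isinstance(v, (dict, list)) for v in r.values())
        for r in records
    )

# Uniform → good TOON candidate; otherwise stick with JSON
print(is_uniform([{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]))  # True
print(is_uniform([{"id": 1}, {"id": 2, "extra": [1, 2]}]))           # False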
Installation
pip install toon-py
Basic Python Usage
from toon_py import encode, decode

products = [
    {"id": 101, "name": "Laptop", "price": 1299, "stock": 45},
    {"id": 102, "name": "Mouse", "price": 29, "stock": 230},
    {"id": 103, "name": "Keyboard", "price": 89, "stock": 156},
]

# Encode the list to TOON: field names declared once, then value rows
toon_data = encode(products)
print(toon_data)
# Output:
# [3]{id,name,price,stock}:
#   101,Laptop,1299,45
#   102,Mouse,29,230
#   103,Keyboard,89,156

# Use the compact representation directly in a prompt
prompt = f"Analyze this inventory:\n{toon_data}\n\nWhich products need restocking?"
# Send `prompt` to OpenAI, Claude, Gemini, etc. → 40‑60 % fewer input tokens
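The `decode` import above suggests round-tripping is supported; assuming it mirrors `encode` (an assumption, not verified here), recovering the original structure looks like:
# Round trip: TOON text back to Python objects (assumes decode mirrors encode)
restored = decode(toon_data)
assert restored == products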
Command‑Line Interface
# JSON → TOON
toon input.json -o output.toon
# TOON → JSON
toon data.toon -o output.json
LLMLingua‑2 – Prompt Compression
What Is LLMLingua‑2?
LLMLingua‑2 (Microsoft) tackles prompt length rather than data serialization. It treats compression as a token‑classification problem, using a Transformer encoder to decide which tokens are essential given the full bidirectional context.
- Trained via data distillation from GPT‑4, so its decisions about which tokens to keep reflect what an LLM actually needs.
- Acts like an expert editor that removes filler words, redundant phrases, and unnecessary context while preserving meaning.
Compression Power
- Up to 20× compression with minimal performance loss.
- 3‑6× faster than the original LLMLingua.
- Improves end‑to‑end latency by ≈ 1.6×.
When to Use It
- Long system instructions for AI agents
- Context passages from documents (e.g., legal text, research papers)
- Few‑shot examples or demonstrations
- Any prompt that approaches the model’s token limit
Putting It All Together
| Goal | Tool | How It Helps |
|---|---|---|
| Reduce token count for structured data | TOON | Compact serialization (field names declared once) |
| Reduce token count for prompts / instructions | LLMLingua‑2 | Intelligent removal of redundant wording while preserving semantics |
| Overall cost reduction | Both | 50‑80 % fewer tokens → lower API bills, faster latency |
Quick Checklist for Developers
- Identify data that is sent as uniform arrays → switch to TOON.
- Run LLMLingua‑2 on any prompt longer than ~500 tokens.
- Measure token usage before & after conversion/compression.
- Iterate: fine‑tune field ordering in TOON or adjust compression aggressiveness in LLMLingua‑2.
2.9× Average Compression (2×–5× Typical Range)
A 1,000‑token prompt compressed to 200 tokens isn’t just cheaper—it’s faster. Your users get responses quicker, you pay less, and everyone wins.
If you’re building Retrieval‑Augmented Generation (RAG) systems, LLMLingua‑2 is a game‑changer. RAG applications often pull 10‑20 document chunks to answer a single question, which means massive context to send to your LLM.
LLMLingua mitigates the “lost in the middle” issue in LLMs, enhancing long‑context information processing. By compressing retrieved context, you keep all the important information while dramatically reducing token count.
LLMLingua has been integrated into LangChain and LlamaIndex, two widely‑used RAG frameworks.
Installation
pip install llmlingua
Basic Compression
from llmlingua import PromptCompressor

# Load the LLMLingua‑2 compressor (downloads the model on first use)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

context = """
The quarterly financial report shows strong growth in Q4 2024.
Revenue increased by 28% compared to Q3, primarily driven by
enterprise sales. Operating costs decreased by 12% due to
improved efficiency measures. Customer retention improved to 96%,
while new customer acquisition grew by 34%. The product team
shipped five major features that significantly increased user
engagement metrics across all segments...
"""

question = "What were the main growth drivers in Q4?"
prompt = f"{context}\n\nQuestion: {question}"

compressed = compressor.compress_prompt(
    prompt,
    rate=0.5,                  # keep ~50 % of the tokens (50 % compression)
    force_tokens=['\n', '?'],  # never drop these; preserves formatting
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Compressed prompt: {compressed['compressed_prompt']}")
With LangChain RAG
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMLinguaCompressor
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `documents` is assumed to be a list of already-loaded LangChain Documents
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Wrap the retriever so every retrieved chunk is compressed before reaching the LLM
compressor = LLMLinguaCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What are the key findings from the research?"
)
For Agentic AI
from llmlingua import PromptCompressor
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True,
)
agent_instructions = """
You are a financial analysis agent with access to market data,
company financials, and industry reports. Your task is to identify
investment opportunities by analyzing revenue trends, profit margins,
market positioning, competitive advantages, and growth potential.
Consider both quantitative metrics and qualitative factors...
"""
compressed = compressor.compress_prompt(
agent_instructions,
rate=0.4 # 60 % compression
)
agent_prompt = f"{compressed['compressed_prompt']}\n\nTask: Analyze Tesla's Q4 performance"
When to Use TOON vs. LLMLingua‑2
| Use Case | Recommended Tool |
|---|---|
| Structured data with repeated fields (customer lists, product catalogs, DB results) | TOON |
| Tabular or semi‑tabular data (sales reports, analytics, inventory) | TOON |
| AI agents processing data (arrays of objects with the same structure) | TOON |
| API responses (JSON from backend) | TOON |
| Long text prompts (instructions, explanations, guidelines) | LLMLingua‑2 |
| RAG systems (compress retrieved document context) | LLMLingua‑2 |
| Natural language (meeting transcripts, reports, articles) | LLMLingua‑2 |
| Multi‑step reasoning (complex chain‑of‑thought prompts) | LLMLingua‑2 |
| Sophisticated GenAI apps combining structured data & lengthy instructions | Both |
| High‑volume systems (thousands of AI API calls daily) | Both |
| Cost‑sensitive applications (token efficiency impacts profitability) | Both |
Combining TOON & LLMLingua‑2
from toon_py import encode
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

sales_data = [
    {"month": "Oct", "revenue": 450_000, "customers": 1_245, "churn": 23},
    {"month": "Nov", "revenue": 485_000, "customers": 1_312, "churn": 19},
    {"month": "Dec", "revenue": 520_000, "customers": 1_398, "churn": 15},
]

# Structured data → TOON (compact serialization)
toon_data = encode(sales_data)

instructions = """
Analyze the quarterly sales performance considering seasonal trends,
customer acquisition costs, competitive landscape changes, and
market conditions. Compare with historical data from the past
three years. Identify key growth drivers and potential risks.
Provide actionable recommendations for the sales team based on
data‑driven insights and market analysis...
"""

# Long instructions → LLMLingua‑2 (semantic compression)
compressed_instructions = compressor.compress_prompt(instructions, rate=0.5)

final_prompt = f"""
{compressed_instructions['compressed_prompt']}

Q4 Sales Data:
{toon_data}

Question: What's the trend and what should we do next quarter?
"""

# Send `final_prompt` to your LLM of choice.
Bottom Line
Building with AI isn’t just about model capabilities—it’s about sustainable economics. Leveraging LLMLingua‑2 for long‑form text and TOON for structured data gives you maximum token efficiency, lower costs, and faster responses. 🚀
AI Application Costs vs. Revenue
If token costs grow faster than revenue, your AI application will fail. TOON and LLMLingua‑2 give you breathing room. They let you:
- Ship features faster without constantly optimizing for token costs
- Scale sustainably as your user base grows
- Compete effectively even against companies with bigger budgets
- Build richer experiences because you’re not cutting features to save tokens
Both technologies are production‑ready, open‑source, and actively maintained.
TOON
- Installation: pip install toon-py
- Multiple language implementations available
- Integration time: ~5 minutes to add to existing applications
LLMLingua‑2
- Installation: pip install llmlingua
- Integrated with LangChain and LlamaIndex
- Backed by Microsoft Research with ongoing development
Getting Started (No Full Rewrite Needed)
- Identify your most expensive API calls (log tokens per endpoint; see the sketch after this list).
- Test:
- Use TOON on structured‑data endpoints.
- Use LLMLingua‑2 on text‑heavy prompts.
- Measure actual savings (tokens before vs. after).
- Roll out gradually across your application.
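For step 1, most provider SDKs return token counts on every response. A minimal sketch with the OpenAI Python SDK (the model name is arbitrary, and the fields shown follow the v1 chat-completions shape; adapt to your provider):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": "Hello"}],
)

# Log these per endpoint to find your most expensive calls
usage = response.usage
print(f"input={usage.prompt_tokens} output={usage.completion_tokens} total={usage.total_tokens}")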
Why It Matters
The AI revolution is expensive. Smart developers are finding ways to make it affordable. TOON and LLMLingua‑2 are two of the most effective tools available today.
Start cutting your API bills now.