Two Efficient Technologies to Reduce AI Token Costs: TOON and Microsoft's LLMLingua-2

Published: December 22, 2025
7 min read
Source: Dev.to

Why Token Costs Matter

Building AI applications has never been more accessible. OpenAI’s GPT‑4, Anthropic’s Claude, and Google’s Gemini have turned what felt like science‑fiction a few years ago into everyday reality.

Enterprises are now creating:

  • Intelligent agents
  • Retrieval‑augmented generation (RAG) systems
  • GenAI applications that solve complex business challenges at scale

But once you move from prototype to production, token costs hit hard:

  • Every API call to a large language model (LLM) is billed per token.
  • A token ≈ a word or part of a word.
  • You pay for both the input (data + instructions) and the output (the model’s response).

Result: many developers discover, often too late, that their data formatting and prompt design are inflating token usage by 40–60%.
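To see why per-token billing adds up, here is a back-of-the-envelope cost calculator. The prices are illustrative placeholders, not any provider's actual rates — check your provider's current pricing:

```python
def call_cost(input_tokens, output_tokens,
              in_price_per_1k=0.01, out_price_per_1k=0.03):
    """Estimate the dollar cost of a single LLM API call.

    Prices are example values ($ per 1,000 tokens); real rates vary by
    provider and model.
    """
    return (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)

# A 2,000-token prompt with a 500-token response:
cost = call_cost(2000, 500)   # 0.02 + 0.015 = $0.035 per call
# At 10,000 calls/day that is $350/day — so a 50% input reduction matters.
```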

Two Technologies That Cut Token Waste

| Technology | What It Does | Token Savings |
| --- | --- | --- |
| TOON (Token‑Oriented Object Notation) | A data‑serialization format built for LLMs | 30–60% fewer tokens for structured data |
| LLMLingua‑2 (Microsoft) | A prompt‑compression engine that removes 50–80% of a prompt while preserving meaning | 50–80% fewer tokens for prompts |

Both solve different problems but share the same goal: dramatically lower AI costs.

TOON – Token‑Oriented Object Notation

What Is TOON?

TOON is a serialization format designed specifically for large language models. It blends:

  • YAML‑style indentation for nested objects
  • CSV‑style tabular layout for uniform arrays

Instead of repeating field names for every array element (as JSON does), TOON declares the field names once and then lists only the values.

JSON vs. TOON (Employee Example)

JSON (traditional)

{
  "team": [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"}
  ]
}

TOON (efficient)

team[3]{id,name,role}:
1,Tej B,engineer
2,Praveen V,designer
3,Partha G,manager

Same data, far fewer tokens.
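You can verify the size gap yourself with nothing but the standard library. Character count is only a crude stand-in for token count, but the difference is already visible at three records:

```python
import json

team = [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"},
]

json_text = json.dumps({"team": team})
toon_text = (
    "team[3]{id,name,role}:\n"
    "1,Tej B,engineer\n"
    "2,Praveen V,designer\n"
    "3,Partha G,manager"
)

# JSON repeats "id", "name", "role" (plus quotes and braces) per row;
# TOON declares them once, so the payload is measurably smaller.
assert len(toon_text) < len(json_text)
```

The gap widens with row count: field-name overhead in JSON grows linearly with the number of records, while TOON pays it once.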

Performance Highlights

  • In benchmarks published by the TOON project, models answered with 73.9 % accuracy using 39.6 % fewer tokens, versus 69.7 % accuracy on the equivalent JSON.
  • In those tests, LLMs didn’t just read TOON more cheaply than JSON — they read it more accurately.

Real‑World Cost Example

| Data | Format | Approx. Tokens | Token Reduction |
| --- | --- | --- | --- |
| 100 products × 8 fields | JSON | ~12,000 | (baseline) |
| 100 products × 8 fields | TOON | ~6,000 | ≈ 50% |

If you run thousands of such calls daily, you can save hundreds to thousands of dollars each month.

Ideal Use‑Cases

  • Uniform arrays of objects (e.g., customer records, product catalogs, transaction logs)
  • Database query results sent to AI agents
  • Analytics dashboards, sales reports, inventory data
  • Any tabular or semi‑tabular data that an LLM must process

Note: For deeply nested or highly non‑uniform structures, JSON may still be more efficient. TOON is a specialized tool, not a universal replacement.
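To make the tabular idea concrete, here is a minimal pure-Python sketch of the encoding for the uniform-array case. This is illustrative only — it handles none of TOON's nesting, quoting, or escaping rules; use the toon-py library in practice:

```python
def toon_encode(name, rows):
    """Encode a uniform list of dicts in TOON's tabular style.

    Assumes every dict has the same keys in the same order; values
    containing commas or newlines would need escaping (not handled here).
    """
    keys = list(rows[0])
    # Declare the field names once in the header...
    header = f"{name}[{len(rows)}]{{{','.join(keys)}}}:"
    # ...then emit only the values, CSV-style, one row per element.
    body = [",".join(str(row[k]) for k in keys) for row in rows]
    return "\n".join([header, *body])

team = [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"},
]
print(toon_encode("team", team))
# team[3]{id,name,role}:
# 1,Tej B,engineer
# 2,Praveen V,designer
# 3,Partha G,manager
```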

Installation

pip install toon-py

Basic Python Usage

from toon_py import encode, decode

products = [
    {"id": 101, "name": "Laptop",   "price": 1299, "stock": 45},
    {"id": 102, "name": "Mouse",    "price":   29, "stock": 230},
    {"id": 103, "name": "Keyboard", "price":   89, "stock": 156}
]

# Encode to TOON
toon_data = encode(products)
print(toon_data)
# ──> [3]{id,name,price,stock}:
#     101,Laptop,1299,45
#     102,Mouse,29,230
#     103,Keyboard,89,156

# Use in a prompt
prompt = f"Analyze this inventory:\n{toon_data}\n\nWhich products need restocking?"
# Send `prompt` to OpenAI, Claude, Gemini, etc.
# → Save 40‑60 % on tokens!

Command‑Line Interface

# JSON → TOON
toon input.json -o output.toon

# TOON → JSON
toon data.toon -o output.json

LLMLingua‑2 – Prompt Compression

What Is LLMLingua‑2?

LLMLingua‑2 (Microsoft) tackles prompt length rather than data serialization. It treats compression as a token‑classification problem, using a Transformer encoder to decide which tokens are essential given the full bidirectional context.

  • Trained via data distillation from GPT‑4, so it knows exactly what LLMs need.
  • Acts like an expert editor that removes filler words, redundant phrases, and unnecessary context while preserving meaning.
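Conceptually, token classification means scoring each token for importance and keeping only the highest-scoring fraction. The toy sketch below illustrates that idea with made-up scores — the real model produces the scores with a Transformer encoder, which this does not attempt to replicate:

```python
def compress_by_rate(tokens, scores, rate=0.5):
    """Keep the highest-scoring `rate` fraction of tokens, in order.

    A toy illustration of LLMLingua-2's token-classification idea:
    `scores` stands in for the per-token keep-probabilities that the
    real model predicts from bidirectional context.
    """
    k = max(1, int(len(tokens) * rate))
    keep = set(sorted(range(len(tokens)),
                      key=lambda i: scores[i], reverse=True)[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep]

tokens = ["the", "report", "shows", "strong", "growth"]
scores = [0.1, 0.9, 0.3, 0.8, 0.95]   # hypothetical importance scores
print(compress_by_rate(tokens, scores, rate=0.6))
# ['report', 'strong', 'growth']
```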

Compression Power

  • Up to 20× compression with minimal performance loss.
  • 3‑6× faster than the original LLMLingua.
  • Improves end‑to‑end latency by ≈ 1.6×.

When to Use It

  • Long system instructions for AI agents
  • Context passages from documents (e.g., legal text, research papers)
  • Few‑shot examples or demonstrations
  • Any prompt that approaches the model’s token limit

Putting It All Together

| Goal | Tool | How It Helps |
| --- | --- | --- |
| Reduce token count for structured data | TOON | Compact serialization (field names declared once) |
| Reduce token count for prompts / instructions | LLMLingua‑2 | Intelligent removal of redundant wording while preserving semantics |
| Overall cost reduction | Both | 50–80% fewer tokens → lower API bills, faster latency |

Quick Checklist for Developers

  1. Identify data that is sent as uniform arrays → switch to TOON.
  2. Run LLMLingua‑2 on any prompt longer than ~500 tokens.
  3. Measure token usage before & after conversion/compression.
  4. Iterate: fine‑tune field ordering in TOON or adjust compression aggressiveness in LLMLingua‑2.
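Step 3 of the checklist — measuring before and after — can start with a crude character-based estimate before you wire up a real tokenizer (a library such as tiktoken will give exact counts for OpenAI models):

```python
def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token for English text.

    Good enough for before/after comparisons; swap in a real tokenizer
    for billing-accurate numbers.
    """
    return max(1, len(text) // 4)

def savings_percent(before, after):
    """Percentage of estimated tokens saved by a conversion/compression."""
    return round(100 * (1 - estimate_tokens(after) / estimate_tokens(before)), 1)

# Example: a payload that shrinks to half its size saves ~50%.
print(savings_percent("x" * 400, "x" * 200))   # 50.0
```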

Compression in Practice: 2×–5× Ratios (≈ 2.9× Typical)

A 1,000‑token prompt compressed to 200 tokens isn’t just cheaper—it’s faster. Your users get responses quicker, you pay less, and everyone wins.

If you’re building Retrieval‑Augmented Generation (RAG) systems, LLMLingua‑2 is a game‑changer. RAG applications often pull 10‑20 document chunks to answer a single question, which means massive context to send to your LLM.

LLMLingua mitigates the “lost in the middle” issue in LLMs, enhancing long‑context information processing. By compressing retrieved context, you keep all the important information while dramatically reducing token count.

LLMLingua has been integrated into LangChain and LlamaIndex, two widely‑used RAG frameworks.

Installation

pip install llmlingua

Basic Compression

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

context = """
The quarterly financial report shows strong growth in Q4 2024.
Revenue increased by 28% compared to Q3, primarily driven by
enterprise sales. Operating costs decreased by 12% due to
improved efficiency measures. Customer retention improved to 96%,
while new customer acquisition grew by 34%. The product team
shipped five major features that significantly increased user
engagement metrics across all segments...
"""

question = "What were the main growth drivers in Q4?"
prompt = f"{context}\n\nQuestion: {question}"

compressed = compressor.compress_prompt(
    prompt,
    rate=0.5,                     # Target 50 % compression
    force_tokens=['\n', '?']      # Preserve important formatting
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Compressed prompt: {compressed['compressed_prompt']}")

With LangChain RAG

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMLinguaCompressor
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

compressor = LLMLinguaCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

compressed_docs = compression_retriever.invoke(
    "What are the key findings from the research?"
)
# On older LangChain versions, use get_relevant_documents() instead of invoke().

For Agentic AI

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

agent_instructions = """
You are a financial analysis agent with access to market data,
company financials, and industry reports. Your task is to identify
investment opportunities by analyzing revenue trends, profit margins,
market positioning, competitive advantages, and growth potential.
Consider both quantitative metrics and qualitative factors...
"""

compressed = compressor.compress_prompt(
    agent_instructions,
    rate=0.4  # 60 % compression
)

agent_prompt = f"{compressed['compressed_prompt']}\n\nTask: Analyze Tesla's Q4 performance"

When to Use TOON vs. LLMLingua‑2

| Use Case | Recommended Tool |
| --- | --- |
| Structured data with repeated fields (customer lists, product catalogs, DB results) | TOON |
| Tabular or semi‑tabular data (sales reports, analytics, inventory) | TOON |
| AI agents processing data (arrays of objects with the same structure) | TOON |
| API responses (JSON from backend) | TOON |
| Long text prompts (instructions, explanations, guidelines) | LLMLingua‑2 |
| RAG systems (compress retrieved document context) | LLMLingua‑2 |
| Natural language (meeting transcripts, reports, articles) | LLMLingua‑2 |
| Multi‑step reasoning (complex chain‑of‑thought prompts) | LLMLingua‑2 |
| Sophisticated GenAI apps combining structured data & lengthy instructions | Both |
| High‑volume systems (thousands of AI API calls daily) | Both |
| Cost‑sensitive applications (token efficiency impacts profitability) | Both |
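The table above can be approximated in code as a simple routing heuristic. This is an illustrative helper, not part of either library — it just checks whether a payload is a uniform array of objects (TOON territory) or free text (compression territory):

```python
def needs_toon(payload):
    """Return True if `payload` is a non-empty list of dicts that all
    share the same keys — the shape where TOON-style encoding pays off.
    Anything else (free text, nested/non-uniform data) is better served
    by prompt compression or plain JSON.
    """
    return bool(
        isinstance(payload, list)
        and payload
        and all(isinstance(row, dict) for row in payload)
        and len({frozenset(row) for row in payload}) == 1
    )

print(needs_toon([{"id": 1, "name": "Laptop"}, {"id": 2, "name": "Mouse"}]))  # True
print(needs_toon("Analyze the quarterly sales performance considering..."))   # False
```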

Combining TOON & LLMLingua‑2

from toon_py import encode
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

sales_data = [
    {"month": "Oct", "revenue": 450_000, "customers": 1_245, "churn": 23},
    {"month": "Nov", "revenue": 485_000, "customers": 1_312, "churn": 19},
    {"month": "Dec", "revenue": 520_000, "customers": 1_398, "churn": 15},
]
toon_data = encode(sales_data)

instructions = """
Analyze the quarterly sales performance considering seasonal trends,
customer acquisition costs, competitive landscape changes, and
market conditions. Compare with historical data from the past
three years. Identify key growth drivers and potential risks.
Provide actionable recommendations for the sales team based on
data‑driven insights and market analysis...
"""
compressed_instructions = compressor.compress_prompt(instructions, rate=0.5)

final_prompt = f"""
{compressed_instructions['compressed_prompt']}

Q4 Sales Data:
{toon_data}

Question: What's the trend and what should we do next quarter?
"""

# Send `final_prompt` to your LLM of choice.

Bottom Line

Building with AI isn’t just about model capabilities—it’s about sustainable economics. Leveraging LLMLingua‑2 for long‑form text and TOON for structured data gives you maximum token efficiency, lower costs, and faster responses. 🚀

AI Application Costs vs. Revenue

If token costs grow faster than revenue, your AI application will fail. TOON and LLMLingua‑2 give you breathing room. They let you:

  • Ship features faster without constantly optimizing for token costs
  • Scale sustainably as your user base grows
  • Compete effectively even against companies with bigger budgets
  • Build richer experiences because you’re not cutting features to save tokens

Both technologies are production‑ready, open‑source, and actively maintained.

TOON

  • Installation: pip install toon-py
  • Multiple language implementations available
  • Integration time: ~5 minutes to add to existing applications

LLMLingua‑2

  • Installation: pip install llmlingua
  • Integrated with LangChain and LlamaIndex
  • Backed by Microsoft Research with ongoing development

Getting Started (No Full Rewrite Needed)

  1. Identify your most expensive API calls (log tokens per endpoint).
  2. Test:
    • Use TOON on structured‑data endpoints.
    • Use LLMLingua‑2 on text‑heavy prompts.
  3. Measure actual savings (tokens before vs. after).
  4. Roll out gradually across your application.

Why It Matters

The AI revolution is expensive. Smart developers are finding ways to make it affordable. TOON and LLMLingua‑2 are two of the most effective tools available today.

Start cutting your API bills now.
