Two Efficient Technologies to Reduce AI Token Costs: TOON and Microsoft's LLMLingua-2
Source: Dev.to
Why Token Costs Matter
Building AI applications has never been more accessible. OpenAI’s GPT‑4, Anthropic’s Claude, and Google’s Gemini have turned what felt like science fiction a few years ago into everyday reality.
Enterprises are now creating:
- Intelligent agents
- Retrieval‑augmented generation (RAG) systems
- GenAI applications that solve complex business challenges at scale
But once you move from prototype to production, token costs hit hard:
- Every API call to a large language model (LLM) is billed per token.
- A token ≈ a word or part of a word.
- You pay for both the input (data + instructions) and the output (the model’s response).
Result: Many developers discover—often too late—that their data formatting and prompt design are inflating token usage by 40‑60 %.
Two Technologies That Cut Token Waste
| Technology | What It Does | Token Savings |
|---|---|---|
| TOON (Token‑Oriented Object Notation) | A data‑serialization format built for LLMs. | 30‑60 % fewer tokens for structured data |
| LLMLingua‑2 (Microsoft) | A prompt‑compression engine that removes 50‑80 % of a prompt while preserving meaning. | 50‑80 % fewer tokens for prompts |
The two solve different problems but share the same goal: dramatically lower AI costs.
TOON – Token‑Oriented Object Notation
What Is TOON?
TOON is a serialization format designed specifically for large language models. It blends:
- YAML‑style indentation for nested objects
- CSV‑style tabular layout for uniform arrays
Instead of repeating field names for every array element (as JSON does), TOON declares the field names once and then lists only the values.
JSON vs. TOON (Employee Example)
JSON (traditional)
{
  "team": [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"}
  ]
}
TOON (efficient)
team[3]{id,name,role}:
  1,Tej B,engineer
  2,Praveen V,designer
  3,Partha G,manager
Same data, far fewer tokens.
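You can verify the claim with any tokenizer. Here is a quick sketch using OpenAI’s tiktoken (exact counts vary by model and tokenizer, so treat the numbers as indicative):
import json
import tiktoken  # pip install tiktoken

team = [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"},
]
json_text = json.dumps({"team": team}, indent=2)
toon_text = (
    "team[3]{id,name,role}:\n"
    "  1,Tej B,engineer\n"
    "  2,Praveen V,designer\n"
    "  3,Partha G,manager"
)

# Same tokenizer GPT‑4‑class models use; counts differ slightly per model
enc = tiktoken.get_encoding("cl100k_base")
print("JSON tokens:", len(enc.encode(json_text)))
print("TOON tokens:", len(enc.encode(toon_text)))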
Performance Highlights
- In benchmarks published by the TOON project, TOON reached 73.9 % accuracy with 39.6 % fewer tokens, versus 69.7 % accuracy for JSON.
- On those benchmarks, LLMs understood TOON at least as well as JSON while reading far fewer tokens.
Real‑World Cost Example
| Data | Format | Approx. Tokens | Token Reduction |
|---|---|---|---|
| 100 products × 8 fields | JSON | ~12,000 | — |
| 100 products × 8 fields | TOON | ~6,000 | ≈ 50 % |
If you run thousands of such calls daily, you can save hundreds to thousands of dollars each month.
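That estimate is easy to sanity-check. A back-of-the-envelope sketch (the workload and per-token price below are placeholders; use your provider’s actual price sheet):
calls_per_day = 5_000               # hypothetical workload
tokens_saved_per_call = 6_000       # from the table above
price_per_1k_input_tokens = 0.0025  # placeholder $ value; check your provider

monthly_savings = (
    calls_per_day * 30 * tokens_saved_per_call / 1_000 * price_per_1k_input_tokens
)
print(f"≈ ${monthly_savings:,.0f} saved per month")  # ≈ $2,250 with these numbers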
Ideal Use‑Cases
- Uniform arrays of objects (e.g., customer records, product catalogs, transaction logs)
- Database query results sent to AI agents
- Analytics dashboards, sales reports, inventory data
- Any tabular or semi‑tabular data that an LLM must process
Note: For deeply nested or highly non‑uniform structures, JSON may still be more efficient. TOON is a specialized tool, not a universal replacement.
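Before reaching for TOON, a practical pre-check is whether your records are flat and share identical keys. The `is_uniform` helper below is a hypothetical sketch, not part of any TOON library:
def is_uniform(records: list) -> bool:
    """True if every record is a flat dict with the same keys, in the same order."""
    if not records or not all(isinstance(r, dict) for r in records):
        return False
    keys = list(records[0])
    return all(
        list(r) == keys
        and not any(isinstance(v, (dict, list)) for v in r.values())
        for r in records
    )

# Uniform → good TOON candidate; otherwise stick with JSON
print(is_uniform([{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]))  # True
print(is_uniform([{"id": 1}, {"id": 2, "extra": [1, 2]}]))           # False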
Installation
pip install toon-py
Basic Python Usage
from toon_py import encode, decode

products = [
    {"id": 101, "name": "Laptop", "price": 1299, "stock": 45},
    {"id": 102, "name": "Mouse", "price": 29, "stock": 230},
    {"id": 103, "name": "Keyboard", "price": 89, "stock": 156},
]

# Encode the list to TOON: field names declared once, then value rows
toon_data = encode(products)
print(toon_data)
# Output:
# [3]{id,name,price,stock}:
#   101,Laptop,1299,45
#   102,Mouse,29,230
#   103,Keyboard,89,156

# Use the compact representation directly in a prompt
prompt = f"Analyze this inventory:\n{toon_data}\n\nWhich products need restocking?"
# Send `prompt` to OpenAI, Claude, Gemini, etc. → 40‑60 % fewer input tokens
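The `decode` import above suggests round-tripping is supported; assuming it mirrors `encode` (an assumption, not verified here), recovering the original structure looks like:
# Round trip: TOON text back to Python objects (assumes decode mirrors encode)
restored = decode(toon_data)
assert restored == products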
Command‑Line Interface
# JSON → TOON
toon input.json -o output.toon
# TOON → JSON
toon data.toon -o output.json
LLMLingua‑2 – Prompt Compression
What Is LLMLingua‑2?
LLMLingua‑2 (Microsoft) tackles prompt length rather than data serialization. It treats compression as a token‑classification problem, using a Transformer encoder to decide which tokens are essential given the full bidirectional context.
- Trained via data distillation from GPT‑4, so its decisions about which tokens to keep reflect what an LLM actually needs.
- Acts like an expert editor that removes filler words, redundant phrases, and unnecessary context while preserving meaning.
Compression Power
- Up to 20× compression with minimal performance loss.
- 3‑6× faster than the original LLMLingua.
- Improves end‑to‑end latency by ≈ 1.6×.
When to Use It
- Long system instructions for AI agents
- Context passages from documents (e.g., legal text, research papers)
- Few‑shot examples or demonstrations
- Any prompt that approaches the model’s token limit
Putting It All Together
| Goal | Tool | How It Helps |
|---|---|---|
| Reduce token count for structured data | TOON | Compact serialization (field names declared once) |
| Reduce token count for prompts / instructions | LLMLingua‑2 | Intelligent removal of redundant wording while preserving semantics |
| Overall cost reduction | Both | 50‑80 % fewer tokens → lower API bills, faster latency |
Quick Checklist for Developers
- Identify data that is sent as uniform arrays → switch to TOON.
- Run LLMLingua‑2 on any prompt longer than ~500 tokens.
- Measure token usage before & after conversion/compression.
- Iterate: fine‑tune field ordering in TOON or adjust compression aggressiveness in LLMLingua‑2.
2.9× Average Compression (2×–5× Typical Range)
A 1,000‑token prompt compressed to 200 tokens isn’t just cheaper—it’s faster. Your users get responses quicker, you pay less, and everyone wins.
If you’re building Retrieval‑Augmented Generation (RAG) systems, LLMLingua‑2 is a game‑changer. RAG applications often pull 10‑20 document chunks to answer a single question, which means massive context to send to your LLM.
LLMLingua mitigates the “lost in the middle” issue in LLMs, enhancing long‑context information processing. By compressing retrieved context, you keep all the important information while dramatically reducing token count.
LLMLingua has been integrated into LangChain and LlamaIndex, two widely‑used RAG frameworks.
Installation
pip install llmlingua
Basic Compression
from llmlingua import PromptCompressor

# Load the LLMLingua‑2 compressor (downloads the model on first use)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

context = """
The quarterly financial report shows strong growth in Q4 2024.
Revenue increased by 28% compared to Q3, primarily driven by
enterprise sales. Operating costs decreased by 12% due to
improved efficiency measures. Customer retention improved to 96%,
while new customer acquisition grew by 34%. The product team
shipped five major features that significantly increased user
engagement metrics across all segments...
"""

question = "What were the main growth drivers in Q4?"
prompt = f"{context}\n\nQuestion: {question}"

compressed = compressor.compress_prompt(
    prompt,
    rate=0.5,                  # keep ~50 % of the tokens (50 % compression)
    force_tokens=['\n', '?'],  # never drop these; preserves formatting
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Compressed prompt: {compressed['compressed_prompt']}")
With LangChain RAG
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMLinguaCompressor
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `documents` is assumed to be a list of already-loaded LangChain Documents
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Wrap the retriever so every retrieved chunk is compressed before reaching the LLM
compressor = LLMLinguaCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What are the key findings from the research?"
)
For Agentic AI
from llmlingua import PromptCompressor
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True,
)
agent_instructions = """
You are a financial analysis agent with access to market data,
company financials, and industry reports. Your task is to identify
investment opportunities by analyzing revenue trends, profit margins,
market positioning, competitive advantages, and growth potential.
Consider both quantitative metrics and qualitative factors...
"""
compressed = compressor.compress_prompt(
agent_instructions,
rate=0.4 # 60 % compression
)
agent_prompt = f"{compressed['compressed_prompt']}\n\nTask: Analyze Tesla's Q4 performance"
When to Use TOON vs. LLMLingua‑2
| Use Case | Recommended Tool |
|---|---|
| Structured data with repeated fields (customer lists, product catalogs, DB results) | TOON |
| Tabular or semi‑tabular data (sales reports, analytics, inventory) | TOON |
| AI agents processing data (arrays of objects with the same structure) | TOON |
| API responses (JSON from backend) | TOON |
| Long text prompts (instructions, explanations, guidelines) | LLMLingua‑2 |
| RAG systems (compress retrieved document context) | LLMLingua‑2 |
| Natural language (meeting transcripts, reports, articles) | LLMLingua‑2 |
| Multi‑step reasoning (complex chain‑of‑thought prompts) | LLMLingua‑2 |
| Sophisticated GenAI apps combining structured data & lengthy instructions | Both |
| High‑volume systems (thousands of AI API calls daily) | Both |
| Cost‑sensitive applications (token efficiency impacts profitability) | Both |
Combining TOON & LLMLingua‑2
from toon_py import encode
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

sales_data = [
    {"month": "Oct", "revenue": 450_000, "customers": 1_245, "churn": 23},
    {"month": "Nov", "revenue": 485_000, "customers": 1_312, "churn": 19},
    {"month": "Dec", "revenue": 520_000, "customers": 1_398, "churn": 15},
]

# Structured data → TOON (compact serialization)
toon_data = encode(sales_data)

instructions = """
Analyze the quarterly sales performance considering seasonal trends,
customer acquisition costs, competitive landscape changes, and
market conditions. Compare with historical data from the past
three years. Identify key growth drivers and potential risks.
Provide actionable recommendations for the sales team based on
data‑driven insights and market analysis...
"""

# Long instructions → LLMLingua‑2 (semantic compression)
compressed_instructions = compressor.compress_prompt(instructions, rate=0.5)

final_prompt = f"""
{compressed_instructions['compressed_prompt']}

Q4 Sales Data:
{toon_data}

Question: What's the trend and what should we do next quarter?
"""

# Send `final_prompt` to your LLM of choice.
Bottom Line
Building with AI isn’t just about model capabilities—it’s about sustainable economics. Leveraging LLMLingua‑2 for long‑form text and TOON for structured data gives you maximum token efficiency, lower costs, and faster responses. 🚀
AI Application Costs vs. Revenue
If token costs grow faster than revenue, your AI application will fail. TOON and LLMLingua‑2 give you breathing room. They let you:
- Ship features faster without constantly optimizing for token costs
- Scale sustainably as your user base grows
- Compete effectively even against companies with bigger budgets
- Build richer experiences because you’re not cutting features to save tokens
Both technologies are production‑ready, open‑source, and actively maintained.
TOON
- Installation: pip install toon-py
- Multiple language implementations available
- Integration time: ~5 minutes to add to existing applications
LLMLingua‑2
- Installation: pip install llmlingua
- Integrated with LangChain and LlamaIndex
- Backed by Microsoft Research with ongoing development
Getting Started (No Full Rewrite Needed)
- Identify your most expensive API calls (log tokens per endpoint; see the sketch after this list).
- Test:
- Use TOON on structured‑data endpoints.
- Use LLMLingua‑2 on text‑heavy prompts.
- Measure actual savings (tokens before vs. after).
- Roll out gradually across your application.
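For step 1, most provider SDKs return token counts on every response. A minimal sketch with the OpenAI Python SDK (the model name is arbitrary, and the fields shown follow the v1 chat-completions shape; adapt to your provider):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": "Hello"}],
)

# Log these per endpoint to find your most expensive calls
usage = response.usage
print(f"input={usage.prompt_tokens} output={usage.completion_tokens} total={usage.total_tokens}")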
Why It Matters
The AI revolution is expensive. Smart developers are finding ways to make it affordable. TOON and LLMLingua‑2 are two of the most effective tools available today.
Start cutting your API bills now.