Two Efficient Techniques for Cutting AI Token Costs: TOON and Microsoft's LLMLingua-2

Published: 1 month ago (December 23, 2025, 06:08 GMT+8)

12 min read

Original: Dev.to


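Why Token Costs Matter

Building AI applications has never been more accessible. OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini have turned technology that looked like science fiction a few years ago into an everyday reality.

Companies are now building:

Intelligent agents
Retrieval-augmented generation (RAG) systems
Generative AI applications that solve complex business challenges at scale

But once you move from prototype to production, token costs start to show:

Every API call to a large language model (LLM) is billed by the token.
A token is roughly one word or a piece of a word.
You pay for both the input (data + instructions) and the output (the model's response).

The result: many developers discover, often too late, that their data formatting and prompt design inflate token usage by 40-60%.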
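To see how formatting alone changes payload size, here is a minimal sketch that serializes the same records as pretty-printed and as compact JSON. Character count is used only as a rough stand-in for token count (an assumption; real counts depend on the model's tokenizer), and the `records` data and the `inflation` helper are made up for illustration.

```python
import json

# Hypothetical sample records, invented for illustration.
records = [{"id": i, "name": f"item-{i}", "in_stock": True} for i in range(50)]

# Same data, two formatting choices.
pretty = json.dumps(records, indent=2)
compact = json.dumps(records, separators=(",", ":"))


def inflation(baseline: str, candidate: str) -> float:
    """Return how much longer `candidate` is than `baseline`, as a fraction."""
    return (len(candidate) - len(baseline)) / len(baseline)


print(f"compact: {len(compact)} chars")
print(f"pretty:  {len(pretty)} chars")
print(f"pretty-printing adds {inflation(compact, pretty):.0%} more characters")
```

Both strings decode to identical data, so any difference in size is pure formatting overhead that you pay for on every call.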

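Two Techniques for Reducing Token Waste

Technique	What it does	Token savings
TOON (Token-Oriented Object Notation)	A data serialization format built for large language models.	30-60% fewer tokens for structured data
LLMLingua-2 (Microsoft)	A prompt compression engine that removes 50-80% of a prompt while preserving its meaning.	50-80% fewer tokens for prompts

The two solve different problems but share one goal: substantially lower AI costs.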


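TOON: Token-Oriented Object Notation

What Is TOON?

TOON is a serialization format designed specifically for large language models. It combines:

YAML-like indentation for nested objects
CSV-like tabular layout for uniform arrays

Where JSON repeats the field names for every array element, TOON declares the field names once and then lists only the values.

JSON vs. TOON (Employee Example)

JSON (traditional)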

{
  "team": [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"}
  ]
}

TOON (efficient)

team[3]{id,name,role}:
  1,Tej B,engineer
  2,Praveen V,designer
  3,Partha G,manager

Same data, far fewer tokens: the field names appear once instead of three times.
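The "declare field names once" idea can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not the TOON library: the function name `encode_uniform_array` is hypothetical, it handles only flat, same-shaped records, and it does no quoting or escaping. Rows are indented by two spaces under the header.

```python
def encode_uniform_array(name: str, rows: list[dict]) -> str:
    """Render a list of same-shaped dicts as a TOON-style table.

    Illustration only: no quoting or escaping, so values must not
    contain commas or newlines.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    fields = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("all rows must share the same fields in the same order")
    # Header: array name, row count, and the field names declared once.
    field_list = ",".join(fields)
    header = f"{name}[{len(rows)}]{{{field_list}}}:"
    # Body: one indented, comma-separated line of values per row.
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


team = [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"},
]
print(encode_uniform_array("team", team))
```

For real workloads, prefer a maintained TOON implementation, which also covers nesting, quoting, and decoding.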

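Performance Highlights

73.9% accuracy versus 69.7% for JSON, while using 39.6% fewer tokens.
In this comparison, LLMs actually answered more accurately from TOON than from JSON.

Real-World Cost Example

Data	Format	Approx. tokens	Token reduction
100 products × 8 fields	JSON	~12,000	-
100 products × 8 fields	TOON	~6,000	≈ 50%

If you make thousands of such calls per day, that can save hundreds to thousands of dollars per month.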
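As a rough check on that estimate, the arithmetic can be written out. The per-call token counts come from the table above; the price per million input tokens and the call volume are assumptions chosen for illustration, not quoted rates, so substitute your provider's actual numbers.

```python
# Assumed inputs for illustration only; not quoted prices or real traffic.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD, hypothetical
CALLS_PER_DAY = 5_000                  # hypothetical
DAYS_PER_MONTH = 30

# Approximate per-call token counts from the cost example table.
JSON_TOKENS_PER_CALL = 12_000
TOON_TOKENS_PER_CALL = 6_000


def monthly_input_cost(tokens_per_call: int) -> float:
    """Monthly input-token cost in USD for a fixed payload size."""
    tokens = tokens_per_call * CALLS_PER_DAY * DAYS_PER_MONTH
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


json_cost = monthly_input_cost(JSON_TOKENS_PER_CALL)
toon_cost = monthly_input_cost(TOON_TOKENS_PER_CALL)
print(f"JSON:  ${json_cost:,.2f}/month")
print(f"TOON:  ${toon_cost:,.2f}/month")
print(f"Saved: ${json_cost - toon_cost:,.2f}/month")
```

Under these assumed inputs the difference is $2,250 per month, which sits at the upper end of the "hundreds to thousands" range; lower volume or cheaper models shrink it proportionally.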

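Ideal Use Cases

Arrays of uniformly structured objects (e.g., customer records, product catalogs, transaction logs)
Database query results sent to AI agents
Analytics dashboards, sales reports, inventory data
Any tabular or semi-tabular data that an LLM needs to process

Note: for deeply nested or highly non-uniform structures, JSON may still be more efficient. TOON is a specialized tool, not a universal replacement.

Installation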

pip install toon-py

Basic Python Usage

from toon_py import encode, decode

products = [
    {"id": 101, "name": "Laptop",   "price": 1299, "stock": 45},
    {"id": 102, "name": "Mouse",    "price":   29, "stock": 230},
    {"id": 103, "name": "Keyboard", "price":   89, "stock": 156}
]

# Encode to TOON
toon_data = encode(products)
print(toon_data)
# ──> [3]{id,name,price,stock}:
#     101,Laptop,1299,45
#     102,Mouse,29,230
#     103,Keyboard,89,156

# Use it in a prompt
prompt = f"Analyze this inventory:\n{toon_data}\n\nWhich products need restocking?"
# Send `prompt` to OpenAI, Claude, Gemini, etc.
# -> typically 30-60% fewer tokens than the same data as JSON

Command-Line Interface

# JSON → TOON
toon input.json -o output.toon

# TOON → JSON
toon data.toon -o output.json

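LLMLingua-2: Prompt Compression

What Is LLMLingua-2?

LLMLingua-2 (Microsoft) addresses prompt length rather than data serialization. It treats compression as a token classification problem: a Transformer encoder looks at each token with full bidirectional context and decides whether it is essential.

It is trained via data distillation from GPT-4, so it learns which tokens an LLM actually needs.
It works like a professional editor, removing filler words, redundant phrases, and unnecessary context while preserving meaning.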
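The token-classification idea can be sketched without the model itself. In the sketch below the per-token "keep" probabilities are invented by hand; in LLMLingua-2 they would come from the trained encoder. The function name `compress_by_keep_scores` is hypothetical and only illustrates the selection step: keep the highest-scoring fraction of tokens and preserve their original order.

```python
import math


def compress_by_keep_scores(
    tokens: list[str], keep_probs: list[float], rate: float
) -> list[str]:
    """Keep the highest-scoring fraction `rate` of tokens, in original order."""
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    n_keep = max(1, math.ceil(len(tokens) * rate))
    # Rank token positions by keep probability, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    # Restore original order so the compressed text stays readable.
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]


tokens = ["Please", "note", "that", "revenue", "increased", "by", "28%", "in", "Q4"]
# Made-up scores for illustration; a real classifier would predict these.
keep_probs = [0.10, 0.15, 0.05, 0.95, 0.90, 0.40, 0.98, 0.35, 0.97]

print(" ".join(compress_by_keep_scores(tokens, keep_probs, rate=0.5)))
# -> revenue increased by 28% Q4
```

Because each token is scored with context on both sides, filler such as "Please note that" can be dropped while the numbers and entities survive.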

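Compression Results

Up to 20x compression with minimal performance loss.
3-6x faster than the original LLMLingua.
End-to-end latency improves by a factor of about 1.6.

When to Use It

Long system instructions for AI agents
Context passages from documents (e.g., legal texts, research papers)
Few-shot examples or demonstrations
Any prompt approaching the model's token limit

Putting Them Together

Goal	Tool	How it helps
Fewer tokens for structured data	TOON	Compact serialization (field names declared once)
Fewer tokens for prompts and instructions	LLMLingua-2	Removes redundant wording while preserving meaning
Lower overall cost	Both	50-80% fewer tokens → lower API bills and faster responses

Quick Checklist for Developers

Identify data you send as uniform arrays → switch to TOON.
Use LLMLingua-2 on any prompt longer than about 500 tokens.
Measure token usage before and after conversion or compression.
Iterate: adjust field order in TOON or the compression rate in LLMLingua-2.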
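The checklist can be turned into a small routing helper. This is one possible sketch: the function names are hypothetical, the 500-token threshold is the rule of thumb from the checklist, and `token_count` is assumed to be supplied by whatever tokenizer you already use.

```python
PROMPT_TOKEN_THRESHOLD = 500  # rule of thumb from the checklist


def is_uniform_array(data) -> bool:
    """True if `data` is a non-empty list of dicts that all share the same keys."""
    if not isinstance(data, list) or not data:
        return False
    if not all(isinstance(item, dict) for item in data):
        return False
    first_keys = set(data[0])
    return all(set(item) == first_keys for item in data)


def suggest_technique(payload, token_count: int) -> str:
    """Suggest a technique for one payload, following the checklist."""
    if is_uniform_array(payload):
        return "TOON"
    if isinstance(payload, str) and token_count > PROMPT_TOKEN_THRESHOLD:
        return "LLMLingua-2"
    return "leave as is"


def reduction(before: int, after: int) -> float:
    """Fraction of tokens saved, for the 'measure before and after' step."""
    return 1 - after / before


rows = [{"id": 1, "price": 10}, {"id": 2, "price": 12}]
print(suggest_technique(rows, token_count=40))
print(suggest_technique("long system prompt ...", token_count=1_200))
print(f"{reduction(before=1_000, after=200):.0%} saved")
```

Keeping the measurement step as its own function makes it easy to log the saving for every call and confirm that a change actually helped.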

At compression ratios of 2x to 5x, the reported end-to-end speedup reaches up to 2.9x.

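Compressing a 1,000-token prompt down to 200 tokens is not just cheaper, it is also faster. Users get responses sooner and your bill is lower, so everyone benefits.

If you are building a retrieval-augmented generation (RAG) system, LLMLingua-2 can be a game changer. RAG applications typically retrieve 10-20 document chunks to answer a single question, which means sending a large amount of context to the LLM.

LLMLingua mitigates the "lost in the middle" problem in LLMs and improves how they handle long-context information. By compressing the retrieved context, you keep the important information while sharply reducing the token count.

LLMLingua is already integrated into LangChain and LlamaIndex, two widely used RAG frameworks.

Installation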

pip install llmlingua

Basic Compression

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

context = """
The quarterly financial report shows strong growth in Q4 2024.
Revenue increased by 28% compared to Q3, primarily driven by
enterprise sales. Operating costs decreased by 12% due to
improved efficiency measures. Customer retention improved to 96%,
while new customer acquisition grew by 34%. The product team
shipped five major features that significantly increased user
engagement metrics across all segments...
"""

question = "What were the main growth drivers in Q4?"
prompt = f"{context}\n\nQuestion: {question}"

compressed = compressor.compress_prompt(
    prompt,
    rate=0.5,                     # Keep roughly 50% of the tokens
    force_tokens=['\n', '?']      # Preserve important formatting
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Compressed prompt: {compressed['compressed_prompt']}")

Using It with LangChain RAG

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `documents` is assumed to be a list of LangChain Document objects loaded elsewhere.
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})  # fetch 20 chunks

# Compress the retrieved chunks before they reach the LLM.
compressor = LLMLinguaCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

compressed_docs = compression_retriever.invoke(
    "What are the key findings from the research?"
)

For Agentic AI

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

agent_instructions = """
You are a financial analysis agent with access to market data,
company financials, and industry reports. Your task is to identify
investment opportunities by analyzing revenue trends, profit margins,
market positioning, competitive advantages, and growth potential.
Consider both quantitative metrics and qualitative factors...
"""

compressed = compressor.compress_prompt(
    agent_instructions,
    rate=0.4  # Keep roughly 40% of the tokens (about 60% removed)
)

agent_prompt = f"{compressed['compressed_prompt']}\n\nTask: Analyze Tesla's Q4 performance"

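When to Use TOON vs. LLMLingua-2

Use case	Recommended tool
Structured data with repeated fields (customer lists, product catalogs, database results)	TOON
Tabular or semi-tabular data (sales reports, analytics, inventory)	TOON
Data processed by AI agents (arrays of objects with the same structure)	TOON
API responses (backend JSON)	TOON
Long text prompts (instructions, explanations, guidelines)	LLMLingua-2
RAG systems (compressing retrieved document context)	LLMLingua-2
Natural language (meeting notes, reports, articles)	LLMLingua-2
Multi-step reasoning (complex chain-of-thought prompts)	LLMLingua-2
Complex generative AI applications (structured data plus lengthy instructions)	Both
High-volume systems (thousands of AI API calls per day)	Both
Cost-sensitive applications (token efficiency affects profitability)	Both

Combining TOON and LLMLingua-2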

from toon_py import encode
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

sales_data = [
    {"month": "Oct", "revenue": 450_000, "customers": 1_245, "churn": 23},
    {"month": "Nov", "revenue": 485_000, "customers": 1_312, "churn": 19},
    {"month": "Dec", "revenue": 520_000, "customers": 1_398, "churn": 15},
]
toon_data = encode(sales_data)

instructions = """
Analyze the quarterly sales performance considering seasonal trends,
customer acquisition costs, competitive landscape changes, and
market conditions. Compare with historical data from the past
three years. Identify key growth drivers and potential risks.
Provide actionable recommendations for the sales team based on
data‑driven insights and market analysis...
"""
compressed_instructions = compressor.compress_prompt(instructions, rate=0.5)

final_prompt = f"""
{compressed_instructions['compressed_prompt']}

Q4 Sales Data:
{toon_data}

Question: What's the trend and what should we do next quarter?
"""

# Send `final_prompt` to your LLM of choice.

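Key Takeaways

Building AI applications is not only about model capability; it is also about sustainable economics. Using LLMLingua-2 for long text and TOON for structured data improves token efficiency, lowers cost, and speeds up responses. 🚀

AI Application Costs and Revenue

If token costs grow faster than revenue, your AI application will fail. TOON and LLMLingua-2 give you breathing room. They help you:

Ship features faster without constantly optimizing for token cost
Scale sustainably as your user base grows
Compete effectively even against companies with bigger budgets
Build richer experiences, because you do not have to cut features to save tokens

Both technologies are production-ready, open source, and actively maintained.

TOON

Install: pip install toon-py
Implementations available in multiple languages
Integration time: roughly 5 minutes to add to an existing application

LLMLingua-2

Install: pip install llmlingua
Integrated with LangChain and LlamaIndex
Backed by Microsoft Research and under ongoing development

Getting Started (No Full Rewrite Required)

Identify your most expensive API calls (log token counts per endpoint).
Test:
- TOON on structured-data endpoints.
- LLMLingua-2 on text-heavy prompts.
Measure the actual savings (token counts before and after).
Roll out gradually across the whole application.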
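For the first step, logging token counts per endpoint, a small in-memory ledger is often enough to find where the money goes. The sketch below is an assumption-laden illustration: the `TokenLedger` class, the endpoint names, and the numbers are hypothetical, and in practice the counts would come from the usage figures your LLM provider returns with each response.

```python
from collections import defaultdict


class TokenLedger:
    """Accumulate token counts per endpoint to find the most expensive calls."""

    def __init__(self) -> None:
        self._totals = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})

    def record(self, endpoint: str, input_tokens: int, output_tokens: int) -> None:
        """Add one API call's token usage to the endpoint's running totals."""
        entry = self._totals[endpoint]
        entry["calls"] += 1
        entry["input"] += input_tokens
        entry["output"] += output_tokens

    def most_expensive(self, n: int = 3) -> list[tuple[str, int]]:
        """Return the top `n` endpoints by total tokens (input + output)."""
        totals = {ep: e["input"] + e["output"] for ep, e in self._totals.items()}
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical endpoints and counts, for illustration only.
ledger = TokenLedger()
ledger.record("/summarize", input_tokens=4_200, output_tokens=300)
ledger.record("/search", input_tokens=900, output_tokens=150)
ledger.record("/summarize", input_tokens=3_900, output_tokens=280)

print(ledger.most_expensive())
# -> [('/summarize', 8680), ('/search', 1050)]
```

Run the same ledger before and after introducing TOON or LLMLingua-2 on an endpoint to get the before/after comparison the list asks for.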

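Why It Matters

The AI revolution is expensive. Smart developers are finding ways to make it affordable, and TOON and LLMLingua-2 are two of the most effective tools available today.

Start cutting your API bill now.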