[Paper] Online Experiential Learning for Language Models

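Performance gains: Across 4 model sizes (0.3B–6B parameters) and 2 types of games, OEL raised task success rate by 4–12% per iteration.
Token efficiency: The updated models completed tasks with 10–18% fewer tokens, suggesting better planning and fewer trial-and-error exchanges.
Out-of-distribution robustness: Although training focused on specific game environments, OEL did not degrade performance on unrelated benchmarks (e.g., standard question-answering datasets).
Knowledge vs. raw trajectories: Feeding the model distilled experience vectors yielded up to a 3x accuracy improvement over fine-tuning directly on raw logs.
On-policy consistency: When the knowledge extractor was out of sync with the policy model (e.g., an older model was used to extract experience), the gains disappeared, underscoring that the extractor must reflect the current policy.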

Published: 3 days ago (March 18, 2026, 01:57 GMT+8)

Source: arXiv - 2603.16856v1

Overview

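This paper introduces Online Experiential Learning (OEL), a new framework that lets large language models (LLMs) improve continuously through interaction with real users. Rather than relying solely on offline fine-tuning with curated datasets, OEL extracts "experience" from deployment logs, distills it into the model, and repeats the cycle, turning every chat, query, or game action into a training signal.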

Data Collection (User‑Side): While the model serves users (e.g., playing a text adventure), it logs each interaction as a trajectory: prompt, model response, user feedback, and any reward signal (success/failure).
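The logged fields listed above could be captured in a record like the following. This is a minimal Python sketch: the class name `Trajectory`, the field names, and the choice of JSON Lines as the storage format are assumptions made for illustration, not details taken from the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Trajectory:
    """One logged interaction. Field names are illustrative assumptions."""
    prompt: str
    response: str
    user_feedback: Optional[str] = None
    reward: Optional[float] = None  # e.g. 1.0 for success, 0.0 for failure


def to_jsonl_line(t: Trajectory) -> str:
    """Serialize a trajectory as one line of a JSON Lines log."""
    return json.dumps(asdict(t), ensure_ascii=False)


def from_jsonl_line(line: str) -> Trajectory:
    """Rebuild a trajectory from one line of the log."""
    return Trajectory(**json.loads(line))
```

A round trip such as `from_jsonl_line(to_jsonl_line(t))` should return a record equal to `t`, which makes the log easy to append to on the user side and replay on the server side.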
Experiential Knowledge Extraction:
- The trajectories are processed by a lightweight encoder that abstracts away surface details and captures what the model learned (e.g., successful strategies, common failure patterns).
- The result is a set of compact “experience vectors” that are easy to store and transmit.
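To make the idea of a compact, fixed-size experience vector concrete, here is a toy stand-in for the encoder. It is not the paper's encoder: the hashing scheme, the vector size `DIM`, and the function names are all assumptions chosen so the example stays self-contained. A real system would presumably use a learned model.

```python
import zlib
from typing import List, Tuple

DIM = 8  # illustrative vector size, not a value from the paper


def encode_trajectory(text: str, reward: float, dim: int = DIM) -> List[float]:
    """Hash each whitespace token into one of `dim` buckets.

    Tokens from successful trajectories count +1, others count -1,
    so the vector keeps outcome information but drops the raw text.
    """
    sign = 1.0 if reward > 0 else -1.0
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = zlib.crc32(token.encode("utf-8")) % dim
        vec[bucket] += sign
    return vec


def experience_vector(trajectories: List[Tuple[str, float]],
                      dim: int = DIM) -> List[float]:
    """Average per-trajectory vectors into one fixed-size summary."""
    if not trajectories:
        return [0.0] * dim
    total = [0.0] * dim
    for text, reward in trajectories:
        for i, v in enumerate(encode_trajectory(text, reward, dim)):
            total[i] += v
    return [v / len(trajectories) for v in total]
```

However many trajectories go in, the output has `DIM` entries, which is the property that makes such summaries cheap to store and transmit.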
On‑Policy Context Distillation (Server‑Side):
- The current model (the policy model) is fine‑tuned on the extracted vectors using a contrastive/distillation loss that aligns the model’s internal representations with the experiential knowledge.
- Crucially, this step does not require replaying the original user interactions, preserving privacy and reducing bandwidth.
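The summary does not specify the exact loss, so the sketch below swaps in one common form of distillation objective: the KL divergence between a "teacher" next-token distribution (assumed here to be the model with experiential knowledge available) and a "student" distribution (the model without it). The function names and the teacher/student framing are assumptions, and the sketch operates on output distributions rather than internal representations.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float]) -> float:
    """Penalty that shrinks to zero as the student matches the teacher."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return kl_divergence(teacher, student)
```

Minimizing this quantity with respect to the student's parameters pulls its predictions toward the teacher's, and only the logits are needed, not the original user interactions.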
Iterative Loop: The updated model is redeployed, collects higher‑quality trajectories, and the cycle repeats. Over successive rounds the model’s policy becomes more aligned with the real‑world tasks it faces.
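The four steps above can be read as a simple loop. The following skeleton is a sketch under stated assumptions: `collect`, `extract`, and `distill` are hypothetical callables standing in for deployment, knowledge extraction, and distillation, and `run_oel` is an illustrative name rather than an API from the paper.

```python
from typing import Any, Callable, List


def run_oel(model: Any,
            collect: Callable[[Any], List[Any]],
            extract: Callable[[Any, List[Any]], Any],
            distill: Callable[[Any, Any], Any],
            rounds: int) -> Any:
    """Run `rounds` iterations of collect -> extract -> distill."""
    for _ in range(rounds):
        # 1. Deploy the current model and log its trajectories.
        trajectories = collect(model)
        # 2. Extract knowledge using the *current* model, so the
        #    extractor stays in sync with the policy being updated.
        knowledge = extract(model, trajectories)
        # 3. Distill the knowledge into the model; the result is
        #    what gets redeployed in the next round.
        model = distill(model, knowledge)
    return model
```

Passing the current `model` into `extract` on every round reflects the on-policy consistency finding reported above: using a stale model for extraction is the case in which the gains were reported to disappear.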

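Online Experiential Learning opens a promising path toward truly self-improving language models, turning every deployment into a learning opportunity while respecting user privacy.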