[Paper] Online Experiential Learning for Language Models

Published: March 17, 2026
Source: arXiv (2603.16856v1)

Overview

The paper introduces Online Experiential Learning (OEL), a new framework that lets large language models (LLMs) keep getting better by learning from the very interactions they have with real users. Instead of relying solely on offline fine‑tuning with curated datasets, OEL extracts “experience” from deployment logs, distills it into the model, and repeats the cycle—turning every chat, query, or game move into a training signal.

Key Contributions

  • Experiential Knowledge Extraction: A method to turn raw user‑model interaction trajectories into compact, transferable representations that capture what the model actually learned during deployment.
  • On‑Policy Context Distillation: A lightweight, privacy‑preserving way to update model parameters using the extracted knowledge without needing direct access to the user‑side environment.
  • Iterative Online Learning Loop: Demonstrates that repeatedly applying extraction → distillation → redeployment yields steady gains in task performance and token efficiency.
  • Empirical Validation Across Scales: Experiments on text‑based game environments show consistent improvements for models ranging from a few hundred million to several billion parameters, covering both “thinking” (requiring planning) and “non‑thinking” tasks.
  • Insights on Knowledge vs. Raw Data: Shows that distilled experiential knowledge is far more effective for fine‑tuning than feeding raw interaction logs back into the model.

Methodology

  1. Data Collection (User‑Side): While the model serves users (e.g., playing a text adventure), it logs each interaction as a trajectory: prompt, model response, user feedback, and any reward signal (success/failure).
  2. Experiential Knowledge Extraction:
    • The trajectories are processed by a lightweight encoder that abstracts away surface details and captures what the model learned (e.g., successful strategies, common failure patterns).
    • The result is a set of compact “experience vectors” that are easy to store and transmit.
  3. On‑Policy Context Distillation (Server‑Side):
    • The current model (the policy model) is fine‑tuned on the extracted vectors using a contrastive/distillation loss that aligns the model’s internal representations with the experiential knowledge.
    • Crucially, this step does not require replaying the original user interactions, preserving privacy and reducing bandwidth.
  4. Iterative Loop: The updated model is redeployed, collects higher‑quality trajectories, and the cycle repeats. Over successive rounds the model’s policy becomes more aligned with the real‑world tasks it faces.
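The four steps above can be sketched end to end in a few lines. Everything below is an illustrative assumption rather than the paper's actual architecture: the token-bucket "extractor", the weight-nudge "distillation" step, and the synthetic text-game trajectories all stand in for the real components.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float  # 1.0 = success, 0.0 = failure

def extract_experience(trajectories, dim=8):
    # Toy extractor: bucket token counts, signed by reward, into a
    # compact "experience vector" (stands in for the paper's encoder).
    vec = [0.0] * dim
    for t in trajectories:
        sign = 1.0 if t.reward > 0.5 else -1.0
        for tok in (t.prompt + " " + t.response).split():
            vec[sum(map(ord, tok)) % dim] += sign
    n = max(len(trajectories), 1)
    return [v / n for v in vec]

def distill(weights, experience, lr=0.1):
    # Toy on-policy distillation: nudge policy weights toward the
    # experience vector (stands in for a gradient-based update).
    return [w + lr * (e - w) for w, e in zip(weights, experience)]

# One round of the loop on two synthetic text-game trajectories.
logs = [
    Trajectory("go north", "you find a key", 1.0),
    Trajectory("go south", "a troll eats you", 0.0),
]
weights = [0.0] * 8
exp_vec = extract_experience(logs)   # step 2: user-side extraction
weights = distill(weights, exp_vec)  # step 3: server-side distillation
# step 4: redeploy the updated weights and repeat with fresh logs
```

Note that only `exp_vec` crosses the user/server boundary in this sketch, mirroring the paper's point that raw interactions never need to be replayed server-side.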

Results & Findings

  • Performance Gains: Across 4 model sizes (0.3B–6B parameters) and 2 game families, OEL improved task success rates by 4–12% per iteration.
  • Token Efficiency: The updated models solved tasks using 10–18% fewer tokens, indicating better planning and less “trial‑and‑error” chatter.
  • Out‑of‑Distribution Robustness: Even though training focused on the specific game environments, OEL did not degrade performance on unrelated benchmarks (e.g., standard QA datasets).
  • Knowledge vs. Raw Trajectories: Feeding the distilled experience vectors into the model yielded up to 3× higher accuracy improvements than directly fine‑tuning on the raw logs.
  • On‑Policy Consistency: When the knowledge extractor was out‑of‑sync with the policy model (e.g., using an older model to extract experience), gains vanished, underscoring the need for the extractor to reflect the current policy.
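The last finding suggests a practical guard: refuse to distill when the experience was extracted under a stale policy. A minimal sketch, where the version tags and exception name are hypothetical rather than from the paper:

```python
class OffPolicyExperienceError(Exception):
    """Raised when experience vectors were extracted by a stale policy."""

def check_on_policy(policy_version: str, extractor_version: str) -> None:
    # The paper reports that gains vanish when extractor and policy
    # drift apart, so fail fast instead of silently distilling.
    if policy_version != extractor_version:
        raise OffPolicyExperienceError(
            f"extractor {extractor_version!r} lags policy {policy_version!r}"
        )

check_on_policy("round-7", "round-7")  # in sync: no error
```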

Practical Implications

  • Continuous Improvement for SaaS LLMs: Companies can embed OEL into their API services, turning every user request into a training signal without exposing raw logs.
  • Reduced Annotation Costs: Reduces reliance on costly human‑in‑the‑loop labeling, since the model learns from its own successes and failures.
  • Privacy‑First Learning: Because only abstracted experience vectors leave the device, raw user data can stay local, aligning with GDPR‑style regulations.
  • Faster Deployment Cycles: The lightweight distillation step can be run on modest GPU clusters, enabling near‑real‑time model updates.
  • Better Resource Utilization: Higher token efficiency translates to lower inference costs for both providers and end‑users, especially in latency‑sensitive applications (chatbots, virtual assistants).
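As a back-of-the-envelope illustration of the efficiency point: only the 10–18% reduction range comes from the results above; the 2,000-token task size and the per-token price are assumptions.

```python
def per_task_saving(tokens: int, reduction: float, price_per_1k: float) -> float:
    # Cost saved on one task when the model emits `reduction` fewer tokens.
    return tokens * reduction / 1000 * price_per_1k

# A 2,000-token task at a hypothetical $0.002 per 1K tokens:
low = per_task_saving(2000, 0.10, 0.002)   # 10% fewer tokens
high = per_task_saving(2000, 0.18, 0.002)  # 18% fewer tokens
```

Per-task savings look small in isolation, but they scale linearly with request volume, which is where the inference-cost argument bites for high-traffic chatbots and assistants.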

Limitations & Future Work

  • Domain Specificity: Experiments were limited to text‑based games; applying OEL to open‑domain chat or code generation may require richer reward signals.
  • Extractor Complexity: The current knowledge extractor is a simple encoder; more sophisticated architectures (e.g., graph‑based planners) could capture richer strategies.
  • Scalability to Multi‑Modal Settings: Extending OEL to vision‑language or audio‑language models remains an open challenge.
  • Safety & Alignment: While OEL improves task performance, the authors note the need for safeguards to prevent the model from reinforcing undesirable behaviors observed in the wild.

Online Experiential Learning opens a promising path toward truly self‑improving language models—turning every deployment into a learning opportunity while respecting user privacy.

Authors

  • Tianzhu Ye
  • Li Dong
  • Qingxiu Dong
  • Xun Wu
  • Shaohan Huang
  • Furu Wei

Paper Information

  • arXiv ID: 2603.16856v1
  • Categories: cs.CL
  • Published: March 17, 2026