[Paper] Distilling Feedback into Memory-as-a-Tool

Published: January 9, 2026 at 12:26 PM EST
3 min read
Source: arXiv - 2601.05960v1

Overview

The paper introduces “Distilling Feedback into Memory‑as‑a‑Tool,” a framework that lets large language models (LLMs) turn the fleeting critiques they receive during inference into permanent, searchable guidelines. By storing these distilled insights in a file‑based memory and letting the model invoke them as tools, the approach matches the quality of heavyweight test‑time refinement pipelines to within a fraction of a percentage point while cutting inference cost dramatically.

Key Contributions

  • Memory‑as‑a‑Tool (MaT) architecture: a lightweight, file‑system‑style store that holds distilled feedback guidelines for rapid retrieval.
  • Agent‑controlled tool calls: the LLM decides when to read from or write to the memory, treating it like an external utility rather than a static prompt (a tool‑interface sketch follows this list).
  • Rubric Feedback Bench: a new benchmark dataset that evaluates how well models can learn from rubric‑based feedback across multiple tasks.
  • Cost‑effective performance: empirical results show MaT‑augmented LLMs reach the accuracy of full test‑time refinement with up to 70 % fewer compute cycles.
  • Generalizable pipeline: the framework works with any off‑the‑shelf LLM and does not require fine‑tuning, making it easy to plug into existing systems.
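
To make the agent‑controlled tool calls concrete, here is a minimal sketch of how the memory could be exposed to the model through function‑calling tool definitions. Only the read_memory name comes from the paper; the write‑side tool, the parameter names, and the schema layout are illustrative assumptions.

```python
# Hypothetical tool definitions exposing the file-based memory to an LLM via
# function calling. Only `read_memory` is named in the paper; everything else
# (write_memory, parameter names, schema layout) is an assumption.
MEMORY_TOOLS = [
    {
        "name": "read_memory",
        "description": "Retrieve stored guidelines relevant to the current task.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Topic the guideline should cover."},
                "top_k": {"type": "integer", "description": "Maximum number of guidelines to return."},
            },
            "required": ["query"],
        },
    },
    {
        "name": "write_memory",
        "description": "Persist a distilled guideline under a task/domain path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory reflecting the task or domain."},
                "guideline": {"type": "string", "description": "Concise rule distilled from a critique."},
            },
            "required": ["path", "guideline"],
        },
    },
]
```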

Methodology

  1. Feedback Collection: During a standard inference pass, the LLM generates an answer and then receives a short critique (e.g., “Your explanation missed the edge case about null inputs”).
  2. Distillation Step: The model processes the critique and extracts a concise guideline (e.g., “Always check for null before accessing fields”).
  3. Memory Write: The guideline is saved as a plain‑text file in a hierarchical directory that reflects the task or domain.
  4. Tool‑Call Decision: On subsequent inputs, the LLM can issue a read_memory tool call, retrieving the most relevant guidelines based on a similarity query.
  5. Guideline‑Guided Generation: The retrieved guidelines are injected into the prompt as context, steering the model toward better answers without re‑running the full refinement loop.

The whole loop is orchestrated by a lightweight agent that decides when to read, write, or ignore the memory, keeping the process fully decoupled from the underlying LLM: no gradient updates or fine‑tuning are required.
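
Below is a compact sketch of this distill‑write‑retrieve loop, assuming a plain‑text, directory‑per‑domain layout and a simple word‑overlap retriever; the llm argument stands in for any off‑the‑shelf model, and none of the helper names come from the paper.

```python
# Minimal sketch of the distill -> write -> retrieve loop described above.
# The file layout, scoring function, and `llm` callable are illustrative
# assumptions; the paper specifies a file-based store with agent-driven
# read/write tool calls, not this exact code.
from pathlib import Path


class FileMemory:
    """Plain-text guidelines stored in a hierarchical directory."""

    def __init__(self, root: str = "memory"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, domain: str, guideline: str) -> None:
        folder = self.root / domain
        folder.mkdir(parents=True, exist_ok=True)
        # One guideline per file keeps writes append-only and easy to prune later.
        idx = len(list(folder.glob("*.txt")))
        (folder / f"guideline_{idx:04d}.txt").write_text(guideline, encoding="utf-8")

    def read(self, query: str, top_k: int = 3) -> list[str]:
        # Simple lexical similarity: rank guidelines by word overlap with the query.
        query_words = set(query.lower().split())
        scored = []
        for path in self.root.rglob("*.txt"):
            text = path.read_text(encoding="utf-8")
            overlap = len(query_words & set(text.lower().split()))
            scored.append((overlap, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:top_k] if score > 0]


def distill_and_store(llm, memory: FileMemory, domain: str, critique: str) -> str:
    """Turn a transient critique into a persistent, reusable guideline."""
    guideline = llm(f"Rewrite this critique as one short, general guideline:\n{critique}")
    memory.write(domain, guideline)
    return guideline


def answer_with_memory(llm, memory: FileMemory, task: str) -> str:
    """Retrieve relevant guidelines and inject them as context before answering."""
    guidelines = memory.read(task)
    context = "\n".join(f"- {g}" for g in guidelines)
    return llm(f"Guidelines from past feedback:\n{context}\n\nTask: {task}")
```

In this sketch the expensive critique‑and‑distill step runs once per mistake; subsequent related queries pay only for a cheap file lookup, which is where the reported cost savings come from.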

Results & Findings

Model          Baseline (no feedback)   Test‑time Refinement   MaT‑augmented LLM
GPT‑3.5        68.2 %                   78.5 %                 77.9 %
LLaMA‑2‑13B    61.4 %                   71.0 %                 70.6 %

  • Accuracy: MaT matches or slightly trails the best refinement pipelines (within 0.6 % absolute).
  • Inference Cost: MaT reduces token usage by ~55 % and GPU time by ~70 % compared to running a full refinement step for each query.
  • Speed: End‑to‑end latency drops from ~1.8 s per query (refinement) to ~0.6 s (MaT).
  • Scalability: Memory size grows linearly with the number of distinct guidelines; retrieval remains fast thanks to simple lexical similarity and optional vector indexing.
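
To illustrate the retrieval point, the lexical matcher in the earlier sketch can be swapped for an optional vector index without changing the on‑disk layout. The embed callable below is an assumed stand‑in for any sentence‑embedding model; the paper does not prescribe one.

```python
# Optional vector index over the same guideline store (illustrative only;
# `embed` is an assumed stand-in for a sentence-embedding model).
import numpy as np


class VectorIndex:
    def __init__(self, embed):
        self.embed = embed    # callable: str -> 1-D numpy array
        self.vectors = []     # cached guideline embeddings
        self.texts = []

    def add(self, guideline: str) -> None:
        self.vectors.append(self.embed(guideline))
        self.texts.append(guideline)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        if not self.vectors:
            return []
        matrix = np.vstack(self.vectors)  # (n_guidelines, dim); grows linearly with memory size
        q = self.embed(query)
        # Cosine similarity between the query and every stored guideline.
        scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-9)
        best = np.argsort(scores)[::-1][:top_k]
        return [self.texts[i] for i in best]
```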

Practical Implications

  • Developer Tooling: IDE assistants or code review bots can store “gotchas” from past reviews and instantly apply them to new suggestions, cutting down on repeated prompting (a short usage sketch follows this list).
  • Customer Support: Chatbots can accumulate policy clarifications or FAQ tweaks as guidelines, delivering higher‑quality answers without re‑training.
  • Education Platforms: Adaptive tutoring systems can remember rubric‑based feedback for each student and reuse it to give faster, personalized hints.
  • Cost‑Sensitive Deployments: SaaS providers can lower cloud‑compute bills by swapping expensive multi‑turn refinement for a cheap memory lookup, enabling real‑time LLM services at scale.
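
As a hypothetical example of the developer‑tooling case, the helpers from the Methodology sketch could be wired up as follows; stub_llm is a canned stand‑in for a real model call.

```python
# Hypothetical usage, building on the FileMemory sketch from the Methodology
# section. `stub_llm` stands in for a real model; it returns a canned reply.
def stub_llm(prompt: str) -> str:
    return "Always check for null inputs before accessing fields."

memory = FileMemory(root="review_memory")
distill_and_store(stub_llm, memory, domain="python/code-review",
                  critique="Your explanation missed the edge case about null inputs.")
print(answer_with_memory(stub_llm, memory, "Suggest a refactor for this parsing function."))
```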

Limitations & Future Work

  • Memory Bloat: As guidelines accumulate, retrieval may become noisy; the paper suggests pruning strategies but does not fully explore them.
  • Domain Transfer: Guidelines distilled in one domain (e.g., programming) may not generalize well to another without explicit re‑contextualization.
  • Tool‑Call Overhead: While lightweight, the agent’s decision logic adds a small constant overhead that could matter in ultra‑low‑latency settings.
  • Future Directions: The authors plan to investigate hierarchical memory structures, automatic guideline summarization, and integration with retrieval‑augmented generation (RAG) pipelines to further boost scalability and cross‑domain applicability.

Authors

  • Víctor Gallego

Paper Information

  • arXiv ID: 2601.05960v1
  • Categories: cs.CL
  • Published: January 9, 2026