[Paper] Distilling Feedback into Memory-as-a-Tool
Source: arXiv - 2601.05960v1
Overview
The paper introduces “Distilling Feedback into Memory‑as‑a‑Tool,” a framework that lets large language models (LLMs) turn fleeting critiques they receive during inference into permanent, searchable guidelines. By storing these distilled insights in a file‑based memory and letting the model invoke them as tools, the approach achieves the same quality as heavyweight test‑time refinement pipelines while cutting inference cost dramatically.
Key Contributions
- Memory‑as‑a‑Tool (MaT) architecture: a lightweight, file‑system‑style store that holds distilled feedback guidelines for rapid retrieval (a minimal sketch follows this list).
- Agent‑controlled tool calls: the LLM decides when to read from or write to the memory, treating it like an external utility rather than a static prompt.
- Rubric Feedback Bench: a new benchmark dataset that evaluates how well models can learn from rubric‑based feedback across multiple tasks.
- Cost‑effective performance: empirical results show MaT‑augmented LLMs come within 0.6 % of the accuracy of full test‑time refinement while using roughly 70 % less GPU time.
- Generalizable pipeline: the framework works with any off‑the‑shelf LLM and does not require fine‑tuning, making it easy to plug into existing systems.
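The paper does not ship reference code, but the first contribution is concrete enough to sketch. Below is a minimal, hypothetical file‑based store with `write_memory`/`read_memory` tool functions; the directory layout, function names, and keyword‑overlap scoring are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a file-based Memory-as-a-Tool store (not the paper's code).
# Each distilled guideline lives as a plain-text file under a per-domain directory,
# e.g. memory/programming/null_checks.txt
import re
from pathlib import Path

MEMORY_ROOT = Path("memory")  # assumed root directory for distilled guidelines


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, used for simple lexical matching."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def write_memory(domain: str, name: str, guideline: str) -> Path:
    """Persist one distilled guideline as a plain-text file in its domain folder."""
    path = MEMORY_ROOT / domain / f"{name}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(guideline.strip() + "\n", encoding="utf-8")
    return path


def read_memory(domain: str, query: str, top_k: int = 3) -> list[str]:
    """Return the top_k stored guidelines with the largest token overlap with the query."""
    folder = MEMORY_ROOT / domain
    if not folder.exists():
        return []
    query_tokens = _tokens(query)
    scored = []
    for file in sorted(folder.glob("*.txt")):
        text = file.read_text(encoding="utf-8").strip()
        scored.append((len(query_tokens & _tokens(text)), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for score, text in scored[:top_k] if score > 0]
```

Keeping guidelines as plain text keeps the memory human‑auditable and trivially editable; the same tool interface could later be backed by a vector index without changing how the agent calls it.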
Methodology
- Feedback Collection: During a standard inference pass, the LLM generates an answer and then receives a short critique (e.g., “Your explanation missed the edge case about null inputs”).
- Distillation Step: The model processes the critique and extracts a concise guideline (e.g., “Always check for null before accessing fields”).
- Memory Write: The guideline is saved as a plain‑text file in a hierarchical directory that reflects the task or domain.
- Tool‑Call Decision: On subsequent inputs, the LLM can issue a read_memory tool call, retrieving the most relevant guidelines via a similarity query.
- Guideline‑Guided Generation: The retrieved guidelines are injected into the prompt as context, steering the model toward better answers without re‑running the full refinement loop (see the sketch after this list).
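To make the distillation and injection steps concrete, here is a rough sketch under two assumptions: a generic `llm(prompt) -> str` completion callable (a placeholder for whichever model is used, not a specific API) and the hypothetical `write_memory`/`read_memory` helpers sketched above.

```python
from typing import Callable

# `llm` stands in for any text-completion callable; the framework is model-agnostic.
LLM = Callable[[str], str]


def distill_guideline(llm: LLM, critique: str) -> str:
    """Compress a free-form critique into one short, reusable guideline."""
    prompt = (
        "Rewrite the following critique as a single, general guideline that would "
        "prevent the same mistake in future answers.\n"
        f"Critique: {critique}\n"
        "Guideline:"
    )
    return llm(prompt).strip()


def answer_with_guidelines(llm: LLM, question: str, guidelines: list[str]) -> str:
    """Inject retrieved guidelines into the prompt before generating an answer."""
    context = "\n".join(f"- {g}" for g in guidelines)
    prompt = (
        "Follow these guidelines distilled from past feedback:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)
```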
The whole loop is orchestrated by a lightweight agent that decides when to read, write, or ignore the memory, keeping the process fully decoupled from the underlying LLM (no fine‑tuning or gradient updates are required).
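One way the loop could be wired together is shown below. This reuses the hypothetical helpers from the earlier sketches; the explicit checks are stand‑ins for the agent's read/write/ignore decision, which in the paper is made by the LLM itself through tool calls.

```python
def mat_step(llm: LLM, question: str, domain: str, critic=None) -> str:
    """One MaT-augmented inference step: read memory, answer, then optionally distill feedback."""
    # Read: fetch the most relevant stored guidelines for this input (may be empty).
    guidelines = read_memory(domain, question)

    # Generate: answer with guidelines injected as context; no per-query refinement loop.
    if guidelines:
        answer = answer_with_guidelines(llm, question, guidelines)
    else:
        answer = llm(question)

    # Write: if a critique is available (reviewer, rubric, or user), distill and store it.
    if critic is not None:
        critique = critic(question, answer)
        if critique:
            guideline = distill_guideline(llm, critique)
            write_memory(domain, name=f"guideline_{abs(hash(guideline))}", guideline=guideline)

    return answer
```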
Results & Findings
| Model | Baseline accuracy (no feedback) | Test‑time refinement | MaT‑augmented LLM |
|---|---|---|---|
| GPT‑3.5 | 68.2 % | 78.5 % | 77.9 % |
| LLaMA‑2‑13B | 61.4 % | 71.0 % | 70.6 % |
- Accuracy: MaT matches or slightly trails the best refinement pipelines (within 0.6 % absolute).
- Inference Cost: MaT reduces token usage by ~55 % and GPU time by ~70 % compared to running a full refinement step for each query.
- Speed: End‑to‑end latency drops from ~1.8 s per query (refinement) to ~0.6 s (MaT).
- Scalability: Memory size grows linearly with the number of distinct guidelines; retrieval remains fast thanks to simple lexical similarity and optional vector indexing (a minimal ranking sketch follows).
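The retrieval sketch above uses plain keyword overlap; the optional vector indexing mentioned here could be layered on roughly as follows, with `query_vec` and `guideline_vecs` assumed to come from any sentence‑embedding model (the embedding step itself is outside the paper's scope).

```python
import numpy as np


def cosine_rank(query_vec: np.ndarray, guideline_vecs: np.ndarray, top_k: int = 3) -> list[int]:
    """Return indices of the top_k guideline embeddings by cosine similarity to the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    g = guideline_vecs / (np.linalg.norm(guideline_vecs, axis=1, keepdims=True) + 1e-12)
    scores = g @ q  # cosine similarity of each stored guideline against the query
    return np.argsort(-scores)[:top_k].tolist()
```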
Practical Implications
- Developer Tooling: IDE assistants or code review bots can store “gotchas” from past reviews and instantly apply them to new suggestions, cutting down on repeated prompting.
- Customer Support: Chatbots can accumulate policy clarifications or FAQ tweaks as guidelines, delivering higher‑quality answers without re‑training.
- Education Platforms: Adaptive tutoring systems can remember rubric‑based feedback for each student and reuse it to give faster, personalized hints.
- Cost‑Sensitive Deployments: SaaS providers can lower cloud‑compute bills by swapping expensive multi‑turn refinement for a cheap memory lookup, enabling real‑time LLM services at scale.
Limitations & Future Work
- Memory Bloat: As guidelines accumulate, retrieval may become noisy; the paper suggests pruning strategies but does not fully explore them.
- Domain Transfer: Guidelines distilled in one domain (e.g., programming) may not generalize well to another without explicit re‑contextualization.
- Tool‑Call Overhead: While lightweight, the agent’s decision logic adds a small constant overhead that could matter in ultra‑low‑latency settings.
- Future Directions: The authors plan to investigate hierarchical memory structures, automatic guideline summarization, and integration with retrieval‑augmented generation (RAG) pipelines to further boost scalability and cross‑domain applicability.
Authors
- Víctor Gallego
Paper Information
- arXiv ID: 2601.05960v1
- Categories: cs.CL
- Published: January 9, 2026