[Paper] Distilling Feedback into Memory-as-a-Tool

Published: January 9, 2026 at 12:26 PM EST
3 min read
Source: arXiv - 2601.05960v1

Overview

The paper introduces “Distilling Feedback into Memory‑as‑a‑Tool,” a framework that lets large language models (LLMs) turn the fleeting critiques they receive during inference into permanent, searchable guidelines. By storing these distilled insights in a file‑based memory and letting the model invoke them as tools, the approach matches the quality of heavyweight test‑time refinement pipelines to within a fraction of a percentage point while cutting inference cost dramatically.

Key Contributions

  • Memory‑as‑a‑Tool (MaT) architecture: a lightweight, file‑system‑style store that holds distilled feedback guidelines for rapid retrieval.
  • Agent‑controlled tool calls: the LLM decides when to read from or write to the memory, treating it like an external utility rather than a static prompt (a tool‑interface sketch follows this list).
  • Rubric Feedback Bench: a new benchmark dataset that evaluates how well models can learn from rubric‑based feedback across multiple tasks.
  • Cost‑effective performance: empirical results show MaT‑augmented LLMs reach the accuracy of full test‑time refinement with up to 70 % fewer compute cycles.
  • Generalizable pipeline: the framework works with any off‑the‑shelf LLM and does not require fine‑tuning, making it easy to plug into existing systems.
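
To make the agent‑controlled tool calls concrete, here is a minimal sketch of how the memory could be exposed to the model through function‑calling tool definitions. Only the read_memory name comes from the paper; the write‑side tool, the parameter names, and the schema layout are illustrative assumptions.

```python
# Hypothetical tool definitions exposing the file-based memory to an LLM via
# function calling. Only `read_memory` is named in the paper; everything else
# (write_memory, parameter names, schema layout) is an assumption.
MEMORY_TOOLS = [
    {
        "name": "read_memory",
        "description": "Retrieve stored guidelines relevant to the current task.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Topic the guideline should cover."},
                "top_k": {"type": "integer", "description": "Maximum number of guidelines to return."},
            },
            "required": ["query"],
        },
    },
    {
        "name": "write_memory",
        "description": "Persist a distilled guideline under a task/domain path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory reflecting the task or domain."},
                "guideline": {"type": "string", "description": "Concise rule distilled from a critique."},
            },
            "required": ["path", "guideline"],
        },
    },
]
```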

Methodology

  1. Feedback Collection: During a standard inference pass, the LLM generates an answer and then receives a short critique (e.g., “Your explanation missed the edge case about null inputs”).
  2. Distillation Step: The model processes the critique and extracts a concise guideline (e.g., “Always check for null before accessing fields”).
  3. Memory Write: The guideline is saved as a plain‑text file in a hierarchical directory that reflects the task or domain.
  4. Tool‑Call Decision: On subsequent inputs, the LLM can issue a read_memory tool call, retrieving the most relevant guidelines based on a similarity query.
  5. Guideline‑Guided Generation: The retrieved guidelines are injected into the prompt as context, steering the model toward better answers without re‑running the full refinement loop.

The whole loop is orchestrated by a lightweight agent that decides when to read, write, or ignore the memory, keeping the process fully decoupled from the underlying LLM: no gradient updates or fine‑tuning are required.
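
Below is a compact sketch of this distill‑write‑retrieve loop, assuming a plain‑text, directory‑per‑domain layout and a simple word‑overlap retriever; the llm argument stands in for any off‑the‑shelf model, and none of the helper names come from the paper.

```python
# Minimal sketch of the distill -> write -> retrieve loop described above.
# The file layout, scoring function, and `llm` callable are illustrative
# assumptions; the paper specifies a file-based store with agent-driven
# read/write tool calls, not this exact code.
from pathlib import Path


class FileMemory:
    """Plain-text guidelines stored in a hierarchical directory."""

    def __init__(self, root: str = "memory"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, domain: str, guideline: str) -> None:
        folder = self.root / domain
        folder.mkdir(parents=True, exist_ok=True)
        # One guideline per file keeps writes append-only and easy to prune later.
        idx = len(list(folder.glob("*.txt")))
        (folder / f"guideline_{idx:04d}.txt").write_text(guideline, encoding="utf-8")

    def read(self, query: str, top_k: int = 3) -> list[str]:
        # Simple lexical similarity: rank guidelines by word overlap with the query.
        query_words = set(query.lower().split())
        scored = []
        for path in self.root.rglob("*.txt"):
            text = path.read_text(encoding="utf-8")
            overlap = len(query_words & set(text.lower().split()))
            scored.append((overlap, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:top_k] if score > 0]


def distill_and_store(llm, memory: FileMemory, domain: str, critique: str) -> str:
    """Turn a transient critique into a persistent, reusable guideline."""
    guideline = llm(f"Rewrite this critique as one short, general guideline:\n{critique}")
    memory.write(domain, guideline)
    return guideline


def answer_with_memory(llm, memory: FileMemory, task: str) -> str:
    """Retrieve relevant guidelines and inject them as context before answering."""
    guidelines = memory.read(task)
    context = "\n".join(f"- {g}" for g in guidelines)
    return llm(f"Guidelines from past feedback:\n{context}\n\nTask: {task}")
```

In this sketch the expensive critique‑and‑distill step runs once per mistake; subsequent related queries pay only for a cheap file lookup, which is where the reported cost savings come from.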

Results & Findings

Model          Baseline (no feedback)   Test‑time Refinement   MaT‑augmented LLM
GPT‑3.5        68.2 %                   78.5 %                 77.9 %
LLaMA‑2‑13B    61.4 %                   71.0 %                 70.6 %

  • Accuracy: MaT matches or slightly trails the best refinement pipelines (within 0.6 % absolute).
  • Inference Cost: MaT reduces token usage by ~55 % and GPU time by ~70 % compared to running a full refinement step for each query.
  • Speed: End‑to‑end latency drops from ~1.8 s per query (refinement) to ~0.6 s (MaT).
  • Scalability: Memory size grows linearly with the number of distinct guidelines; retrieval remains fast thanks to simple lexical similarity and optional vector indexing.
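
To illustrate the retrieval point, the lexical matcher in the earlier sketch can be swapped for an optional vector index without changing the on‑disk layout. The embed callable below is an assumed stand‑in for any sentence‑embedding model; the paper does not prescribe one.

```python
# Optional vector index over the same guideline store (illustrative only;
# `embed` is an assumed stand-in for a sentence-embedding model).
import numpy as np


class VectorIndex:
    def __init__(self, embed):
        self.embed = embed    # callable: str -> 1-D numpy array
        self.vectors = []     # cached guideline embeddings
        self.texts = []

    def add(self, guideline: str) -> None:
        self.vectors.append(self.embed(guideline))
        self.texts.append(guideline)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        if not self.vectors:
            return []
        matrix = np.vstack(self.vectors)  # (n_guidelines, dim); grows linearly with memory size
        q = self.embed(query)
        # Cosine similarity between the query and every stored guideline.
        scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-9)
        best = np.argsort(scores)[::-1][:top_k]
        return [self.texts[i] for i in best]
```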

Practical Implications

  • Developer Tooling: IDE assistants or code review bots can store “gotchas” from past reviews and instantly apply them to new suggestions, cutting down on repeated prompting (a short usage sketch follows this list).
  • Customer Support: Chatbots can accumulate policy clarifications or FAQ tweaks as guidelines, delivering higher‑quality answers without re‑training.
  • Education Platforms: Adaptive tutoring systems can remember rubric‑based feedback for each student and reuse it to give faster, personalized hints.
  • Cost‑Sensitive Deployments: SaaS providers can lower cloud‑compute bills by swapping expensive multi‑turn refinement for a cheap memory lookup, enabling real‑time LLM services at scale.
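
As a hypothetical example of the developer‑tooling case, the helpers from the Methodology sketch could be wired up as follows; stub_llm is a canned stand‑in for a real model call.

```python
# Hypothetical usage, building on the FileMemory sketch from the Methodology
# section. `stub_llm` stands in for a real model; it returns a canned reply.
def stub_llm(prompt: str) -> str:
    return "Always check for null inputs before accessing fields."

memory = FileMemory(root="review_memory")
distill_and_store(stub_llm, memory, domain="python/code-review",
                  critique="Your explanation missed the edge case about null inputs.")
print(answer_with_memory(stub_llm, memory, "Suggest a refactor for this parsing function."))
```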

Limitations & Future Work

  • Memory Bloat: As guidelines accumulate, retrieval may become noisy; the paper suggests pruning strategies but does not fully explore them.
  • Domain Transfer: Guidelines distilled in one domain (e.g., programming) may not generalize well to another without explicit re‑contextualization.
  • Tool‑Call Overhead: While lightweight, the agent’s decision logic adds a small constant overhead that could matter in ultra‑low‑latency settings.
  • Future Directions: The authors plan to investigate hierarchical memory structures, automatic guideline summarization, and integration with retrieval‑augmented generation (RAG) pipelines to further boost scalability and cross‑domain applicability.

Authors

  • Víctor Gallego

Paper Information

  • arXiv ID: 2601.05960v1
  • Categories: cs.CL
  • Published: January 9, 2026