[Paper] Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Source: arXiv - 2602.06025v1
Overview
Large‑language‑model (LLM) agents are starting to use external memory so they can reason over information that doesn’t fit in a single context window. Existing pipelines usually build this memory offline and without looking at the actual query, which can waste compute and even drop information that’s crucial for the current task. The paper Learning Query‑Aware Budget‑Tier Routing for Runtime Agent Memory introduces BudgetMem, a runtime‑centric memory system that lets developers explicitly trade off answer quality against the cost of constructing and using memory.
Key Contributions
- Budget‑tiered memory modules – each module (e.g., retrieval, summarisation, reasoning) is offered in three pre‑defined “budget” levels (Low / Mid / High) that differ in complexity, inference behavior, or model size.
- Lightweight routing policy – a compact neural controller, trained with reinforcement learning, decides per‑query which tier to use for each module, thereby shaping the overall cost‑performance curve.
- Unified testbed – the authors wrap the three budget‑tier strategies (implementation, reasoning, capacity) into a single framework, enabling systematic comparison across diverse benchmarks (LoCoMo, LongMemEval, HotpotQA).
- Empirical gains – BudgetMem outperforms strong baselines when the budget is generous and, more importantly, delivers a superior accuracy‑vs‑cost frontier when resources are tight.
- Analytical insights – the study disentangles when each tiering axis (method complexity, inference style, model capacity) is most beneficial, offering practical guidance for system designers.
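To make the tiered-module idea concrete, here is a minimal sketch of what a budget-tiered component might look like. This is an illustrative assumption, not the paper's implementation: the `TieredModule` class, the lambda stand-ins for retrieval backends, and the cost numbers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Tier(Enum):
    LOW = 0
    MID = 1
    HIGH = 2


@dataclass
class TieredModule:
    """A memory-pipeline component offered at three budget levels."""
    name: str
    variants: Dict[Tier, Callable[[str], str]]  # tier -> implementation
    cost: Dict[Tier, float]                     # relative compute cost per tier

    def run(self, tier: Tier, query: str) -> str:
        return self.variants[tier](query)


# Hypothetical retrieval module: a cheap sparse retriever at the Low tier,
# a dense retriever at the High tier (backends mocked as strings here).
retrieval = TieredModule(
    name="retrieval",
    variants={
        Tier.LOW: lambda q: f"bm25_hits({q})",
        Tier.MID: lambda q: f"hybrid_hits({q})",
        Tier.HIGH: lambda q: f"dense_hits({q})",
    },
    cost={Tier.LOW: 1.0, Tier.MID: 3.0, Tier.HIGH: 10.0},
)
```

A router then only has to pick a `Tier` per module per query; the module interface stays the same across tiers.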
Methodology
- Memory Modules – The system decomposes the agent’s memory pipeline into reusable components (e.g., document retrieval, passage summarisation, answer generation).
- Budget Tiers – three axes along which each module can be made cheaper or richer:
  - Implementation tier: same algorithmic idea but with cheaper vs. richer implementations (e.g., BM25 vs. dense retrieval).
  - Reasoning tier: different inference behaviours, such as single‑shot prompting vs. multi‑step chain‑of‑thought.
  - Capacity tier: smaller vs. larger underlying models (e.g., 7B vs. 13B parameters).
- Router Policy – A small transformer‑based policy network receives a query embedding and lightweight statistics about the current memory state, then outputs a tier choice for each module.
- Training – The router is trained with reinforcement learning where the reward balances task accuracy (e.g., exact‑match on HotpotQA) against a budget penalty proportional to compute time or token usage.
- Evaluation – Experiments sweep across three budget regimes (tight, moderate, generous) and compare against static‑tier baselines (always Low, always High) and prior runtime memory approaches.
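The reward shaping and per-query routing described above can be sketched as follows. This is a toy stand-in, not the paper's method: the actual router is a learned policy network trained with RL, whereas the heuristic `route` rule below, the `TIER_COST` and `DIFFICULTY_FLOOR` tables, and the `penalty` coefficient are all illustrative assumptions.

```python
# Hypothetical relative costs and difficulty thresholds per tier.
TIER_COST = {"Low": 1.0, "Mid": 3.0, "High": 10.0}
DIFFICULTY_FLOOR = {"High": 0.7, "Mid": 0.3, "Low": 0.0}


def reward(accuracy: float, cost: float, budget: float,
           penalty: float = 1.0) -> float:
    """RL reward: task accuracy minus a penalty for exceeding the budget."""
    overrun = max(0.0, cost - budget)
    return accuracy - penalty * overrun


def route(query_difficulty: float, budget: float) -> str:
    """Heuristic stand-in for the learned router: pick the richest tier
    that both fits the budget and is warranted by query difficulty."""
    for tier in ("High", "Mid", "Low"):
        if TIER_COST[tier] <= budget and query_difficulty >= DIFFICULTY_FLOOR[tier]:
            return tier
    return "Low"  # fall back to the cheapest tier
```

In training, the policy would be rewarded for routing hard queries to rich tiers only when the budget allows, which is what shapes the accuracy-vs-cost frontier reported below.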
Results & Findings
| Benchmark | High‑budget (max tier) | Tight‑budget (Low tier) | Budget‑aware (BudgetMem) |
|---|---|---|---|
| LoCoMo | +3.2 % EM over baseline | –1.1 % EM vs. baseline | +2.0 % EM while staying under budget |
| LongMemEval | +4.5 % F1 | –0.8 % F1 | +3.1 % F1 with 30 % less compute |
| HotpotQA | +5.0 % EM | –0.5 % EM | +4.2 % EM at 40 % lower latency |
- Accuracy‑cost frontier: BudgetMem consistently dominates static baselines, delivering higher scores for the same compute budget and lower compute for the same score.
- Tier‑axis analysis:
  - Implementation tiers shine when the budget is extremely tight (cheap retrieval still finds the right document).
  - Reasoning tiers give the biggest boost in the mid‑budget regime (chain‑of‑thought reasoning adds value without exploding cost).
  - Capacity tiers dominate only when the budget is generous, confirming that scaling model size is not the most efficient lever under constraints.
Practical Implications
- Dynamic cost control – Deployments (e.g., SaaS LLM assistants, chatbots) can expose a “performance budget” knob to customers, letting the system automatically dial up/down the memory sophistication per request.
- Resource‑aware scaling – Cloud providers can schedule cheaper memory pipelines for low‑priority queries while reserving high‑tier modules for premium or time‑critical tasks, improving overall throughput.
- Reduced hallucinations – By routing queries that need deep reasoning to higher reasoning tiers, agents can retrieve and synthesize more relevant context, mitigating common “out‑of‑scope” errors.
- Plug‑and‑play architecture – Because BudgetMem treats each memory component as a modular block, existing retrieval or summarisation services can be swapped in with minimal engineering effort.
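The “performance budget knob” could be exposed as a per-request parameter that a planner spends greedily across pipeline stages. This is a hypothetical API sketch, not from the paper: the stage names, `STAGE_COSTS` table, and the greedy `plan_pipeline` allocation rule are all assumptions.

```python
# Hypothetical per-stage, per-tier costs (arbitrary units).
STAGE_COSTS = {
    "retrieval":     {"Low": 1, "Mid": 3, "High": 10},
    "summarisation": {"Low": 1, "Mid": 4, "High": 12},
    "generation":    {"Low": 2, "Mid": 6, "High": 20},
}


def plan_pipeline(budget: float) -> dict:
    """Greedy sketch: walk the stages in order and pick the richest tier
    that still leaves enough budget to run every remaining stage at Low."""
    plan = {}
    remaining = budget
    order = list(STAGE_COSTS)
    for i, stage in enumerate(order):
        # Minimum budget the remaining stages need at their cheapest tier.
        floor = sum(STAGE_COSTS[s]["Low"] for s in order[i + 1:])
        for tier in ("High", "Mid", "Low"):
            if STAGE_COSTS[stage][tier] <= remaining - floor:
                plan[stage] = tier
                remaining -= STAGE_COSTS[stage][tier]
                break
        else:
            # Budget already exhausted: run at the cheapest tier anyway.
            plan[stage] = "Low"
            remaining -= STAGE_COSTS[stage]["Low"]
    return plan
```

A generous budget yields all-High tiers, while a tight one degrades every stage to Low, mirroring the static baselines the paper compares against; the learned router replaces this greedy rule with a query-aware decision.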
Limitations & Future Work
- Training overhead – The reinforcement‑learning router requires a separate optimisation phase; the authors note that the policy may need re‑training when new modules or datasets are added.
- Budget definition – The current experiments use compute time and token count as proxies for cost; real‑world deployments may need to incorporate memory bandwidth, GPU allocation, or monetary pricing.
- Generalisation – The router’s decisions are evaluated on the same benchmark families used for training; cross‑domain robustness (e.g., from QA to code generation) remains an open question.
- Future directions suggested include: (1) meta‑learning the router to adapt on‑the‑fly to novel tasks, (2) extending the tier space to incorporate retrieval‑augmented generation models, and (3) exploring multi‑objective optimization that jointly considers latency, energy, and user‑satisfaction metrics.
Authors
- Haozhen Zhang
- Haodong Yue
- Tao Feng
- Quanyu Long
- Jianzhu Bao
- Bowen Jin
- Weizhi Zhang
- Xiao Li
- Jiaxuan You
- Chengwei Qin
- Wenya Wang
Paper Information
- arXiv ID: 2602.06025v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 5, 2026