[Paper] Context as a Tool: Context Management for Long-Horizon SWE-Agents

Published: December 26, 2025 at 12:15 PM EST
4 min read
Source: arXiv - 2512.22087v1

Overview

Large‑language‑model (LLM) agents are increasingly being used to automate software‑engineering (SWE) tasks that span many steps and involve navigating massive code repositories. Existing agents typically keep adding every new interaction to a growing “prompt” (append‑only) or rely on ad‑hoc, passive compression tricks, which quickly leads to context overflow, loss of important semantics, and poorer reasoning. The paper “Context as a Tool: Context Management for Long‑Horizon SWE‑Agents” introduces CAT, a new paradigm that treats context handling as an explicit, callable tool that the agent can invoke whenever it needs to summarize, prune, or reorganize its memory. By doing so, CAT keeps the agent’s reasoning focused, scalable, and robust even under tight token budgets.

Key Contributions

  • CAT framework – formalizes a three‑layered context workspace (stable task semantics, condensed long‑term memory, high‑fidelity short‑term interactions) and exposes a context‑management tool that agents can call on demand.
  • Trajectory‑level supervision – a data‑generation pipeline (CAT‑GENERATOR) that injects realistic context‑management actions into full interaction traces, enabling supervised training of context‑aware agents.
  • SWE‑Compressor model – a specialized LLM trained with CAT‑GENERATOR data to learn when and how to compress historical trajectories into concise, actionable summaries.
  • Empirical validation – on the challenging SWE‑Bench‑Verified benchmark, SWE‑Compressor achieves a 57.6 % solved rate, outperforming ReAct‑style agents and static compression baselines while staying within a fixed token budget.
  • Demonstrated stability – the approach maintains consistent reasoning quality across long‑running sessions, mitigating semantic drift and context explosion.

Methodology

Structured Context Workspace

  • Stable Task Semantics: immutable high‑level description of the overall goal (e.g., “refactor authentication module”).
  • Condensed Long‑Term Memory: periodic summaries of earlier steps, stored in a compact form.
  • Short‑Term Interactions: the most recent dialogue and code snippets retained verbatim for fine‑grained reasoning.
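
To make the layering concrete, here is a minimal Python sketch of such a workspace; the class name `ContextWorkspace` and its fields are illustrative assumptions rather than the paper's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextWorkspace:
    """Illustrative three-layer context workspace (names are assumptions)."""
    task_semantics: str                                         # immutable high-level goal, set once
    long_term_memory: List[str] = field(default_factory=list)  # condensed summaries of earlier steps
    short_term: List[str] = field(default_factory=list)        # recent interactions kept verbatim

    def render_prompt(self) -> str:
        # Concatenate layers in order of stability: goal, then summaries, then raw recent turns.
        return "\n\n".join([self.task_semantics, *self.long_term_memory, *self.short_term])
```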

Context‑Management as a Callable Tool

  • The agent can issue a compress_context() call at any point.
  • The tool takes the current workspace, decides what to summarize, and returns an updated, smaller representation.
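
A hedged sketch of what the callable compression tool could look like on top of the workspace above; `summarize` stands in for the SWE‑Compressor model, and none of these names come from the paper.

```python
def compress_context(workspace: ContextWorkspace, summarize) -> ContextWorkspace:
    """Fold the verbatim short-term buffer into one condensed long-term entry (sketch)."""
    if not workspace.short_term:
        return workspace                        # nothing to compress yet
    # `summarize` is any callable (e.g., an LLM call) mapping raw interaction text
    # to a short, actionable summary; it is a placeholder for SWE-Compressor.
    summary = summarize("\n".join(workspace.short_term))
    workspace.long_term_memory.append(summary)  # keep the gist in long-term memory
    workspace.short_term.clear()                # drop the raw turns to free tokens
    return workspace
```

In an agent loop, this function would sit in the same tool registry as file‑editing or test‑running tools, so the policy can weigh compression against any other action at each step.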

CAT‑GENERATOR Pipeline

  1. Offline trajectory collection: gather full interaction logs from existing SWE agents.
  2. Annotation of compression points: automatically insert “compress” actions at logical milestones (e.g., after a module is fully explored).
  3. Supervised training data: each annotated step pairs the pre‑compression context with the desired post‑compression summary.
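
The pairing step could look roughly like the following; the trajectory representation, the `milestones` indices, and the `reference_summary` helper are hypothetical stand‑ins for whatever CAT‑GENERATOR actually uses.

```python
def build_training_pairs(trajectory, milestones, reference_summary):
    """Turn one annotated trajectory into (pre-compression context, target summary) pairs."""
    pairs = []
    for step_idx in milestones:                            # e.g., "module fully explored"
        pre_context = "\n".join(trajectory[:step_idx])     # everything the agent has seen so far
        target = reference_summary(trajectory[:step_idx])  # desired condensed summary
        pairs.append({"input": pre_context, "target": target})
    return pairs
```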

Training SWE‑Compressor

  • Fine‑tune a base LLM (e.g., Llama‑2‑13B) on the CAT‑GENERATOR dataset.
  • The model learns to predict when to compress and what summary to produce, conditioned on the three‑layer workspace.
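
As a rough illustration of the supervised setup, the sketch below fine‑tunes a causal LM on such pairs with Hugging Face's Trainer; the base checkpoint, prompt format, and hyperparameters are assumptions, not the paper's reported configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Illustrative fine-tuning sketch; model choice and hyperparameters are placeholders.
model_name = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token        # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def encode(example):
    # One causal-LM sequence per CAT-GENERATOR pair: context, then the target summary.
    text = example["input"] + "\n### Summary:\n" + example["target"]
    return tokenizer(text, truncation=True, max_length=4096)

train_dataset = Dataset.from_list(pairs).map(encode)  # `pairs` from build_training_pairs above
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # labels follow input_ids
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="swe-compressor", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```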

Evaluation Protocol

  • Deploy the agent on SWE‑Bench‑Verified tasks with a hard token limit (e.g., 8 k tokens).
  • Compare success rates, token usage, and reasoning stability against ReAct agents (which rely on reactive tool calls) and static compression heuristics (e.g., truncation, fixed‑interval summarization).
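
A possible shape for the budgeted evaluation loop, reusing the sketches above; `agent_step`, `count_tokens`, and the 8 k figure are illustrative assumptions.

```python
TOKEN_BUDGET = 8_000  # hard per-task limit as described above (the exact value is illustrative)

def run_episode(task, workspace, agent_step, summarize, count_tokens, max_steps=50):
    """Drive one SWE-Bench-Verified task under a fixed token budget (sketch)."""
    for _ in range(max_steps):
        prompt = workspace.render_prompt()
        # The CAT agent can call the compression tool itself; this guard is only
        # a backstop so the prompt never exceeds the hard budget.
        if count_tokens(prompt) > TOKEN_BUDGET:
            workspace = compress_context(workspace, summarize)
            prompt = workspace.render_prompt()
        action, done = agent_step(task, prompt)   # one reasoning/tool-use step
        workspace.short_term.append(action)
        if done:
            return True
    return False
```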

Results & Findings

| Metric | CAT‑enabled SWE‑Compressor | ReAct‑based Agent | Static Compression |
| --- | --- | --- | --- |
| Solved Rate | 57.6 % | 42.3 % | 38.9 % |
| Avg. Tokens Used | 7.2 k (within budget) | 9.1 k (overrun) | 8.5 k |
| Reasoning Consistency (drop‑off) | < 2 % | 12 % | 9 % |
| Compression Overhead (time per call) | 0.12 s | N/A | N/A |

  • Higher success: By actively summarizing only when needed, the agent retains the most relevant information, leading to a 15‑point boost over the ReAct baseline.
  • Token efficiency: The workspace stays under the preset budget, preventing the “context explosion” that typically forces agents to truncate useful history.
  • Stability: Accuracy degradation across long horizons is dramatically reduced, confirming that proactive compression mitigates semantic drift.

Practical Implications

  • Scalable Code‑Assistants: Developers can embed CAT‑enabled agents in IDE plugins or CI pipelines, confident that the assistant will remain responsive even after dozens of back‑and‑forth edits across a large repo.
  • Cost‑Effective LLM Usage: By keeping token counts low without sacrificing performance, teams can lower inference costs on commercial LLM APIs (e.g., OpenAI, Anthropic).
  • Better Tool Integration: Since context management is a first‑class tool, it can be combined with other agent capabilities (e.g., test generation, bug localization) in a unified decision‑making loop.
  • Customizable Summaries: Organizations can tailor the compression policy (e.g., more aggressive for security‑critical code) by fine‑tuning the SWE‑Compressor on domain‑specific trajectories.
  • Reduced Hallucination: Maintaining a concise, high‑fidelity short‑term buffer helps the model stay grounded in the actual code, lowering the risk of generating incorrect patches.

Limitations & Future Work

  • Domain Generalization: The current training data focuses on open‑source Python/JavaScript projects; performance on other languages or highly proprietary codebases remains untested.
  • Compression Granularity: The model decides when to compress but does not expose fine‑grained control to users (e.g., “keep all function signatures”).
  • Offline Supervision Dependency: CAT‑GENERATOR requires a substantial corpus of annotated trajectories, which may be costly to produce for niche domains.

Future Directions

  • Extending CAT to multi‑agent collaboration scenarios where several agents share a common context workspace.
  • Exploring reinforcement‑learning‑based policies that dynamically balance compression aggressiveness against task success.
  • Integrating static analysis tools to enrich the condensed long‑term memory with semantic graphs, further improving reasoning fidelity.

Authors

  • Shukai Liu
  • Jian Yang
  • Bo Jiang
  • Yizhi Li
  • Jinyang Guo
  • Xianglong Liu
  • Bryan Dai

Paper Information

  • arXiv ID: 2512.22087v1
  • Categories: cs.CL
  • Published: December 26, 2025