[Paper] KLong: Training LLM Agent for Extremely Long-horizon Tasks

Published: February 19, 2026
4 min read
Source: arXiv - 2602.17547v1

Overview

The paper presents KLong, an open‑source large language model (LLM) agent designed to tackle extremely long‑horizon tasks, such as multi‑step research projects or complex software‑development pipelines whose trajectories can span tens of thousands of tokens. By combining a novel trajectory‑splitting supervised fine‑tuning (SFT) stage with a progressive reinforcement‑learning (RL) schedule, the authors achieve performance that rivals (and in some cases exceeds) much larger commercial models.

Key Contributions

  • Cold‑start recipe: A comprehensive SFT pipeline that awakens basic “agentic” abilities in a base LLM before any long‑horizon training.
  • Research‑Factory pipeline: An automated data‑generation system that scrapes research papers, builds evaluation rubrics, and creates high‑quality long‑trajectory examples distilled from Claude 4.5 Sonnet (Thinking).
  • Trajectory‑splitting SFT: A method that preserves early context while progressively truncating later context and overlapping sub‑trajectories, enabling stable fine‑tuning on ultra‑long sequences.
  • Progressive RL scheduler: A multi‑stage RL regime that gradually extends the allowed “timeout” (the per‑episode reasoning budget, measured in tokens) so the model learns to plan farther ahead without collapsing.
  • Empirical dominance: KLong‑106B outperforms the 1‑trillion‑parameter Kimi K2 Thinking by +11.28 % on PaperBench and shows consistent gains on coding suites such as SWE‑bench Verified and MLE‑bench.

Methodology

  1. Cold‑start SFT – The base model (≈106 B parameters) is first fine‑tuned on a diverse set of short‑to‑medium tasks (question answering, code generation, planning) to give it a solid foundation of tool use, self‑reflection, and instruction following.
  2. Data generation with Research‑Factory
    • Crawl a large corpus of research papers.
    • Automatically extract a task rubric (goal, success criteria, intermediate milestones).
    • Use Claude 4.5 Sonnet to produce step‑by‑step solution trajectories that can be tens of thousands of tokens long.
  3. Trajectory‑splitting SFT
    • Split each ultra‑long trajectory into overlapping windows.
    • Early windows retain the full preceding context; later windows drop older tokens gradually, keeping a “sliding‑window” of relevant history.
    • Train the model on all windows simultaneously, which teaches it to maintain long‑range coherence without hitting GPU memory limits.
  4. Progressive RL
    • Stage 1: RL with a short timeout (e.g., 256 tokens) to reinforce basic planning.
    • Stages 2–N: Incrementally increase the timeout (512 → 1024 → 2048 → …) so the policy learns to allocate resources across longer horizons.
    • Reward function blends rubric‑based task completion, tool‑use efficiency, and self‑critique scores.
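The staged schedule and blended reward described above can be written down concretely. The sketch below is a minimal illustration, not the authors' implementation: the number of stages, the doubling schedule, and the reward weights are assumptions.

```python
# Sketch of a progressive RL schedule: the per-episode token budget
# ("timeout") doubles each stage, and the reward blends three signals.
# Stage count, doubling rule, and weights are illustrative assumptions.

def timeout_schedule(num_stages: int, base_timeout: int = 256) -> list[int]:
    """Token budgets for each RL stage: 256, 512, 1024, ..."""
    return [base_timeout * (2 ** i) for i in range(num_stages)]

def blended_reward(task_completion: float,
                   tool_efficiency: float,
                   self_critique: float,
                   weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted mix of rubric-based task completion, tool-use
    efficiency, and self-critique, each assumed to lie in [0, 1]."""
    w_task, w_tool, w_critique = weights
    return (w_task * task_completion
            + w_tool * tool_efficiency
            + w_critique * self_critique)

# Example: four stages with budgets 256 -> 2048 tokens.
stages = timeout_schedule(4)        # [256, 512, 1024, 2048]
r = blended_reward(0.9, 0.5, 0.7)   # 0.6*0.9 + 0.2*0.5 + 0.2*0.7 ≈ 0.78
```

In practice the stage transitions would be driven by the training loop itself, advancing to the next budget once the policy is stable at the current one.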

Results & Findings

| Benchmark | Relative Δ (KLong‑106B vs. Kimi K2 Thinking, 1 T) |
| --- | --- |
| PaperBench (research‑task suite) | +11.28 % |
| SWE‑bench Verified (software engineering) | +6.4 % |
| MLE‑bench (machine‑learning engineering) | +5.9 % |

  • Generalization: Gains persist even when the evaluation tasks differ from the training distribution (e.g., coding vs. research).
  • Stability: The trajectory‑splitting SFT prevents catastrophic forgetting of early context, a common failure mode when fine‑tuning on very long sequences.
  • Efficiency: KLong achieves these results with a 106 B model—roughly a tenth of the parameters of the competing 1 T model—showcasing a favorable compute‑to‑performance ratio.
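The trajectory‑splitting scheme behind the stability result can be sketched as a simple windowing routine. This is an illustrative reconstruction: the window size, overlap, and the rule that only the earliest windows keep the full prefix are assumptions based on the paper's description.

```python
def split_trajectory(tokens: list[int],
                     window: int = 8,
                     overlap: int = 2,
                     full_prefix_windows: int = 1) -> list[list[int]]:
    """Split an ultra-long trajectory into overlapping training windows.

    The first `full_prefix_windows` windows keep the entire preceding
    context; later windows slide forward, dropping older tokens so each
    training example stays within a fixed memory budget.
    """
    windows = []
    step = window - overlap
    for i, start in enumerate(range(0, max(len(tokens) - overlap, 1), step)):
        end = min(start + window, len(tokens))
        if i < full_prefix_windows:
            windows.append(tokens[:end])       # retain the full prefix
        else:
            windows.append(tokens[start:end])  # sliding window of recent history
        if end == len(tokens):
            break
    return windows

# A 20-token trajectory with window=8 and overlap=2 yields windows whose
# starts advance by 6 tokens; consecutive windows share 2 tokens.
parts = split_trajectory(list(range(20)))
```

Training on all windows exposes the model to every region of the trajectory while keeping each example bounded, which is what prevents the catastrophic forgetting of early context noted above.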

Practical Implications

  • Research assistants: Developers can embed KLong in literature‑review pipelines to automatically generate structured research plans, experiment designs, and even draft sections of papers.
  • Long‑running code generation: In complex software projects (e.g., multi‑module systems, data pipelines), KLong can maintain context across hundreds of files, reducing the need for manual prompt engineering or chunk‑by‑chunk stitching.
  • Tool‑augmented agents: Because KLong learns to invoke external tools (search APIs, code interpreters) over long horizons, it can serve as a more reliable backbone for autonomous agents in DevOps, CI/CD automation, or cloud‑resource provisioning.
  • Open‑source accessibility: The released code and data pipelines let teams replicate the training recipe on their own hardware, enabling custom domain‑specific long‑horizon agents without paying for trillion‑parameter APIs.

Limitations & Future Work

  • Data bias: Training trajectories are distilled from Claude 4.5 Sonnet, so any systematic biases or hallucinations in that model may propagate into KLong.
  • Memory constraints: Although trajectory‑splitting mitigates GPU limits, training still requires high‑end hardware (multiple A100/H100 GPUs) for the 106 B model.
  • Evaluation scope: Benchmarks focus on research and coding tasks; real‑world deployment in domains like legal reasoning or scientific simulation remains untested.
  • Future work: The authors propose expanding the Research‑Factory to non‑paper domains (e.g., design documents), integrating retrieval‑augmented generation for even longer contexts, and exploring curriculum‑learning strategies that adapt the RL timeout to task difficulty rather than following a fixed schedule.

Authors

  • Yue Liu
  • Zhiyuan Hu
  • Flood Sung
  • Jiaheng Zhang
  • Bryan Hooi

Paper Information

  • arXiv ID: 2602.17547v1
  • Categories: cs.AI, cs.CL
  • Published: February 19, 2026