[Paper] KLong: Training LLM Agent for Extremely Long-horizon Tasks

Published: February 19, 2026
4 min read
Source: arXiv - 2602.17547v1

Overview

The paper presents KLong, an open‑source large language model (LLM) agent designed to tackle extremely long‑horizon tasks, such as multi‑step research projects or complex software‑development pipelines whose trajectories can span tens of thousands of tokens. By combining a novel trajectory‑splitting supervised fine‑tuning (SFT) stage with a progressive reinforcement‑learning (RL) schedule, the authors achieve performance that rivals (and in some cases exceeds) much larger commercial models.

Key Contributions

  • Cold‑start recipe: A comprehensive SFT pipeline that awakens basic “agentic” abilities in a base LLM before any long‑horizon training.
  • Research‑Factory pipeline: An automated data‑generation system that scrapes research papers, builds evaluation rubrics, and creates high‑quality long‑trajectory examples distilled from Claude 4.5 Sonnet (Thinking).
  • Trajectory‑splitting SFT: A method that preserves early context while progressively truncating later context and overlapping sub‑trajectories, enabling stable fine‑tuning on ultra‑long sequences.
  • Progressive RL scheduler: A multi‑stage RL regime that gradually extends the allowed “timeout” (the per‑episode reasoning budget, measured in tokens) so the model learns to plan farther ahead without collapsing.
  • Empirical dominance: KLong‑106B outperforms the 1‑trillion‑parameter Kimi K2 Thinking by +11.28 % on PaperBench and shows consistent gains on coding suites such as SWE‑bench Verified and MLE‑bench.

Methodology

  1. Cold‑start SFT – The base model (≈106 B parameters) is first fine‑tuned on a diverse set of short‑to‑medium tasks (question answering, code generation, planning) to give it a solid foundation of tool use, self‑reflection, and instruction following.
  2. Data generation with Research‑Factory
    • Crawl a large corpus of research papers.
    • Automatically extract a task rubric (goal, success criteria, intermediate milestones).
    • Use Claude 4.5 Sonnet to produce step‑by‑step solution trajectories that can be tens of thousands of tokens long.
  3. Trajectory‑splitting SFT
    • Split each ultra‑long trajectory into overlapping windows.
    • Early windows retain the full preceding context; later windows drop older tokens gradually, keeping a “sliding‑window” of relevant history.
    • Train the model on all windows simultaneously, which teaches it to maintain long‑range coherence without hitting GPU memory limits.
  4. Progressive RL
    • Stage 1: RL with a short timeout (e.g., 256 tokens) to reinforce basic planning.
    • Stages 2–N: Incrementally increase the timeout (512 → 1024 → 2048 → …) so the policy learns to allocate resources across longer horizons.
    • Reward function blends rubric‑based task completion, tool‑use efficiency, and self‑critique scores.
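The staged schedule and blended reward described above can be written down concretely. The sketch below is a minimal illustration, not the authors' implementation: the number of stages, the doubling schedule, and the reward weights are assumptions.

```python
# Sketch of a progressive RL schedule: the per-episode token budget
# ("timeout") doubles each stage, and the reward blends three signals.
# Stage count, doubling rule, and weights are illustrative assumptions.

def timeout_schedule(num_stages: int, base_timeout: int = 256) -> list[int]:
    """Token budgets for each RL stage: 256, 512, 1024, ..."""
    return [base_timeout * (2 ** i) for i in range(num_stages)]

def blended_reward(task_completion: float,
                   tool_efficiency: float,
                   self_critique: float,
                   weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted mix of rubric-based task completion, tool-use
    efficiency, and self-critique, each assumed to lie in [0, 1]."""
    w_task, w_tool, w_critique = weights
    return (w_task * task_completion
            + w_tool * tool_efficiency
            + w_critique * self_critique)

# Example: four stages with budgets 256 -> 2048 tokens.
stages = timeout_schedule(4)        # [256, 512, 1024, 2048]
r = blended_reward(0.9, 0.5, 0.7)   # 0.6*0.9 + 0.2*0.5 + 0.2*0.7 ≈ 0.78
```

In practice the stage transitions would be driven by the training loop itself, advancing to the next budget once the policy is stable at the current one.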

Results & Findings

| Benchmark | Relative Δ (KLong‑106B vs. Kimi K2 Thinking, 1 T) |
| --- | --- |
| PaperBench (research‑task suite) | +11.28 % |
| SWE‑bench Verified (software engineering) | +6.4 % |
| MLE‑bench (machine‑learning engineering) | +5.9 % |

  • Generalization: Gains persist even when the evaluation tasks differ from the training distribution (e.g., coding vs. research).
  • Stability: The trajectory‑splitting SFT prevents catastrophic forgetting of early context, a common failure mode when fine‑tuning on very long sequences.
  • Efficiency: KLong achieves these results with a 106 B model—roughly a tenth of the parameters of the competing 1 T model—showcasing a favorable compute‑to‑performance ratio.
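The trajectory‑splitting scheme behind the stability result can be sketched as a simple windowing routine. This is an illustrative reconstruction: the window size, overlap, and the rule that only the earliest windows keep the full prefix are assumptions based on the paper's description.

```python
def split_trajectory(tokens: list[int],
                     window: int = 8,
                     overlap: int = 2,
                     full_prefix_windows: int = 1) -> list[list[int]]:
    """Split an ultra-long trajectory into overlapping training windows.

    The first `full_prefix_windows` windows keep the entire preceding
    context; later windows slide forward, dropping older tokens so each
    training example stays within a fixed memory budget.
    """
    windows = []
    step = window - overlap
    for i, start in enumerate(range(0, max(len(tokens) - overlap, 1), step)):
        end = min(start + window, len(tokens))
        if i < full_prefix_windows:
            windows.append(tokens[:end])       # retain the full prefix
        else:
            windows.append(tokens[start:end])  # sliding window of recent history
        if end == len(tokens):
            break
    return windows

# A 20-token trajectory with window=8 and overlap=2 yields windows whose
# starts advance by 6 tokens; consecutive windows share 2 tokens.
parts = split_trajectory(list(range(20)))
```

Training on all windows exposes the model to every region of the trajectory while keeping each example bounded, which is what prevents the catastrophic forgetting of early context noted above.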

Practical Implications

  • Research assistants: Developers can embed KLong in literature‑review pipelines to automatically generate structured research plans, experiment designs, and even draft sections of papers.
  • Long‑running code generation: In complex software projects (e.g., multi‑module systems, data pipelines), KLong can maintain context across hundreds of files, reducing the need for manual prompt engineering or chunk‑by‑chunk stitching.
  • Tool‑augmented agents: Because KLong learns to invoke external tools (search APIs, code interpreters) over long horizons, it can serve as a more reliable backbone for autonomous agents in DevOps, CI/CD automation, or cloud‑resource provisioning.
  • Open‑source accessibility: The released code and data pipelines let teams replicate the training recipe on their own hardware, enabling custom domain‑specific long‑horizon agents without paying for trillion‑parameter APIs.

Limitations & Future Work

  • Data bias: Training trajectories are distilled from Claude 4.5 Sonnet, so any systematic biases or hallucinations in that model may propagate into KLong.
  • Memory constraints: Although trajectory‑splitting mitigates GPU limits, training still requires high‑end hardware (multiple A100/H100 GPUs) for the 106 B model.
  • Evaluation scope: Benchmarks focus on research and coding tasks; real‑world deployment in domains like legal reasoning or scientific simulation remains untested.
  • Future work: The authors propose expanding the Research‑Factory to non‑paper domains (e.g., design documents), integrating retrieval‑augmented generation for even longer contexts, and exploring curriculum‑learning strategies that adapt the RL timeout to task difficulty rather than following a fixed schedule.

Authors

  • Yue Liu
  • Zhiyuan Hu
  • Flood Sung
  • Jiaheng Zhang
  • Bryan Hooi

Paper Information

  • arXiv ID: 2602.17547v1
  • Categories: cs.AI, cs.CL
  • Published: February 19, 2026