[Paper] SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Published: February 4, 2026 at 12:58 PM EST
4 min read
Source: arXiv - 2602.04811v1

Overview

The paper SE‑Bench: Benchmarking Self‑Evolution with Knowledge Internalization tackles a core challenge for modern AI agents: can they truly learn new tools or libraries on the fly and later use that knowledge without any external help? By turning the familiar NumPy package into a “mystery” library with scrambled function names, the authors create a clean testbed where success hinges entirely on whether the model has internalized the new API during training.

Key Contributions

  • SE‑Bench diagnostic suite – a reproducible environment that hides a NumPy‑like library behind random identifiers, forcing agents to memorize the API rather than rely on pre‑existing knowledge.
  • Open‑Book Paradox discovery – showing that providing reference docs during training actually harms long‑term retention; “closed‑book” training forces the model to compress the knowledge into its weights.
  • RL Gap analysis – empirical evidence that standard PPO‑style reinforcement learning struggles to fully internalize new knowledge because of clipping and negative‑gradient effects.
  • Self‑Play + Supervised Fine‑Tuning (SFT) pipeline – demonstrates that agents can generate their own noisy tasks and still learn the hidden API, provided they are fine‑tuned with supervised data rather than pure RL.
  • Open‑source release – code, data, and evaluation scripts are publicly available, enabling the community to benchmark future self‑evolution methods.

Methodology

  1. Obfuscation of NumPy – The authors take the NumPy library, rename every function/class with a random token (e.g., np.mean becomes something like zq_42), and scramble the accompanying documentation (a minimal renaming sketch follows this list).
  2. Training regimes
    • Closed‑Book Training: The model never sees the documentation while learning; it must infer the API solely from interaction traces.
    • Open‑Book Training: The model has access to the docs during fine‑tuning (used as a baseline).
    • Reinforcement Learning: PPO is applied with a binary reward (correct vs. incorrect solution).
    • Self‑Play: The model generates its own coding prompts, solves them, and then is fine‑tuned on the generated pairs.
  3. Evaluation – After training, the model receives simple coding problems (e.g., “compute the sum of an array”) but no documentation. Success means the model can call the obfuscated functions correctly, demonstrating that the knowledge is truly stored in its parameters (a checker sketch follows the next paragraph).
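Below is a minimal sketch of what the obfuscation in step 1 could look like. The mystlib module name, the zq_ token format, and the build_obfuscated_module helper are illustrative assumptions, not the authors' actual tooling.

```python
import random
import string
import types

import numpy as np


def random_token(rng, length=6):
    """Produce a scrambled identifier such as 'zq_4k2x' to stand in for a real name."""
    return "zq_" + "".join(rng.choices(string.ascii_lowercase + string.digits, k=length))


def build_obfuscated_module(source=np, seed=0):
    """Re-export a library's public callables under random names.

    Returns the obfuscated module plus the secret name mapping (hidden from the agent).
    """
    rng = random.Random(seed)
    hidden = types.ModuleType("mystlib")
    mapping = {}
    for name in dir(source):
        if name.startswith("_"):
            continue
        try:
            obj = getattr(source, name)
        except AttributeError:          # deprecated/removed aliases in newer NumPy versions
            continue
        if callable(obj):
            alias = random_token(rng)
            setattr(hidden, alias, obj)
            mapping[name] = alias
    return hidden, mapping


# Example: np.mean is now reachable only through its scrambled alias.
mystlib, name_map = build_obfuscated_module()
print(name_map["mean"], getattr(mystlib, name_map["mean"])([1, 2, 3]))
```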

The setup controls for two confounding factors that plague existing benchmarks: (a) prior exposure to the same API in pre‑training data, and (b) task difficulty that could mask a model’s recall ability.
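Continuing the sketch above, the evaluation in step 3 reduces to running the agent's program against the obfuscated module with nothing else in context and checking the output. The score_solution helper and the result-variable convention are assumptions for illustration; mystlib and name_map are reused from the previous block.

```python
import numpy as np


def score_solution(agent_code, mystlib, test_input, expected):
    """Binary reward: 1 if the agent's program runs and reproduces the reference output, else 0.

    The agent sees only the obfuscated module; no documentation is provided at test time.
    """
    namespace = {"mystlib": mystlib, "data": test_input}
    try:
        exec(agent_code, namespace)                   # the program must define `result`
        return int(np.allclose(namespace["result"], expected))
    except Exception:                                 # wrong alias, bad signature, runtime crash, ...
        return 0


# Toy task: "compute the sum of an array" using only scrambled names.
# An agent that has internalized the mapping would emit something like:
agent_code = f"result = mystlib.{name_map['sum']}(data)"
print(score_solution(agent_code, mystlib, [1.0, 2.0, 3.0], 6.0))  # -> 1
```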

Results & Findings

Training Mode               Success Rate on Closed‑Book Test*
Open‑Book (docs visible)    ≈ 30 %
Closed‑Book (no docs)       ≈ 78 %
PPO RL                      ≈ 45 %
Self‑Play + SFT             ≈ 73 %

*Success = producing a syntactically correct program that runs and yields the expected output.

  • Open‑Book Paradox: Access to docs during fine‑tuning reduces the model’s ability to internalize the API, likely because the model learns to lean on the external reference instead of compressing the mapping into its weights.
  • RL Gap: PPO’s clipping mechanism and the sparse binary reward keep the gradient signal from fully encoding the mapping between the random identifiers and their semantics (a small illustration of the clipped objective follows this list).
  • Self‑Play Viability: When the model creates its own training examples and then undergoes supervised fine‑tuning, it reaches performance close to closed‑book training, showing that self‑generated data can be a viable curriculum for knowledge internalization.
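To make the RL-gap argument more concrete, here is a small numerical illustration (a sketch, not the paper's training code) of the standard clipped PPO surrogate: once a token's probability ratio has already moved past the clip threshold in the direction the advantage pushes it, the surrogate goes flat and that token contributes zero gradient, so with a sparse binary reward many updates carry little signal about the identifier‑to‑semantics mapping.

```python
import numpy as np


def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A), with r = exp(logp_new - logp_old)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)


def grad_wrt_logp_new(logp_new, logp_old, advantage, eps=0.2):
    """d(objective)/d(logp_new): exactly zero whenever the clipped branch is the active minimum."""
    ratio = np.exp(logp_new - logp_old)
    unclipped_active = (ratio * advantage) <= (np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    return np.where(unclipped_active, ratio * advantage, 0.0)


# With a binary reward the advantages are coarse; tokens whose ratio already exceeds
# the clip band (here > 1.2) receive exactly zero gradient, stalling internalization.
logp_old = np.log(np.array([0.10, 0.10, 0.10]))
logp_new = np.log(np.array([0.11, 0.15, 0.30]))   # ratios: 1.1, 1.5, 3.0
advantage = np.array([1.0, 1.0, 1.0])             # episodes where the program was correct
print(grad_wrt_logp_new(logp_new, logp_old, advantage))  # ≈ [1.1, 0.0, 0.0]
```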

Practical Implications

  • Tool‑aware assistants – Future code‑generation assistants (e.g., Copilot‑style models) could be trained to learn new libraries on the fly, enabling rapid adaptation to proprietary or emerging APIs without re‑training on massive corpora.
  • On‑device learning – Closed‑book training suggests that lightweight fine‑tuning on a user’s device (with no internet access) can embed new capabilities directly into the model, improving privacy and latency.
  • Continuous deployment pipelines – Companies can feed a model a short “knowledge dump” (e.g., internal SDK docs) and expect the model to internalize it, reducing the need for manual prompt engineering or external doc look‑ups.
  • Self‑play curricula for LLMs – The success of self‑generated tasks plus SFT opens a path to autonomous curriculum learning where a model continuously expands its toolbox without human‑written examples.
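For the last point, a hedged sketch of what one self‑play plus SFT round could look like; propose_task, solve, and the downstream supervised_finetune call are placeholder interfaces rather than the paper's implementation, and score can simply wrap the score_solution checker from the Methodology sketch.

```python
from typing import Callable, List, Tuple


def self_play_round(
    propose_task: Callable[[], Tuple[str, object, object]],  # -> (prompt, test_input, expected)
    solve: Callable[[str], str],                             # the agent writes a program for the prompt
    score: Callable[[str, object, object], int],             # binary verifier, e.g. a wrapper around score_solution
    n_tasks: int = 1000,
) -> List[Tuple[str, str]]:
    """One self-play round: the agent invents tasks against the hidden library, solves them,
    and only verified (prompt, program) pairs are kept as supervised fine-tuning data."""
    sft_pairs = []
    for _ in range(n_tasks):
        prompt, test_input, expected = propose_task()
        program = solve(prompt)
        if score(program, test_input, expected) == 1:        # keep only programs that actually run correctly
            sft_pairs.append((prompt, program))
    return sft_pairs


# The surviving pairs then feed an ordinary SFT step (hypothetical trainer call):
#   model = supervised_finetune(model, self_play_round(propose_task, solve, score))
# Repeating the loop gradually compresses the scrambled API into the model's weights.
```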

In short, SE‑Bench provides a concrete yardstick for measuring whether an AI system truly learns versus merely looks up information—a distinction that matters for reliability, security, and compliance in production AI systems.

Limitations & Future Work

  • Synthetic nature of the benchmark – The obfuscated NumPy library is still a relatively simple, well‑structured API; real‑world SDKs may have more irregular naming, side‑effects, and versioning quirks.
  • Scale of models – Experiments were run on medium‑sized language models; it remains unclear how the findings translate to multi‑billion‑parameter LLMs.
  • Reward design – The binary reward in RL is coarse; richer, graded rewards (e.g., partial credit for correct function usage) might narrow the RL gap.
  • Long‑term retention – The study focuses on a single fine‑tuning episode; future work could examine catastrophic forgetting when multiple new APIs are introduced sequentially.

The authors plan to extend SE‑Bench to multi‑library scenarios, explore curriculum‑aware RL algorithms, and test the pipeline on larger, commercially deployed models.

Authors

  • Jiarui Yuan
  • Tailin Jin
  • Weize Chen
  • Zeyuan Liu
  • Zhiyuan Liu
  • Maosong Sun

Paper Information

  • arXiv ID: 2602.04811v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 4, 2026
