[Paper] Evolutionary Context Search for Automated Skill Acquisition

Published: February 17, 2026
4 min read
Source: arXiv (2602.16113v1)

Overview

Large Language Models (LLMs) still struggle to “learn” new facts after they’re deployed. Even when fresh documents are available, simply retrieving them at inference time often doesn’t translate into better answers. The paper Evolutionary Context Search for Automated Skill Acquisition proposes a new way to automatically discover the most useful pieces of context for a given task—without touching the model’s weights—by treating context selection as an evolutionary optimization problem.

Key Contributions

  • Evolutionary Context Search (ECS): an algorithm that iteratively mutates and recombines sets of retrieved documents, using performance on a tiny development set as the fitness signal.
  • Weight‑free adaptation: ECS requires only forward‑passes (inference calls) to the LLM, avoiding expensive fine‑tuning or gradient‑based updates.
  • Cross‑model transferability: Contexts evolved with one LLM (Gemini‑3‑Flash) were shown to improve unrelated models (Claude Sonnet, DeepSeek), indicating model‑agnostic utility.
  • Empirical gains: On two benchmark suites, ECS lifted BackendBench accuracy by 27 % and τ‑bench airline by 7 % over standard similarity‑based retrieval.
  • Practical recipe: The authors release a lightweight implementation that can be plugged into existing Retrieval‑Augmented Generation pipelines.

Methodology

  1. Initial Retrieval – For each query, a conventional similarity search (e.g., BM25 or dense vector similarity) pulls a pool of candidate documents.
  2. Population Encoding – Each “individual” in the evolutionary population is a binary mask indicating which candidates are included in the final prompt.
  3. Fitness Evaluation – The masked context set is concatenated to the query and fed to the LLM. The model’s answer is scored against a small held‑out dev set (e.g., exact match or BLEU). No gradients are computed; only the output quality matters.
  4. Evolutionary Operators
    • Selection: top‑k individuals survive.
    • Crossover: masks are combined to exchange document selections.
    • Mutation: random flips add or drop documents, encouraging exploration.
  5. Iterative Search – The process repeats for a fixed number of generations (often < 20), after which the best‑performing mask is used as the final context for that query.
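The five steps above can be sketched as a compact loop. This is a hedged, minimal illustration, not the authors' reference implementation: the candidate pool, hyperparameters, and the toy `fitness` function (which stands in for scoring real LLM answers against a dev set) are all assumptions made for the example.

```python
# Minimal sketch of Evolutionary Context Search over binary document masks.
# `fitness` is a toy stand-in for "feed masked context + query to the LLM
# and score the answer on a small dev set" — no gradients, outputs only.
import random

random.seed(0)

CANDIDATES = [f"doc_{i}" for i in range(12)]     # pool from initial retrieval
POP_SIZE, GENERATIONS, TOP_K, MUT_RATE = 8, 10, 4, 0.1

def fitness(mask):
    # Toy signal: docs 2 and 5 are the "useful" pair; extra docs cost tokens.
    score = sum(2 for i in (2, 5) if mask[i])
    score -= 0.1 * sum(mask)                     # token-budget penalty
    return score

def crossover(a, b):
    # Exchange document selections at a random cut point.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(mask):
    # Randomly flip bits to add or drop documents (exploration).
    return [bit ^ (random.random() < MUT_RATE) for bit in mask]

population = [[random.randint(0, 1) for _ in CANDIDATES] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[:TOP_K]               # selection: top-k survive
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(POP_SIZE - TOP_K)]
    population = survivors + children

best = max(population, key=fitness)
context = [doc for doc, keep in zip(CANDIDATES, best) if keep]
print(context)
```

In a real pipeline, the `fitness` call is the expensive part (one LLM inference per individual per dev-set item), which is why the paper keeps the dev set tiny and the generation count low.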

Because the fitness function is directly tied to downstream performance, ECS can surface “non‑obvious” document pairings—e.g., a seemingly irrelevant technical spec that, when combined with a policy doc, resolves a subtle ambiguity.

Results & Findings

| Benchmark | Baseline (similarity‑only) | ECS‑augmented | Relative gain |
| --- | --- | --- | --- |
| BackendBench (accuracy) | 62 % | 79 % | +27 % |
| τ‑bench airline (τ‑score) | 0.71 | 0.76 | +7 % |
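Note that the gain column is relative, not absolute: a quick sanity check on the reported scores reproduces the headline numbers.

```python
# Relative gain = (new - old) / old, expressed as a percentage.
backendbench = round((79 - 62) / 62 * 100)        # → 27
tau_airline = round((0.76 - 0.71) / 0.71 * 100)   # → 7
print(backendbench, tau_airline)
```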

Cross‑model transfer: Contexts evolved for Gemini‑3‑Flash improved Claude Sonnet by 5 % and DeepSeek by 4 % on the same tasks, confirming that the discovered context sets are not tied to a single model’s idiosyncrasies.

Efficiency: The entire search for a batch of 500 queries took roughly 2 × the inference cost of a single baseline run—far cheaper than a full fine‑tune (which can require orders of magnitude more GPU hours).

Practical Implications

  • Plug‑and‑play augmentation: Teams can wrap ECS around any existing Retrieval‑Augmented Generation service (e.g., LangChain, LlamaIndex) to automatically boost answer quality without re‑training.
  • Rapid skill rollout: When a product needs to incorporate new regulations, API docs, or internal policies, ECS can discover the optimal context mix in hours rather than weeks of manual prompt engineering.
  • Cost‑effective scaling: Since ECS only needs inference calls, it can run on the same hardware used for production serving, avoiding the high compute bills of fine‑tuning large models.
  • Model‑agnostic knowledge sharing: Organizations can evolve contexts on a cheaper “seed” model and then reuse them across premium models, maximizing ROI on expensive LLM subscriptions.
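Because an evolved context set is just a selection over documents, reusing it across models reduces to assembling the same prompt for a different client. The sketch below is an assumption about how such a wrapper might look; the document strings, mask, and prompt template are illustrative, and any provider client would slot in where the prompt is consumed.

```python
# Reusing an evolved binary mask to build a prompt for any downstream model.
def assemble_prompt(query, candidates, mask):
    """Keep only the documents the evolved mask selects, then format a prompt."""
    context = "\n\n".join(doc for doc, keep in zip(candidates, mask) if keep)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

candidates = [
    "Policy doc: refunds are permitted within 24 hours of booking...",
    "API spec: POST /booking creates a reservation...",
    "FAQ: checked baggage limits by fare class...",
]
mask = [1, 0, 1]   # evolved on a cheaper "seed" model, reused as-is
prompt = assemble_prompt("Can I cancel a ticket booked yesterday?", candidates, mask)
print(prompt)
```

The same `prompt` string can then be sent to Claude Sonnet, DeepSeek, or any other model, which is exactly the cross-model reuse the paper reports.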

Limitations & Future Work

  • Dependency on a dev set: ECS needs a small, representative validation set to guide the search; constructing this set can be non‑trivial for niche domains.
  • Search overhead: Although far cheaper than fine‑tuning, the evolutionary loop still adds latency—making it more suitable for batch or near‑real‑time scenarios rather than ultra‑low‑latency APIs.
  • Scalability of candidate pool: The method assumes a manageable number of retrieved documents (≈10‑20). Scaling to hundreds of candidates may require smarter sampling or hierarchical evolution.
  • Future work: the authors suggest integrating reinforcement learning to replace the static dev set, exploring multi‑objective fitness (e.g., balancing accuracy against token budget), and extending ECS to multimodal contexts (images, code snippets).

Authors

  • Qi Sun
  • Stefan Nielsen
  • Rio Yokota
  • Yujin Tang

Paper Information

  • arXiv ID: 2602.16113v1
  • Categories: cs.NE, cs.LG
  • Published: February 18, 2026