[Paper] Meta Context Engineering via Agentic Skill Evolution
Source: arXiv - 2601.21557v1
Overview
The paper introduces Meta Context Engineering (MCE), a new bi‑level framework that lets large language models (LLMs) automatically improve the way they are prompted at inference time. Instead of relying on hand‑crafted “context engineering” recipes, MCE lets a meta‑agent evolve both the skills for shaping prompts and the prompt artifacts themselves, yielding consistently better performance across a variety of tasks.
Key Contributions
- Bi‑level architecture: separates a meta‑level agent that evolves context‑engineering skills from a base‑level agent that applies those skills to generate and refine prompts.
- Agentic crossover operator: a novel deliberative search that recombines past skills, their executions, and evaluation signals to create stronger engineering strategies.
- Flexible context representation: treats prompts as mutable files and code rather than rigid schemas, enabling richer, task‑specific modifications.
- Broad empirical validation: tested on five heterogeneous domains (e.g., code generation, reasoning, retrieval‑augmented QA) in both offline and online settings.
- Significant performance gains: achieves 5.6 %–53.8 % relative improvement over the strongest existing agentic CE baselines (average +16.9 %).
- Efficiency & transferability: demonstrates lower context‑token usage, faster convergence, and the ability to transfer learned skills across domains.
Methodology
Base‑level agent
A standard LLM that receives a context file (prompt, few‑shot examples, tool definitions, etc.) and produces an answer. After each rollout, the base‑level agent records:
- The context it received
- The generated answer
- A scalar evaluation (e.g., reward model score, task metric)
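The per‑rollout record above can be sketched as a small data structure. This is a minimal illustration; the field names are assumptions, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class RolloutRecord:
    """One base-level rollout: the context the LLM received, the answer
    it produced, and a scalar evaluation of that answer."""
    context: str   # full context file contents (prompt, examples, tools)
    answer: str    # the model's generated answer
    score: float   # reward model score or task metric

# Example: logging a rollout after evaluation
record = RolloutRecord(
    context="System: You are a helpful coder.\nTask: reverse a list.",
    answer="def rev(xs): return xs[::-1]",
    score=1.0,
)
```

These records are what the meta‑level later mines as "execution histories" when recombining skills.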
Meta‑level agent
Operates on a population of CE skills (small programs or templates that manipulate the context file). Each iteration performs:
- Selection – picks high‑scoring skills from previous generations.
- Agentic crossover – combines fragments of two or more parent skills, guided by a deliberative search over their execution histories (what worked, what didn’t).
- Mutation – optionally injects random edits (e.g., adding a new example, tweaking a system message).
Co‑evolution loop
The meta‑agent produces a new skill, the base‑level agent runs it on a batch of tasks, and the resulting performance feeds back as fitness for the next meta‑generation. The context itself is stored as editable files (JSON, Python snippets, markdown) so the skill can add, delete, or rewrite sections programmatically.
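The selection → crossover → mutation → evaluation cycle above can be sketched as a minimal evolutionary program. This is an illustrative stand‑in, not the paper's implementation: the real agentic crossover is an LLM‑driven deliberative search over execution histories, replaced here by a random line‑splice, and all function names are assumptions:

```python
import random

def select(population, k=2):
    """Selection: keep the k highest-fitness (skill, fitness) pairs."""
    return sorted(population, key=lambda sf: sf[1], reverse=True)[:k]

def crossover(parent_a, parent_b):
    """Simplified stand-in for agentic crossover: splice line-level
    fragments of two parent skills at a random cut point."""
    la, lb = parent_a.splitlines(), parent_b.splitlines()
    cut = random.randint(0, min(len(la), len(lb)))
    return "\n".join(la[:cut] + lb[cut:])

def mutate(skill, edits, p=0.3):
    """Mutation: with probability p, append a random edit, e.g. a new
    few-shot example or a tweaked system message."""
    if edits and random.random() < p:
        skill += "\n" + random.choice(edits)
    return skill

def evolve(population, run_batch, generations=10, keep=8):
    """Co-evolution loop: the meta-level produces a child skill, the
    base-level agent scores it on a task batch (run_batch(skill) -> float),
    and that score becomes the child's fitness for the next generation."""
    for _ in range(generations):
        (pa, _), (pb, _) = select(population, k=2)
        child = mutate(crossover(pa, pb), edits=["# add example"])
        population.append((child, run_batch(child)))
        population = select(population, k=keep)   # truncation survival
    return select(population, k=1)[0]             # best (skill, fitness)
```

In the paper the skills are programs that edit context files, and `run_batch` would execute full base‑level rollouts; here a trivial fitness function suffices to exercise the loop.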
Training regimes
- Offline: a fixed dataset of tasks is used; the loop runs until convergence.
- Online: tasks arrive continuously; the meta‑agent updates skills on‑the‑fly, allowing adaptation to distribution shift.
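The two regimes differ only in control flow, which a short sketch makes concrete. Here `meta_step` is a hypothetical stand‑in for one full meta‑generation (select, cross over, mutate, evaluate) that returns the best fitness seen:

```python
def offline(meta_step, tasks, max_iters=100):
    """Offline regime: iterate over a fixed task set until the best
    fitness stops improving (a simple convergence criterion)."""
    best = float("-inf")
    for _ in range(max_iters):
        fitness = meta_step(tasks)
        if fitness <= best:   # no improvement: treat as converged
            break
        best = fitness
    return best

def online(meta_step, task_stream):
    """Online regime: update skills as each task arrives, which lets the
    meta-agent track distribution shift."""
    for task in task_stream:
        meta_step([task])
```

The convergence test here is deliberately crude; the paper does not specify its stopping rule, so this is only a plausible placeholder.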
Results & Findings
| Domain | Baseline (state‑of‑the‑art CE) | MCE (mean relative gain) |
|---|---|---|
| Code synthesis | 42.1 % pass@1 | +23.4 % |
| Multi‑step reasoning | 68.5 % accuracy | +12.7 % |
| Retrieval‑augmented QA | 71.2 % F1 | +16.9 % |
| Dialogue planning | 55.3 % success | +9.8 % |
| Structured data extraction | 61.0 % F1 | +5.6 % |
- Consistency: Gains were observed across all five domains, confirming that the meta‑level evolution is not task‑specific.
- Context efficiency: MCE reduced the average number of tokens per prompt by ~18 % while maintaining higher scores, thanks to smarter pruning and reuse of useful examples.
- Transferability: Skills learned on code synthesis transferred to reasoning tasks with only minor fine‑tuning, indicating a shared “engineering intuition” captured by the meta‑agent.
- Training speed: The bi‑level loop converged 2–3× faster than prior agentic CE methods because crossover leverages already‑validated sub‑skills.
Practical Implications
- Developer‑friendly prompt pipelines: Teams can plug MCE into existing LLM inference services to automatically generate and maintain high‑quality prompts without manual trial‑and‑error.
- Cost reduction: Fewer prompt tokens mean lower API bills, especially for high‑throughput applications (e.g., code assistants, chatbots).
- Rapid adaptation: When a product’s use‑case drifts (new API, updated schema), MCE can evolve new context‑engineering skills on‑the‑fly, shortening the time‑to‑market for feature updates.
- Reusable skill libraries: Organizations can curate a catalog of CE skills (e.g., “add few‑shot examples for arithmetic”, “inject tool definitions for retrieval”) that the meta‑agent recombines, fostering knowledge sharing across teams.
- Better debugging: Because the context is stored as editable files, developers can inspect the exact prompt modifications that led to performance gains, improving transparency compared to opaque, monolithic prompt‑tuning methods.
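Because skills operate on ordinary files with ordinary code, a single CE skill can be as small as the following hypothetical sketch (the JSON layout, file name, and function name are assumptions for illustration, not the paper's format):

```python
import json
import os
import tempfile

def add_fewshot_example(context_path, example):
    """A toy CE skill: append a few-shot example to a JSON context file.
    Because the edit is plain code over a plain file, the exact prompt
    change it made can be inspected or diffed afterwards."""
    with open(context_path) as f:
        ctx = json.load(f)
    ctx.setdefault("few_shot", []).append(example)
    with open(context_path, "w") as f:
        json.dump(ctx, f, indent=2)
    return ctx

# Usage: seed a context file, apply the skill, inspect the edit it made
path = os.path.join(tempfile.mkdtemp(), "context.json")
with open(path, "w") as f:
    json.dump({"system": "You are a careful assistant."}, f)
updated = add_fewshot_example(path, {"q": "2+2?", "a": "4"})
```

Skills like this are what the meta‑agent would catalog and recombine; the file on disk is the audit trail.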
Limitations & Future Work
- Computation overhead: The meta‑level search adds extra compute (especially during crossover) compared with static prompt engineering; scaling to extremely large corpora may require more efficient search heuristics.
- Evaluation dependency: MCE relies on a reliable scalar reward (e.g., a downstream metric or a learned reward model). Noisy or misaligned rewards can misguide skill evolution.
- Skill interpretability: While the generated skills are code‑like, they can become complex after many generations, making manual inspection harder.
Future directions
- Integrate neural architecture search techniques to prune the skill search space.
- Explore multi‑objective optimization (e.g., balancing performance vs. token budget).
- Apply MCE to multimodal models where context includes images or audio.
- Study human‑in‑the‑loop extensions where developers can seed or steer skill evolution with domain expertise.
Authors
- Haoran Ye
- Xuning He
- Vincent Arak
- Haonan Dong
- Guojie Song
Paper Information
- arXiv ID: 2601.21557v1
- Categories: cs.AI, cs.NE
- Published: January 29, 2026