ALTK‑Evolve: On‑the‑Job Learning for AI Agents
Source: Hugging Face Blog
TL;DR
- Most AI agents re‑read transcripts instead of learning principles, so they repeat mistakes and don’t transfer lessons to new situations.
- ALTK‑Evolve turns raw agent trajectories into reusable guidelines.
- In benchmarks, the approach boosted reliability on multi‑step tasks without bloating context, with the largest gains on hard tasks (+14.2 points on AppWorld).
The “eternal intern” problem
Imagine a brilliant line‑cook who has memorized every cookbook but forgets your kitchen every morning. They don’t remember that your oven runs hot, or that regulars like extra salt; they’ll follow a recipe card yet freeze when you’re out of lemons.
That’s most AI agents: excellent at following prompts, poor at accumulating wisdom about their environment. Feeding yesterday’s logs back into the prompt just makes them re‑read history; it doesn’t help them generalize from it.
A junior needs different recipes for vinaigrette and duck à l’orange. A chef learns “acid balances fat” and applies it everywhere. Likewise, reliable agents should distill principles from experience and apply them to new tasks, not just near‑duplicates of old ones.
This long‑term memory subsystem does exactly that: it converts interaction traces into candidate guidelines, filters for quality, and injects only relevant guidance at the moment of action. Agents need principles, not transcripts.
A recent MIT study found that 95% of enterprise AI pilots fail, in large part because agents don’t adapt and learn on the job. ALTK‑Evolve addresses this learning gap, using long‑term episodic memory to help agents reason better.
Solution: long‑term memory with ALTK‑Evolve
Evolve is a memory system for AI agents that helps them improve over time by generating and using guidelines from previous executions.
How it works
The system runs as a continuous loop with two complementary flows:
Downward flow (observation & extraction)
- Capture full agent trajectories (user utterances, thoughts, tool calls, results) in an Interaction Layer (e.g., Langfuse or any OpenTelemetry‑based observability tool).
- Pluggable extractors mine traces for structural patterns and persist them as candidate entities.
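To make the downward flow concrete, here is a minimal sketch of what a trajectory‑mining extractor could look like. Everything in it (the `Step` and `Trajectory` types, the error‑pattern heuristic) is illustrative, not ALTK‑Evolve’s actual API:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str      # "utterance" | "thought" | "tool_call" | "result"
    content: str


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)


def extract_candidates(traj: Trajectory) -> list[str]:
    """Mine one trace for structural patterns and emit candidate guidelines."""
    candidates = []
    for prev, nxt in zip(traj.steps, traj.steps[1:]):
        # Toy heuristic: a tool call followed by an error suggests a
        # "verify preconditions before calling this tool" guideline.
        if prev.kind == "tool_call" and "error" in nxt.content.lower():
            candidates.append(
                f"Verify preconditions before calling '{prev.content}'; "
                f"it previously failed with: {nxt.content[:80]}"
            )
    return candidates
```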
Upward flow (refinement & retrieval)
- A background consolidate‑and‑score job merges duplicates, prunes weak rules, and boosts proven strategies, evolving a high‑quality library of entities such as guidelines, policies, and SOPs.
- Retrieval pulls only the relevant items via the Interaction Layer and injects them back into context at the Application Layer.
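A hypothetical sketch of the upward flow under the same assumptions: near‑duplicate guidelines are merged (boosting their score), weak rules are pruned, and retrieval ranks the survivors against the current task. A real system would likely use embeddings for deduplication and retrieval; string similarity and keyword overlap stand in here to keep the sketch self‑contained:

```python
from collections import Counter
from difflib import SequenceMatcher


def consolidate(candidates: list[str], threshold: float = 0.85) -> Counter:
    """Merge near-duplicate guidelines; recurring rules accumulate score."""
    scores: Counter = Counter()
    for cand in candidates:
        match = next(
            (kept for kept in scores
             if SequenceMatcher(None, kept, cand).ratio() >= threshold),
            None,
        )
        scores[match if match is not None else cand] += 1
    return scores


def prune(scores: Counter, min_score: int = 2) -> Counter:
    """Drop weak rules that never recurred across trajectories."""
    return Counter({rule: s for rule, s in scores.items() if s >= min_score})


def retrieve(scores: Counter, task: str, k: int = 5) -> list[str]:
    """Just-in-time retrieval: rank by score plus naive keyword overlap."""
    task_words = set(task.lower().split())

    def relevance(rule: str) -> float:
        return scores[rule] + len(task_words & set(rule.lower().split()))

    return sorted(scores, key=relevance, reverse=True)[:k]
```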

Why it works
- Teaches judgment: Converts one‑off events into portable strategies that transfer across tasks.
- Controls noise: Scoring keeps memory lean and useful, not a growing junk drawer.
- Progressive disclosure: Retrieval is just‑in‑time rather than stuffing everything into the context.
Results: better reliability, especially on hard tasks
We evaluated the framework on AppWorld, where agents complete realistic multi‑step tasks via APIs (average 9.5 API calls across 1.8 apps). Hard cases require more complex control flow.
A ReAct agent received the task instruction plus the top 5 retrieved guidelines generated on a prior run (train/dev) and was tested on an unseen partition (test‑normal). We report Scenario Goal Completion (SGC), a strict consistency metric requiring success across variants.
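For illustration, the injection step might look like the sketch below; `build_prompt` is hypothetical, not the evaluation harness’s actual code:

```python
def build_prompt(task_instruction: str, guidelines: list[str]) -> str:
    """Prepend the top retrieved guidelines to the agent's task prompt."""
    guidance = "\n".join(f"- {g}" for g in guidelines[:5])  # top 5 only
    return (
        "Apply these guidelines learned from prior runs where relevant:\n"
        f"{guidance}\n\n"
        f"Task: {task_instruction}"
    )
```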
| Difficulty | Baseline SGC | + Memory SGC | Δ (pts) |
|---|---|---|---|
| Easy | 79.0% | 84.2% | +5.2 |
| Medium | 56.2% | 62.5% | +6.3 |
| Hard | 19.1% | 33.3% | +14.2 |
| Aggregate | 50.0% | 58.9% | +8.9 |
Key take‑aways
- Generalization: The agent improves on unseen test‑normal tasks, showing it learns principles rather than memorizing recipes.
- Complexity scaling: The harder the task, the larger the benefit from concise learned guidelines (Hard tasks saw a 74% relative increase in success).
- Consistency: SGC gains exceed raw pass‑rate improvements, reducing flaky behavior across scenario variants.
More details are available in the paper: https://arxiv.org/abs/2603.10600.
Getting started (choose your path)
There are several ways to integrate ALTK‑Evolve into your agent.
No‑code with Claude Code, Codex, and IBM Bob (Lite mode)
Install the plugin into Claude Code:

```bash
claude plugin marketplace add AgentToolkit/altk-evolve
claude plugin install evolve@evolve-marketplace
```

That’s it! The plugin extracts entities from trajectories and stores them as files on your filesystem.
It then uses Claude Code’s hooks to retrieve them automatically.
Prefer to watch instead of read? See the short Evolve‑Lite Claude Code walkthrough video.
Check out the walkthroughs here for examples of how to learn with Claude Code in Lite mode.
Lite mode is easy to test‑drive but has limitations. For example, it doesn’t glean insights across agent sessions or perform consolidation and garbage collection of entities. The low‑code and pro‑code versions below address these limitations.
There are also one‑step integrations with Codex and IBM Bob. Try them out!
Low‑code with a ReAct agent
Add a single altk_evolve.auto import and flip a flag to emit traces to an Arize Phoenix UI, then sync traces to generate improvement guidelines. It works with popular LLM clients and agent frameworks (e.g., OpenAI, LiteLLM, and Hugging Face agents), so you keep your current stack and simply gain visibility.
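A minimal sketch of that setup, assuming the altk_evolve.auto import path described above; the environment‑variable flag name is a guess, and the OpenAI call is just an example of unchanged application code:

```python
import os

# Hypothetical flag; check the docs for the actual way to enable trace emission.
os.environ["ALTK_EVOLVE_TRACING"] = "1"

import altk_evolve.auto  # noqa: F401  # auto-instruments supported LLM clients

# Your existing agent code stays unchanged, e.g. a plain OpenAI call:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize yesterday's tickets"}],
)
# Traces now appear in the Arize Phoenix UI, ready to be synced into guidelines.
```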
Hands‑on examples showcase different framework integrations.
See also the low‑code tracing documentation.
Pro‑code with CUGA
We integrated ALTK‑Evolve directly into CUGA via MCP to create a tight, low‑overhead learning loop.
- Before each run: the `get_guidelines` MCP tool surfaces task‑specific steering and reduces trial‑and‑error.
- After the run: CUGA sends back structured execution traces via `save_trajectory`, allowing Evolve to learn from what actually happened and improve future guidance (see the sketch below).
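A hedged sketch of that loop using the MCP Python SDK. The tool names (`get_guidelines`, `save_trajectory`) come from the integration above; the server command, the argument schemas, and the `agent_run` stub are assumptions:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


def agent_run(task: str, guidelines) -> dict:
    """Placeholder for CUGA's actual execution; returns a structured trace."""
    return {"task": task, "steps": [], "guidelines_used": str(guidelines)}


async def run_with_memory(task: str) -> None:
    params = StdioServerParameters(command="altk-evolve-mcp")  # hypothetical entry point
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Before the run: fetch task-specific steering.
            guidelines = await session.call_tool("get_guidelines", {"task": task})

            trajectory = agent_run(task, guidelines)

            # After the run: feed the structured trace back for learning.
            await session.call_tool("save_trajectory", {"trajectory": trajectory})


asyncio.run(run_with_memory("Pay this month's electricity bill"))
```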
The result is an integration that gets better over time while staying transparent, composable, and easy to adopt.
Prefer a visual tour? Watch the CUGA integration walkthrough video.
Try it & tell us what your agent learned
Your agent shouldn’t wake up as an intern every morning. This approach helps it learn on the job.
If you’re using Claude Code, Codex, or IBM Bob, try it out in minutes and see how it improves your agent.
- Star the repo – it helps others discover the project and directly guides what we build next.
Resources
- Code:
- Docs:
- Quick‑start tutorials:
- Feedback & ideas: Open a GitHub issue or join the discussions. Concrete use cases, benchmarks, and integration requests are especially helpful.