ALTK‑Evolve: On‑the‑Job Learning for AI Agents
Source: Hugging Face Blog
TL;DR
- Most AI agents re‑read transcripts instead of learning principles, so they repeat mistakes and don’t transfer lessons to new situations.
- ALTK‑Evolve turns raw agent trajectories into reusable guidelines.
- In benchmarks, the approach boosted reliability on multi‑step tasks without bloating context, with the largest gains on hard tasks (+14.2 points on AppWorld).
The “eternal intern” problem
Imagine a brilliant line‑cook who has memorized every cookbook but forgets your kitchen every morning. They don’t remember that your oven runs hot, or that regulars like extra salt; they’ll follow a recipe card yet freeze when you’re out of lemons.
That’s most AI agents: excellent at following prompts, poor at accumulating wisdom about their environment. Feeding yesterday’s logs back into the prompt just makes them re‑read history; it doesn’t help them generalize from it.
A junior needs different recipes for vinaigrette and duck à l’orange. A chef learns “acid balances fat” and applies it everywhere. Likewise, reliable agents should distill principles from experience and apply them to new tasks, not just near‑duplicates of old ones.
This long‑term memory subsystem does exactly that: it converts interaction traces into candidate guidelines, filters for quality, and injects only relevant guidance at the moment of action. Agents need principles, not transcripts.
A recent MIT study found that 95% of enterprise AI pilots fail, in large part because agents don’t adapt and learn on the job. ALTK‑Evolve addresses this learning gap, using long‑term episodic memory to help agents reason better.
Solution: long‑term memory with ALTK‑Evolve
Evolve is a memory system for AI agents that helps them improve over time by generating and using guidelines from previous executions.
How it works
The system runs as a continuous loop with two complementary flows:
Downward flow (observation & extraction)
- Capture full agent trajectories (user utterances, thoughts, tool calls, results) in an Interaction Layer (e.g., Langfuse or any OpenTelemetry‑based observability tool).
- Pluggable extractors mine traces for structural patterns and persist them as candidate entities.
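To make the downward flow concrete, here is a minimal sketch of what a trajectory‑mining extractor could look like. Everything in it (the `Step` and `Trajectory` types, the error‑pattern heuristic) is illustrative, not ALTK‑Evolve’s actual API:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str      # "utterance" | "thought" | "tool_call" | "result"
    content: str


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)


def extract_candidates(traj: Trajectory) -> list[str]:
    """Mine one trace for structural patterns and emit candidate guidelines."""
    candidates = []
    for prev, nxt in zip(traj.steps, traj.steps[1:]):
        # Toy heuristic: a tool call followed by an error suggests a
        # "verify preconditions before calling this tool" guideline.
        if prev.kind == "tool_call" and "error" in nxt.content.lower():
            candidates.append(
                f"Verify preconditions before calling '{prev.content}'; "
                f"it previously failed with: {nxt.content[:80]}"
            )
    return candidates
```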
Upward flow (refinement & retrieval)
- A background consolidate‑and‑score job merges duplicates, prunes weak rules, and boosts proven strategies, evolving a high‑quality library of entities such as guidelines, policies, and SOPs.
- Retrieval pulls only the relevant items via the Interaction Layer and injects them back into context at the Application Layer.
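A hypothetical sketch of the upward flow under the same assumptions: near‑duplicate guidelines are merged (boosting their score), weak rules are pruned, and retrieval ranks the survivors against the current task. A real system would likely use embeddings for deduplication and retrieval; string similarity and keyword overlap stand in here to keep the sketch self‑contained:

```python
from collections import Counter
from difflib import SequenceMatcher


def consolidate(candidates: list[str], threshold: float = 0.85) -> Counter:
    """Merge near-duplicate guidelines; recurring rules accumulate score."""
    scores: Counter = Counter()
    for cand in candidates:
        match = next(
            (kept for kept in scores
             if SequenceMatcher(None, kept, cand).ratio() >= threshold),
            None,
        )
        scores[match if match is not None else cand] += 1
    return scores


def prune(scores: Counter, min_score: int = 2) -> Counter:
    """Drop weak rules that never recurred across trajectories."""
    return Counter({rule: s for rule, s in scores.items() if s >= min_score})


def retrieve(scores: Counter, task: str, k: int = 5) -> list[str]:
    """Just-in-time retrieval: rank by score plus naive keyword overlap."""
    task_words = set(task.lower().split())

    def relevance(rule: str) -> float:
        return scores[rule] + len(task_words & set(rule.lower().split()))

    return sorted(scores, key=relevance, reverse=True)[:k]
```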

Why it works
- Teaches judgment: Converts one‑off events into portable strategies that transfer across tasks.
- Controls noise: Scoring keeps memory lean and useful, not a growing junk drawer.
- Progressive disclosure: Retrieval is just‑in‑time rather than stuffing everything into the context.
Results: better reliability, especially on hard tasks
We evaluated the framework on AppWorld, where agents complete realistic multi‑step tasks via APIs (average 9.5 API calls across 1.8 apps). Hard cases require more complex control flow.
A ReAct agent received the task instruction plus the top 5 retrieved guidelines generated on a prior run (train/dev) and was tested on an unseen partition (test‑normal). We report Scenario Goal Completion (SGC), a strict consistency metric requiring success across variants.
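For illustration, the injection step might look like the sketch below; `build_prompt` is hypothetical, not the evaluation harness’s actual code:

```python
def build_prompt(task_instruction: str, guidelines: list[str]) -> str:
    """Prepend the top retrieved guidelines to the agent's task prompt."""
    guidance = "\n".join(f"- {g}" for g in guidelines[:5])  # top 5 only
    return (
        "Apply these guidelines learned from prior runs where relevant:\n"
        f"{guidance}\n\n"
        f"Task: {task_instruction}"
    )
```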
| Difficulty | Baseline SGC | + Memory SGC | Δ (pts) |
|---|---|---|---|
| Easy | 79.0% | 84.2% | +5.2 |
| Medium | 56.2% | 62.5% | +6.3 |
| Hard | 19.1% | 33.3% | +14.2 |
| Aggregate | 50.0% | 58.9% | +8.9 |
Key take‑aways
- Generalization: The agent improves on unseen test‑normal tasks, showing it learns principles rather than memorizing recipes.
- Complexity scaling: The harder the task, the larger the benefit from concise learned guidelines (Hard tasks saw a 74% relative increase in success).
- Consistency: SGC gains exceed raw pass‑rate improvements, reducing flaky behavior across scenario variants.
More details are available in the paper: https://arxiv.org/abs/2603.10600.
Getting started (choose your path)
There are several ways to integrate ALTK‑Evolve into your agent.
No‑code with Claude Code, Codex, and IBM Bob (Lite mode)
Install the plugin into Claude Code:

```bash
claude plugin marketplace add AgentToolkit/altk-evolve
claude plugin install evolve@evolve-marketplace
```

That’s it! The plugin extracts entities from trajectories and stores them as files on your filesystem.
It then uses Claude Code’s hooks to retrieve them automatically.
Prefer to watch instead of read? See the short Evolve‑Lite Claude Code walkthrough video.
Check out the walkthroughs here for examples of how to learn with Claude Code in Lite mode.
Lite mode is easy to test‑drive but has limitations. For example, it doesn’t glean insights across agent sessions or perform consolidation and garbage collection of entities. The low‑code and pro‑code versions below address these limitations.
There are also one‑step integrations with Codex and IBM Bob. Try them out!
Low‑code with a ReAct agent
Add a single altk_evolve.auto import and flip a flag to emit traces to an Arize Phoenix UI, then sync traces to generate improvement guidelines. It works with popular LLM clients and agent frameworks (e.g., OpenAI, LiteLLM, and Hugging Face agents), so you keep your current stack and simply gain visibility.
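A minimal sketch of that setup, assuming the altk_evolve.auto import path described above; the environment‑variable flag name is a guess, and the OpenAI call is just an example of unchanged application code:

```python
import os

# Hypothetical flag; check the docs for the actual way to enable trace emission.
os.environ["ALTK_EVOLVE_TRACING"] = "1"

import altk_evolve.auto  # noqa: F401  # auto-instruments supported LLM clients

# Your existing agent code stays unchanged, e.g. a plain OpenAI call:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize yesterday's tickets"}],
)
# Traces now appear in the Arize Phoenix UI, ready to be synced into guidelines.
```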
Hands‑on examples showcase different framework integrations.
See also the low‑code tracing documentation.
Pro‑code with CUGA
We integrated ALTK‑Evolve directly into CUGA via MCP to create a tight, low‑overhead learning loop.
- Before each run: the `get_guidelines` MCP tool surfaces task‑specific steering and reduces trial‑and‑error.
- After the run: CUGA sends back structured execution traces via `save_trajectory`, allowing Evolve to learn from what actually happened and improve future guidance (see the sketch below).
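A hedged sketch of that loop using the MCP Python SDK. The tool names (`get_guidelines`, `save_trajectory`) come from the integration above; the server command, the argument schemas, and the `agent_run` stub are assumptions:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


def agent_run(task: str, guidelines) -> dict:
    """Placeholder for CUGA's actual execution; returns a structured trace."""
    return {"task": task, "steps": [], "guidelines_used": str(guidelines)}


async def run_with_memory(task: str) -> None:
    params = StdioServerParameters(command="altk-evolve-mcp")  # hypothetical entry point
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Before the run: fetch task-specific steering.
            guidelines = await session.call_tool("get_guidelines", {"task": task})

            trajectory = agent_run(task, guidelines)

            # After the run: feed the structured trace back for learning.
            await session.call_tool("save_trajectory", {"trajectory": trajectory})


asyncio.run(run_with_memory("Pay this month's electricity bill"))
```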
The result is an integration that gets better over time while staying transparent, composable, and easy to adopt.
Prefer a visual tour? Watch the CUGA integration walkthrough video.
Try it & tell us what your agent learned
Your agent shouldn’t wake up as an intern every morning. This approach helps it learn on the job.
If you’re using Claude Code, Codex, or IBM Bob, try it out in minutes and see how it improves your agent.
- Star the repo – it helps others discover the project and directly guides what we build next.
Resources
- Code:
- Docs:
- Quick‑start tutorials:
- Feedback & ideas: Open a GitHub issue or join the discussions. Concrete use cases, benchmarks, and integration requests are especially helpful.