[Paper] CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Published: February 10, 2026

Source: arXiv:2602.10085v1

Overview

The paper CODE‑SHARP introduces a novel approach that lets reinforcement‑learning agents continuously discover and master new skills without hand‑crafted reward functions. By turning large foundation models (e.g., LLMs) into “reward programmers,” the system incrementally expands a hierarchical archive of executable reward programs, enabling a single goal‑conditioned agent to solve progressively longer tasks in the open‑world Craftax game.

Key Contributions

  • Hierarchical Reward Programs (SHARP):
    A directed‑graph archive where each node contains a piece of Python‑style reward code generated by a foundation model, enabling composable and reusable skill definitions (a minimal sketch of such a node follows this list).

  • Continuous Open‑ended Skill Evolution:
    An iterative loop that (1) samples goals, (2) asks the foundation model to propose new reward programs, (3) evaluates them with the agent, and (4) adds successful programs to the archive, effectively “growing” the skill set over time.

  • Goal‑conditioned Agent Trained Solely on Generated Rewards:
    Demonstrates that a single policy can learn to satisfy any reward program in the archive, eliminating the need for task‑specific policies.

  • FM‑based Planner for Long‑Horizon Composition:
    Uses the same foundation model as a high‑level planner that selects and sequences discovered skills to reach complex goals.

  • Empirical Gains in Craftax:
    The composed system outperforms both pretrained agents and expert policies by ~134 % on average across a suite of long‑horizon tasks.

  • Open‑source Release:
    Code, trained models, and demonstration videos are made publicly available.
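
For concreteness, the hierarchical archive can be pictured as a small data structure of executable nodes. The sketch below is illustrative only: the class and method names (RewardProgram, SkillArchive, add, children_of) and the use of exec are assumptions made for this summary, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RewardProgram:
    """One node of the archive: a named, executable reward function.

    `source` holds the FM-generated reward code; `parents` records which
    existing skills this program refines or composes (the directed edges
    of the archive graph, e.g. collect_wood -> craft_plank).
    """
    name: str
    source: str
    parents: list = field(default_factory=list)

    def reward(self, obs) -> float:
        # Run the generated code against the current observation.
        # A real system would sandbox this; plain exec is for illustration.
        namespace = {"obs": obs}
        exec(self.source, namespace)
        return float(namespace["reward"])


class SkillArchive:
    """Directed graph of reward programs, grown over time."""

    def __init__(self):
        self.nodes = {}

    def add(self, program: RewardProgram) -> None:
        self.nodes[program.name] = program

    def children_of(self, name: str) -> list:
        # Programs that refine or compose the given skill.
        return [p for p in self.nodes.values() if name in p.parents]


# Example: a low-level skill and a refinement that builds on it.
archive = SkillArchive()
archive.add(RewardProgram(
    name="collect_wood",
    source="reward = 1.0 if obs['wood'] > 0 else 0.0",
))
archive.add(RewardProgram(
    name="craft_plank",
    source="reward = 1.0 if obs['planks'] > 0 else 0.0",
    parents=["collect_wood"],
))
```

Because each node is just code, skills stay human‑readable and can be recombined by the foundation model into higher‑level programs.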

Methodology

  1. Skill Archive as Executable Code

    • Each skill is represented by a short reward function (e.g., reward = 1 if player collects wood).
    • Skills are stored in a directed graph where edges capture “refinements” or “compositions” (e.g., collect wood → craft plank).
  2. Foundation‑Model‑Driven Generation

    • A large language model receives a textual description of a desired behavior (or the current archive state) and outputs a candidate reward program.
    • The model can also suggest how to combine existing skills into a higher‑level program.
  3. Evaluation Loop

    • The goal‑conditioned RL agent receives a sampled reward program and learns to maximize it using standard RL (e.g., PPO).
    • If the agent reaches a predefined performance threshold on that reward, the program is promoted into the archive (sketches of this loop and the FM planner follow this list).
  4. Planning Over the Archive

    • For a user‑provided high‑level goal (e.g., “build a house”), the same foundation model searches the graph for a sequence of reward programs that, when executed in order, achieve the goal.
    • The agent then follows the plan, executing each skill’s reward function in turn.
  5. Environment

    • Experiments are conducted in Craftax, a procedurally‑generated, Minecraft‑style sandbox where agents must gather resources, craft items, and survive over long horizons.
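
Steps 1–3 above form the open‑ended discovery loop. The sketch below reuses the illustrative SkillArchive from earlier; foundation_model.propose_program, agent.train_on_reward, and the 0.8 promotion threshold are all hypothetical stand‑ins for the FM prompt, the PPO update, and the paper's success criterion.

```python
import random

SUCCESS_THRESHOLD = 0.8  # illustrative promotion criterion


def discovery_iteration(archive, agent, foundation_model, env):
    """One pass of the generate -> train -> promote loop."""
    # (1) Sample a goal: condition on an existing skill (refinement) or none (new skill).
    context = random.choice(list(archive.nodes.values())) if archive.nodes else None

    # (2) Ask the foundation model for a candidate reward program,
    #     given a description of the desired behavior and the archive state.
    candidate = foundation_model.propose_program(context=context, archive=archive)

    # (3) Train the goal-conditioned agent to maximize the candidate reward
    #     (e.g., with PPO) and measure how often it satisfies the program.
    success_rate = agent.train_on_reward(env, candidate.reward)

    # (4) Promote the program into the archive only if the agent masters it.
    if success_rate >= SUCCESS_THRESHOLD:
        archive.add(candidate)
    return success_rate
```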
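
Step 4, planning over the archive, can be sketched in the same spirit. foundation_model.plan, agent.act, and the Gym‑style env.step interface are assumptions for illustration; the paper specifies only that the same foundation model sequences archived skills and that the single goal‑conditioned policy executes them.

```python
def execute_plan(goal_text, archive, agent, foundation_model, env):
    """Compose archived skills to reach a long-horizon goal (e.g., "build a house")."""
    # The FM searches the archive for an ordered sequence of skill names.
    plan = foundation_model.plan(goal=goal_text, skills=list(archive.nodes))

    obs = env.reset()
    for skill_name in plan:                      # e.g. ["collect_wood", "craft_plank", ...]
        program = archive.nodes[skill_name]
        done = False
        while not done:
            # The same goal-conditioned policy is reused for every skill.
            action = agent.act(obs, condition=skill_name)
            obs, _, done, _ = env.step(action)
            if program.reward(obs) > 0:          # sub-goal satisfied, move to the next skill
                break
```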

Results & Findings

| Metric | Baseline (pre‑trained) | Expert Policy | CODE‑SHARP (composed) |
| --- | --- | --- | --- |
| Success rate on 10‑step tasks | 42 % | 55 % | 71 % |
| Success rate on 30‑step tasks | 18 % | 27 % | 64 % |
| Average episode return (normalized) | 1.0 | 1.2 | 2.34 |
| Sample efficiency (episodes to 80 % success) | 1.8 M | 2.3 M | 0.9 M |
  • The agent trained only on generated rewards learned to solve novel, longer‑horizon goals that were never explicitly defined during training.
  • The FM planner could compose low‑level skills into high‑level plans that the single policy executed reliably, demonstrating emergent hierarchical behavior.
  • Ablation studies confirmed that both hierarchical structuring of the archive and LLM‑driven generation are essential; removing either drops performance by >30 %.

Practical Implications

  • Rapid Prototyping of Game AI – Developers can let an LLM suggest reward functions for new game mechanics, automatically expanding an agent’s repertoire without writing custom reward code.
  • Robotics & Automation – In domains where tasks evolve (e.g., warehouse robots encountering new object types), CODE‑SHARP can continuously generate and validate new task definitions.
  • Tooling for RL Researchers – The hierarchical archive offers a reusable library of reward programs that can be shared across projects, reducing duplication of effort.
  • Cost‑Effective Skill Scaling – Because only one goal‑conditioned policy is needed, memory and compute footprints stay low even as the skill set grows.
  • Explainability – Reward programs are human‑readable snippets of code, making it easier to audit what the agent is being asked to optimize.

Limitations & Future Work

  • Reliance on LLM Quality – The diversity and correctness of generated reward programs depend heavily on the underlying foundation model; biased or erroneous outputs can pollute the archive.
  • Scalability of the Archive – As the graph grows, searching for optimal skill compositions may become computationally expensive. Smarter indexing or pruning strategies are needed.
  • Domain Transfer – Experiments are limited to the Craftax sandbox; applying CODE‑SHARP to real‑world robotics or other simulation platforms may require additional safety checks and domain‑specific grounding.
  • Reward Mis‑specification – The framework assumes that the generated reward code aligns with the intended semantics. Future work could incorporate verification steps (e.g., formal methods) to catch mismatches.

Overall, CODE‑SHARP opens a promising path toward truly open‑ended skill discovery, turning large language models into collaborative designers of reinforcement‑learning objectives.

Authors

  • Pierluigi Vito Amadori
  • Richard Bornemann
  • Antoine Cully

Paper Information

| Item | Details |
| --- | --- |
| arXiv ID | 2602.10085v1 |
| Categories | cs.AI |
| Published | February 10, 2026 |