[Paper] CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Published: February 10, 2026

Source: arXiv:2602.10085v1

Overview

The paper CODE‑SHARP introduces a novel approach that lets reinforcement‑learning agents continuously discover and master new skills without hand‑crafted reward functions. By turning large foundation models (e.g., LLMs) into “reward programmers,” the system incrementally expands a hierarchical archive of executable reward programs, enabling a single goal‑conditioned agent to solve progressively longer tasks in the open‑world Craftax game.

Key Contributions

  • Hierarchical Reward Programs (SHARP):
    A directed‑graph archive where each node contains a piece of Python‑style reward code generated by a foundation model, enabling composable and reusable skill definitions (a minimal sketch of such a node follows this list).

  • Continuous Open‑ended Skill Evolution:
    An iterative loop that (1) samples goals, (2) asks the foundation model to propose new reward programs, (3) evaluates them with the agent, and (4) adds successful programs to the archive, effectively “growing” the skill set over time.

  • Goal‑conditioned Agent Trained Solely on Generated Rewards:
    Demonstrates that a single policy can learn to satisfy any reward program in the archive, eliminating the need for task‑specific policies.

  • FM‑based Planner for Long‑Horizon Composition:
    Uses the same foundation model as a high‑level planner that selects and sequences discovered skills to reach complex goals.

  • Empirical Gains in Craftax:
    The composed system outperforms both pretrained agents and expert policies by ~134 % on average across a suite of long‑horizon tasks.

  • Open‑source Release:
    Code, trained models, and demonstration videos are made publicly available.
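
For concreteness, the hierarchical archive can be pictured as a small data structure of executable nodes. The sketch below is illustrative only: the class and method names (RewardProgram, SkillArchive, add, children_of) and the use of exec are assumptions made for this summary, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RewardProgram:
    """One node of the archive: a named, executable reward function.

    `source` holds the FM-generated reward code; `parents` records which
    existing skills this program refines or composes (the directed edges
    of the archive graph, e.g. collect_wood -> craft_plank).
    """
    name: str
    source: str
    parents: list = field(default_factory=list)

    def reward(self, obs) -> float:
        # Run the generated code against the current observation.
        # A real system would sandbox this; plain exec is for illustration.
        namespace = {"obs": obs}
        exec(self.source, namespace)
        return float(namespace["reward"])


class SkillArchive:
    """Directed graph of reward programs, grown over time."""

    def __init__(self):
        self.nodes = {}

    def add(self, program: RewardProgram) -> None:
        self.nodes[program.name] = program

    def children_of(self, name: str) -> list:
        # Programs that refine or compose the given skill.
        return [p for p in self.nodes.values() if name in p.parents]


# Example: a low-level skill and a refinement that builds on it.
archive = SkillArchive()
archive.add(RewardProgram(
    name="collect_wood",
    source="reward = 1.0 if obs['wood'] > 0 else 0.0",
))
archive.add(RewardProgram(
    name="craft_plank",
    source="reward = 1.0 if obs['planks'] > 0 else 0.0",
    parents=["collect_wood"],
))
```

Because each node is just code, skills stay human‑readable and can be recombined by the foundation model into higher‑level programs.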

Methodology

  1. Skill Archive as Executable Code

    • Each skill is represented by a short reward function (e.g., reward = 1 if player collects wood).
    • Skills are stored in a directed graph where edges capture “refinements” or “compositions” (e.g., collect wood → craft plank).
  2. Foundation‑Model‑Driven Generation

    • A large language model receives a textual description of a desired behavior (or the current archive state) and outputs a candidate reward program.
    • The model can also suggest how to combine existing skills into a higher‑level program.
  3. Evaluation Loop

    • The goal‑conditioned RL agent receives a sampled reward program and learns to maximize it using standard RL (e.g., PPO).
    • If the agent reaches a predefined performance threshold on that reward, the program is promoted into the archive (sketches of this loop and the FM planner follow this list).
  4. Planning Over the Archive

    • For a user‑provided high‑level goal (e.g., “build a house”), the same foundation model searches the graph for a sequence of reward programs that, when executed in order, achieve the goal.
    • The agent then follows the plan, executing each skill’s reward function in turn.
  5. Environment

    • Experiments are conducted in Craftax, a procedurally‑generated, Minecraft‑style sandbox where agents must gather resources, craft items, and survive over long horizons.
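
Steps 1–3 above form the open‑ended discovery loop. The sketch below reuses the illustrative SkillArchive from earlier; foundation_model.propose_program, agent.train_on_reward, and the 0.8 promotion threshold are all hypothetical stand‑ins for the FM prompt, the PPO update, and the paper's success criterion.

```python
import random

SUCCESS_THRESHOLD = 0.8  # illustrative promotion criterion


def discovery_iteration(archive, agent, foundation_model, env):
    """One pass of the generate -> train -> promote loop."""
    # (1) Sample a goal: condition on an existing skill (refinement) or none (new skill).
    context = random.choice(list(archive.nodes.values())) if archive.nodes else None

    # (2) Ask the foundation model for a candidate reward program,
    #     given a description of the desired behavior and the archive state.
    candidate = foundation_model.propose_program(context=context, archive=archive)

    # (3) Train the goal-conditioned agent to maximize the candidate reward
    #     (e.g., with PPO) and measure how often it satisfies the program.
    success_rate = agent.train_on_reward(env, candidate.reward)

    # (4) Promote the program into the archive only if the agent masters it.
    if success_rate >= SUCCESS_THRESHOLD:
        archive.add(candidate)
    return success_rate
```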
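
Step 4, planning over the archive, can be sketched in the same spirit. foundation_model.plan, agent.act, and the Gym‑style env.step interface are assumptions for illustration; the paper specifies only that the same foundation model sequences archived skills and that the single goal‑conditioned policy executes them.

```python
def execute_plan(goal_text, archive, agent, foundation_model, env):
    """Compose archived skills to reach a long-horizon goal (e.g., "build a house")."""
    # The FM searches the archive for an ordered sequence of skill names.
    plan = foundation_model.plan(goal=goal_text, skills=list(archive.nodes))

    obs = env.reset()
    for skill_name in plan:                      # e.g. ["collect_wood", "craft_plank", ...]
        program = archive.nodes[skill_name]
        done = False
        while not done:
            # The same goal-conditioned policy is reused for every skill.
            action = agent.act(obs, condition=skill_name)
            obs, _, done, _ = env.step(action)
            if program.reward(obs) > 0:          # sub-goal satisfied, move to the next skill
                break
```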

Results & Findings

| Metric | Baseline (pre‑trained) | Expert Policy | CODE‑SHARP (composed) |
| --- | --- | --- | --- |
| Success rate on 10‑step tasks | 42 % | 55 % | 71 % |
| Success rate on 30‑step tasks | 18 % | 27 % | 64 % |
| Average episode return (normalized) | 1.0 | 1.2 | 2.34 |
| Sample efficiency (episodes to 80 % success) | 1.8 M | 2.3 M | 0.9 M |
  • The agent trained only on generated rewards learned to solve novel, longer‑horizon goals that were never explicitly defined during training.
  • The FM planner could compose low‑level skills into high‑level plans that the single policy executed reliably, demonstrating emergent hierarchical behavior.
  • Ablation studies confirmed that both hierarchical structuring of the archive and LLM‑driven generation are essential; removing either drops performance by >30 %.

Practical Implications

  • Rapid Prototyping of Game AI – Developers can let an LLM suggest reward functions for new game mechanics, automatically expanding an agent’s repertoire without writing custom reward code.
  • Robotics & Automation – In domains where tasks evolve (e.g., warehouse robots encountering new object types), CODE‑SHARP can continuously generate and validate new task definitions.
  • Tooling for RL Researchers – The hierarchical archive offers a reusable library of reward programs that can be shared across projects, reducing duplication of effort.
  • Cost‑Effective Skill Scaling – Because only one goal‑conditioned policy is needed, memory and compute footprints stay low even as the skill set grows.
  • Explainability – Reward programs are human‑readable snippets of code, making it easier to audit what the agent is being asked to optimize.

Limitations & Future Work

  • Reliance on LLM Quality – The diversity and correctness of generated reward programs depend heavily on the underlying foundation model; biased or erroneous outputs can pollute the archive.
  • Scalability of the Archive – As the graph grows, searching for optimal skill compositions may become computationally expensive. Smarter indexing or pruning strategies are needed.
  • Domain Transfer – Experiments are limited to the Craftax sandbox; applying CODE‑SHARP to real‑world robotics or other simulation platforms may require additional safety checks and domain‑specific grounding.
  • Reward Mis‑specification – The framework assumes that the generated reward code aligns with the intended semantics. Future work could incorporate verification steps (e.g., formal methods) to catch mismatches.

Overall, CODE‑SHARP opens a promising path toward truly open‑ended skill discovery, turning large language models into collaborative designers of reinforcement‑learning objectives.

Authors

  • Pierluigi Vito Amadori
  • Richard Bornemann
  • Antoine Cully

Paper Information

| Item | Details |
| --- | --- |
| arXiv ID | 2602.10085v1 |
| Categories | cs.AI |
| Published | February 10, 2026 |