[Paper] Interactive LLM-assisted Curriculum Learning for Multi-Task Evolutionary Policy Search

Published: February 11, 2026 at 09:21 AM EST
5 min read
Source: arXiv - 2602.10891v1

Overview

The paper introduces a novel framework that lets large language models (LLMs) act as interactive curriculum designers for multi‑task evolutionary policy search. By feeding the LLM real‑time feedback from the optimizer, the system can dynamically generate training scenarios that steadily push a robot’s policy toward better generalisation—something that previously required hand‑crafted curricula or static, offline LLM suggestions.

Key Contributions

  • Interactive LLM‑assisted curriculum generation – a loop where the LLM receives live metrics, plots, and visualisations from the evolutionary algorithm and instantly proposes new training cases.
  • Feedback‑modality study – systematic comparison of numeric‑only feedback vs. multimodal feedback (numeric + progression plots + behavior visualisations) on the LLM’s ability to craft useful curricula.
  • Empirical validation on a 2‑D robot navigation task – using genetic programming as the policy optimiser, the authors benchmark interactive curricula against static LLM‑generated and human‑expert curricula.
  • Performance parity with expert‑designed curricula – multimodal interactive feedback yields results on par with hand‑crafted curricula, demonstrating that LLMs can approximate domain expertise.
  • Open‑ended design recipe – the framework is agnostic to the underlying optimisation algorithm, suggesting easy adaptation to other embodied‑AI or evolutionary‑robotics problems.

Methodology

  1. Problem setting – Multi‑task policy search via evolutionary algorithms (genetic programming) where each “task” is a navigation scenario in a 2‑D world.
  2. Curriculum loop
    • The evolutionary optimizer runs for a short interval and produces feedback (e.g., success rate, fitness curves, trajectory snapshots).
    • This feedback is packaged and sent to a large language model (e.g., GPT‑4).
    • The LLM, prompted with a description of the current policy performance and the desired learning goal, generates new training cases (obstacle layouts, start/goal positions, difficulty parameters).
    • The new cases are fed back into the optimizer, and the cycle repeats (a minimal sketch of this loop follows the list below).
  3. Feedback modalities
    • Numeric only: raw scores and scalar metrics.
    • Numeric + plots: fitness curves, success‑rate over generations.
    • Numeric + plots + visualisations: trajectory videos or rendered snapshots of robot behaviour.
  4. Baselines
    • Static LLM curriculum: a one‑shot LLM generation before optimisation starts.
    • Expert curriculum: manually designed progression of tasks by a robotics researcher.
  5. Evaluation metrics – final success rate on a held‑out test set, learning speed (generations to reach a success threshold), and curriculum “smoothness” (how gradually difficulty increases); the second sketch below illustrates these metrics.
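
The loop in step 2 is compact enough to sketch. Below is a minimal, illustrative Python version, not the authors’ code: `Task`, `run_optimizer`, and the `llm` callable are all hypothetical stand‑ins, and the optimizer is stubbed with placeholder metrics.

```python
"""Minimal sketch of the interactive curriculum loop (step 2 above).
Task, run_optimizer, and the llm callable are hypothetical stand-ins,
not the paper's implementation."""
import json
import random
from dataclasses import dataclass


@dataclass
class Task:
    """One 2-D navigation scenario proposed as a training case."""
    start: tuple
    goal: tuple
    obstacle_density: float  # difficulty knob the LLM can turn


def run_optimizer(population, tasks, generations=20):
    """Stand-in for a short genetic-programming run on the current tasks.
    Returns the evolved population plus feedback for the LLM."""
    fitness_curve = [random.random() for _ in range(generations)]  # placeholder
    feedback = {
        "success_rate": fitness_curve[-1],
        "fitness_curve": fitness_curve,
        # In the multimodal condition, progression plots and rendered
        # trajectories would also be attached for a vision-capable model.
    }
    return population, feedback


def ask_llm_for_tasks(llm, feedback, n_tasks=5):
    """Package optimizer feedback into a prompt and parse proposed tasks."""
    prompt = (
        "You design training curricula for a 2-D navigation policy.\n"
        f"Latest optimizer feedback: {json.dumps(feedback)}\n"
        f"Reply with a JSON list of {n_tasks} tasks, each with keys "
        "start, goal, obstacle_density."
    )
    reply = llm(prompt)  # any text-in/text-out LLM client works here
    return [Task(**t) for t in json.loads(reply)]


def curriculum_loop(llm, rounds=10):
    population = []
    tasks = [Task(start=(0.0, 0.0), goal=(5.0, 5.0), obstacle_density=0.1)]
    for _ in range(rounds):
        population, feedback = run_optimizer(population, tasks)
        tasks = ask_llm_for_tasks(llm, feedback)  # LLM adapts the curriculum
    return population
```

The metrics from step 5 can be made concrete in the same spirit. The smoothness formula is not spelled out here, so the second helper assumes one plausible reading: the mean absolute jump in difficulty between consecutive rounds.

```python
def generations_to_threshold(success_by_generation, threshold=0.8):
    """Learning speed: first generation whose success rate reaches the threshold."""
    for gen, rate in enumerate(success_by_generation):
        if rate >= threshold:
            return gen
    return None  # threshold never reached


def curriculum_smoothness(difficulties):
    """Assumed smoothness proxy: mean absolute jump in difficulty between
    consecutive rounds (smaller = a more gradual ramp)."""
    jumps = [abs(b - a) for a, b in zip(difficulties, difficulties[1:])]
    return sum(jumps) / len(jumps) if jumps else 0.0
```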

Results & Findings

| Curriculum type | Test‑set success ↑ | Generations to 80% success ↓ | Qualitative notes |
| --- | --- | --- | --- |
| Expert‑designed | 92% | 45 | Smooth difficulty ramp, intuitive obstacles |
| Interactive (multimodal) | 90% | 48 | LLM quickly learns to increase obstacle density after seeing failure patterns |
| Interactive (numeric‑only) | 78% | 62 | Curriculum becomes erratic; LLM lacks visual context |
| Static LLM | 71% | 70 | No adaptation to optimizer’s actual struggles |
| No curriculum (random tasks) | 55% | 120 | Policy fails to generalise |

  • Multimodal feedback (numbers + plots + visuals) gave the LLM enough context to propose curricula that are almost as effective as those crafted by human experts.
  • Numeric‑only feedback led to noisy curricula, confirming that visual cues are crucial for the LLM to understand the shape of the problem space.
  • The interactive loop consistently outperformed the static LLM baseline, highlighting the value of online adaptation.

Practical Implications

  • Rapid prototyping of training regimes – Developers can replace time‑consuming manual curriculum design with an LLM that tailors tasks on the fly, cutting down iteration cycles for embodied‑AI projects.
  • Scalable to diverse domains – Because feedback is conveyed in natural language and generic plots, the same pattern can be applied to simulated drones, manipulators, or even non‑robotic optimisation problems (e.g., game‑level generation).
  • Lower barrier to entry – Small teams without deep domain expertise can achieve near‑expert performance by leveraging an LLM as a “curriculum consultant.”
  • Tooling opportunities – IDE‑style plugins could surface the LLM‑generated tasks directly in simulation environments (e.g., Unity, ROS Gazebo), allowing developers to inspect and approve curricula before deployment.
  • Cost‑effective training – By focusing the evolutionary search on progressively harder yet tractable tasks, compute budgets shrink, which is attractive for cloud‑based RL pipelines.

Limitations & Future Work

  • Domain specificity of prompts – The LLM still needs carefully crafted prompts and a well‑structured feedback format; a generic “plug‑and‑play” solution is not yet available.
  • Scalability to high‑dimensional tasks – The study used a simple 2‑D navigation benchmark; it remains unclear how the approach scales to 3‑D robotics or tasks with richer sensory inputs.
  • Reliance on visualisation quality – Poorly rendered trajectories can mislead the LLM; robust visual pipelines are required.
  • Potential for hallucination – The LLM may suggest impossible or unsafe scenarios; a verification layer is needed before tasks reach the optimizer (a minimal filter of this kind is sketched after this list).
  • Future directions suggested by the authors include extending the framework to other evolutionary algorithms (CMA‑ES, NEAT), testing on real‑world robots, and exploring reinforcement‑learning‑style reward shaping as an additional feedback channel.
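
As a rough illustration of such a verification layer, the filter below applies simple feasibility checks to LLM‑proposed tasks before they reach the optimizer. It reuses the hypothetical `Task` fields from the methodology sketch, and all bounds are arbitrary assumptions.

```python
# Hypothetical sanity filter for LLM-proposed tasks, in the spirit of the
# verification layer discussed above. Reuses the Task fields (start, goal,
# obstacle_density) from the earlier sketch; all bounds are illustrative.

WORLD_SIZE = 10.0  # assumed side length of the 2-D world


def is_valid(task):
    """Reject out-of-bounds, degenerate, or likely-unsolvable scenarios."""
    in_bounds = all(0.0 <= c <= WORLD_SIZE for c in (*task.start, *task.goal))
    distinct_endpoints = tuple(task.start) != tuple(task.goal)
    solvable_density = 0.0 <= task.obstacle_density <= 0.8  # leave room for a path
    return in_bounds and distinct_endpoints and solvable_density


def filter_tasks(tasks):
    """Keep only verified tasks; an empty result should trigger a re-prompt."""
    return [t for t in tasks if is_valid(t)]
```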

Authors

  • Berfin Sakallioglu
  • Giorgia Nadizar
  • Eric Medvet

Paper Information

  • arXiv ID: 2602.10891v1
  • Categories: cs.NE, cs.AI
  • Published: February 11, 2026