[Paper] SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments

Published: December 10, 2025 at 01:26 PM EST
4 min read
Source: arXiv - 2512.09897v1

Overview

The paper presents SCOPE, a new way to turn a large language model (LLM) into a “one‑time teacher” for hierarchical planning in purely textual environments. By extracting subgoals from the LLM just once—at the start of training—SCOPE pre‑trains a lightweight student planner that can operate without any further LLM calls, dramatically cutting compute cost while still beating the prior state‑of‑the‑art on the TextCraft benchmark.

Key Contributions

  • One‑Shot Subgoal Generation – Uses an LLM only at initialization to produce subgoals from example trajectories, eliminating repeated prompting during training and inference.
  • Subgoal‑Conditioned Pretraining (SCOPE) – Introduces a lightweight hierarchical planner that learns to follow the LLM‑generated subgoals, effectively distilling the LLM’s world knowledge into a compact model.
  • Efficiency Gains – Reduces inference latency from ~164 s (LLM‑based ADaPT) to ~3 s while achieving a higher success rate (0.56 vs 0.52).
  • Empirical Validation – Demonstrates that even suboptimal LLM‑generated subgoals provide a strong scaffold for hierarchical decomposition in the TextCraft text‑based planning environment.

Methodology

  1. Collect Example Trajectories – Gather a modest set of successful (or partially successful) action sequences from the target text environment.
  2. LLM Subgoal Extraction (One‑Shot) – Prompt a large pretrained LLM (e.g., GPT‑4) with each trajectory and ask it to break the sequence into high‑level subgoals (e.g., “collect wood”, “build a shelter”). This step runs only once; a minimal extraction sketch follows this list.
  3. Student Planner Architecture – Build a two‑level model:
    • High‑Level Policy predicts which subgoal to pursue next, conditioned on the current textual observation.
    • Low‑Level Policy executes primitive actions to achieve the chosen subgoal.
  4. Subgoal‑Conditioned Pretraining – Train the student planner on the extracted subgoals using standard supervised learning (cross‑entropy for subgoal selection, imitation loss for low‑level actions). No LLM queries are needed after this stage; a training sketch appears after the list.
  5. Fine‑Tuning (Optional) – A short fine‑tuning phase on the target environment can further adapt the student without invoking the LLM again.
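
As a concrete illustration of step 2, the loop below shows how the one‑time teacher query might be scripted. This is a minimal sketch, not the paper’s actual pipeline: `call_llm` is a hypothetical wrapper around whatever LLM provider is used, and the JSON output format is an assumption.

```python
import json

# Hypothetical prompt; the paper's actual prompt format is not reproduced here.
PROMPT = """Split this successful trajectory from a text environment into an
ordered list of high-level subgoals. Return a JSON list of short strings.

Trajectory:
{trajectory}"""

def extract_subgoals(trajectories, call_llm):
    """Query the teacher LLM once per example trajectory, at initialization only."""
    labeled = []
    for traj in trajectories:
        response = call_llm(PROMPT.format(trajectory="\n".join(traj)))
        subgoals = json.loads(response)  # e.g. ["collect wood", "build a shelter"]
        labeled.append({"trajectory": traj, "subgoals": subgoals})
    return labeled  # cached; every later training step reads from this, not the LLM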

The overall pipeline is analogous to “teacher‑student” distillation, but the teacher’s guidance is provided a single time rather than repeatedly during learning.
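
The two‑level student and its supervised objective (steps 3–4) can be sketched as follows. This is an illustrative PyTorch sketch under assumed details: the feature encoder, layer sizes, and one‑hot subgoal conditioning are placeholders, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class StudentPlanner(nn.Module):
    def __init__(self, obs_dim, n_subgoals, n_actions, hidden=256):
        super().__init__()
        # Shared observation encoder (a real model would embed raw text;
        # a pre-encoded feature vector is assumed here).
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # High-level policy: choose the next subgoal from the observation.
        self.high = nn.Linear(hidden, n_subgoals)
        # Low-level policy: choose a primitive action, conditioned on the
        # observation features and the active subgoal (one-hot).
        self.low = nn.Linear(hidden + n_subgoals, n_actions)

    def forward(self, obs, subgoal_onehot):
        h = self.encoder(obs)
        return self.high(h), self.low(torch.cat([h, subgoal_onehot], dim=-1))

ce = nn.CrossEntropyLoss()

def pretrain_step(model, optimizer, batch):
    """One supervised update: cross-entropy on the subgoal choice plus an
    imitation (behavior-cloning) loss on the demonstrated action."""
    subgoal_logits, action_logits = model(batch["obs"], batch["subgoal_onehot"])
    loss = ce(subgoal_logits, batch["subgoal_id"]) + ce(action_logits, batch["action_id"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that during pretraining the low‑level head is conditioned on the LLM‑labeled subgoal (teacher forcing), while the high‑level head learns to predict that same label.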

Results & Findings

| Metric                     | ADaPT (LLM‑based) | SCOPE   |
|----------------------------|-------------------|---------|
| Success Rate (TextCraft)   | 0.52              | 0.56    |
| Inference Time per Episode | 164.4 s           | 3.0 s   |
| Model Size (Student)       | –                 | ~30 M parameters (≈ 1 % of LLM) |
  • Higher Success with Far Less Latency – SCOPE outperforms the prior hierarchical agent with a 55× speedup, making real‑time deployment feasible.
  • Robustness to Suboptimal Subgoals – Even though the LLM‑generated subgoals are not perfectly optimal, the student learns to compensate, indicating that the hierarchical scaffold is more valuable than exact optimality.
  • Scalability – Because the LLM is queried only once, the approach scales to larger datasets and more complex environments without a proportional increase in compute cost.

Practical Implications

  • Deployable Agents – Developers can embed the lightweight student planner into games, interactive fiction, or text‑based tutoring systems where latency and resource constraints matter.
  • Cost‑Effective Knowledge Transfer – Organizations can leverage expensive LLM APIs a single time to bootstrap domain‑specific planners, then run them entirely offline.
  • Rapid Prototyping – The one‑shot subgoal extraction pipeline can be scripted to work with any LLM provider, enabling quick iteration on new text environments without retraining large models.
  • Hybrid Systems – SCOPE’s architecture lends itself to a “fallback” design: use the student planner for most decisions, and only call the LLM in rare edge cases where the student’s confidence is low (see the sketch after this list).
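
A minimal version of that fallback pattern, assuming the student exposes its high‑level subgoal logits and reusing a hypothetical `call_llm` hook; the 0.9 confidence threshold is an illustrative choice, not something the paper specifies.

```python
import torch.nn.functional as F

def choose_subgoal(high_logits, subgoal_names, obs_text, call_llm, threshold=0.9):
    """Trust the student's subgoal prediction when confident; otherwise
    make a rare fallback call to the LLM."""
    probs = F.softmax(high_logits, dim=-1)
    confidence, idx = probs.max(dim=-1)
    if confidence.item() >= threshold:
        return subgoal_names[idx.item()], "student"
    # Low-confidence edge case: one extra teacher query.
    return call_llm(f"Suggest the next subgoal given: {obs_text}"), "llm"
```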

Limitations & Future Work

  • Explainability Trade‑off – Because subgoals are generated only once, developers cannot inspect or adjust them dynamically during training, limiting interpretability.
  • Subgoal Quality Dependency – The approach assumes that the LLM’s one‑shot subgoals are at least roughly sensible; highly noisy subgoals could degrade performance.
  • Domain Generalization – Experiments are confined to TextCraft; extending SCOPE to richer multimodal or embodied environments remains an open question.
  • Future Directions – The authors suggest exploring adaptive subgoal refinement (e.g., occasional LLM re‑queries) and applying SCOPE to code‑generation or API‑calling tasks where hierarchical planning is also critical.

Authors

  • Haoye Lu
  • Pavan Seshadri
  • Kaheer Suleman

Paper Information

  • arXiv ID: 2512.09897v1
  • Categories: cs.AI, cs.CL
  • Published: December 10, 2025