[Paper] SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments

Published: December 10, 2025 at 01:26 PM EST
4 min read
Source: arXiv - 2512.09897v1

Overview

The paper presents SCOPE, a new way to turn a large language model (LLM) into a “one‑time teacher” for hierarchical planning in purely textual environments. By extracting subgoals from the LLM just once—at the start of training—SCOPE pre‑trains a lightweight student planner that can operate without any further LLM calls, dramatically cutting compute cost while still beating the prior state‑of‑the‑art on the TextCraft benchmark.

Key Contributions

  • One‑Shot Subgoal Generation – Uses an LLM only at initialization to produce subgoals from example trajectories, eliminating repeated prompting during training and inference.
  • Subgoal‑Conditioned Pretraining (SCOPE) – Introduces a lightweight hierarchical planner that learns to follow the LLM‑generated subgoals, effectively distilling the LLM’s world knowledge into a compact model.
  • Efficiency Gains – Reduces inference latency from ~164 s (LLM‑based ADaPT) to ~3 s while achieving a higher success rate (0.56 vs 0.52).
  • Empirical Validation – Demonstrates that even suboptimal LLM‑generated subgoals provide a strong scaffold for hierarchical decomposition in the TextCraft text‑based planning environment.

Methodology

  1. Collect Example Trajectories – Gather a modest set of successful (or partially successful) action sequences from the target text environment.
  2. LLM Subgoal Extraction (One‑Shot) – Prompt a large pretrained LLM (e.g., GPT‑4) with each trajectory and ask it to break the sequence into high‑level subgoals (e.g., “collect wood”, “build a shelter”). This step runs only once; a minimal extraction sketch follows this list.
  3. Student Planner Architecture – Build a two‑level model:
    • High‑Level Policy predicts which subgoal to pursue next, conditioned on the current textual observation.
    • Low‑Level Policy executes primitive actions to achieve the chosen subgoal.
  4. Subgoal‑Conditioned Pretraining – Train the student planner on the extracted subgoals using standard supervised learning (cross‑entropy for subgoal selection, imitation loss for low‑level actions). No LLM queries are needed after this stage; a training sketch appears after the list.
  5. Fine‑Tuning (Optional) – A short fine‑tuning phase on the target environment can further adapt the student without invoking the LLM again.
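
As a concrete illustration of step 2, the loop below shows how the one‑time teacher query might be scripted. This is a minimal sketch, not the paper’s actual pipeline: `call_llm` is a hypothetical wrapper around whatever LLM provider is used, and the JSON output format is an assumption.

```python
import json

# Hypothetical prompt; the paper's actual prompt format is not reproduced here.
PROMPT = """Split this successful trajectory from a text environment into an
ordered list of high-level subgoals. Return a JSON list of short strings.

Trajectory:
{trajectory}"""

def extract_subgoals(trajectories, call_llm):
    """Query the teacher LLM once per example trajectory, at initialization only."""
    labeled = []
    for traj in trajectories:
        response = call_llm(PROMPT.format(trajectory="\n".join(traj)))
        subgoals = json.loads(response)  # e.g. ["collect wood", "build a shelter"]
        labeled.append({"trajectory": traj, "subgoals": subgoals})
    return labeled  # cached; every later training step reads from this, not the LLM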

The overall pipeline is analogous to “teacher‑student” distillation, but the teacher’s guidance is provided a single time rather than repeatedly during learning.
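
The two‑level student and its supervised objective (steps 3–4) can be sketched as follows. This is an illustrative PyTorch sketch under assumed details: the feature encoder, layer sizes, and one‑hot subgoal conditioning are placeholders, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class StudentPlanner(nn.Module):
    def __init__(self, obs_dim, n_subgoals, n_actions, hidden=256):
        super().__init__()
        # Shared observation encoder (a real model would embed raw text;
        # a pre-encoded feature vector is assumed here).
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # High-level policy: choose the next subgoal from the observation.
        self.high = nn.Linear(hidden, n_subgoals)
        # Low-level policy: choose a primitive action, conditioned on the
        # observation features and the active subgoal (one-hot).
        self.low = nn.Linear(hidden + n_subgoals, n_actions)

    def forward(self, obs, subgoal_onehot):
        h = self.encoder(obs)
        return self.high(h), self.low(torch.cat([h, subgoal_onehot], dim=-1))

ce = nn.CrossEntropyLoss()

def pretrain_step(model, optimizer, batch):
    """One supervised update: cross-entropy on the subgoal choice plus an
    imitation (behavior-cloning) loss on the demonstrated action."""
    subgoal_logits, action_logits = model(batch["obs"], batch["subgoal_onehot"])
    loss = ce(subgoal_logits, batch["subgoal_id"]) + ce(action_logits, batch["action_id"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that during pretraining the low‑level head is conditioned on the LLM‑labeled subgoal (teacher forcing), while the high‑level head learns to predict that same label.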

Results & Findings

| Metric                     | ADaPT (LLM‑based) | SCOPE   |
|----------------------------|-------------------|---------|
| Success Rate (TextCraft)   | 0.52              | 0.56    |
| Inference Time per Episode | 164.4 s           | 3.0 s   |
| Model Size (Student)       | –                 | ~30 M parameters (≈ 1 % of LLM) |
  • Higher Success with Far Less Latency – SCOPE outperforms the prior hierarchical agent with a 55× speedup, making real‑time deployment feasible.
  • Robustness to Suboptimal Subgoals – Even though the LLM‑generated subgoals are not perfectly optimal, the student learns to compensate, indicating that the hierarchical scaffold is more valuable than exact optimality.
  • Scalability – Because the LLM is queried only once, the approach scales to larger datasets and more complex environments without a proportional increase in compute cost.

Practical Implications

  • Deployable Agents – Developers can embed the lightweight student planner into games, interactive fiction, or text‑based tutoring systems where latency and resource constraints matter.
  • Cost‑Effective Knowledge Transfer – Organizations can leverage expensive LLM APIs a single time to bootstrap domain‑specific planners, then run them entirely offline.
  • Rapid Prototyping – The one‑shot subgoal extraction pipeline can be scripted to work with any LLM provider, enabling quick iteration on new text environments without retraining large models.
  • Hybrid Systems – SCOPE’s architecture lends itself to a “fallback” design: use the student planner for most decisions, and only call the LLM in rare edge cases where the student’s confidence is low (see the sketch after this list).
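
A minimal version of that fallback pattern, assuming the student exposes its high‑level subgoal logits and reusing a hypothetical `call_llm` hook; the 0.9 confidence threshold is an illustrative choice, not something the paper specifies.

```python
import torch.nn.functional as F

def choose_subgoal(high_logits, subgoal_names, obs_text, call_llm, threshold=0.9):
    """Trust the student's subgoal prediction when confident; otherwise
    make a rare fallback call to the LLM."""
    probs = F.softmax(high_logits, dim=-1)
    confidence, idx = probs.max(dim=-1)
    if confidence.item() >= threshold:
        return subgoal_names[idx.item()], "student"
    # Low-confidence edge case: one extra teacher query.
    return call_llm(f"Suggest the next subgoal given: {obs_text}"), "llm"
```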

Limitations & Future Work

  • Explainability Trade‑off – Because subgoals are generated only once, developers cannot inspect or adjust them dynamically during training, limiting interpretability.
  • Subgoal Quality Dependency – The approach assumes that the LLM’s one‑shot subgoals are at least roughly sensible; highly noisy subgoals could degrade performance.
  • Domain Generalization – Experiments are confined to TextCraft; extending SCOPE to richer multimodal or embodied environments remains an open question.
  • Future Directions – The authors suggest exploring adaptive subgoal refinement (e.g., occasional LLM re‑queries) and applying SCOPE to code‑generation or API‑calling tasks where hierarchical planning is also critical.

Authors

  • Haoye Lu
  • Pavan Seshadri
  • Kaheer Suleman

Paper Information

  • arXiv ID: 2512.09897v1
  • Categories: cs.AI, cs.CL
  • Published: December 10, 2025