[Paper] MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
Source: arXiv - 2601.05215v1
Overview
The paper introduces MineNPC-Task, a benchmark suite for evaluating large‑language‑model (LLM) agents that must remember and act in an open‑world environment, Minecraft. By turning real player‑driven quests into structured, machine‑checkable tasks, the authors provide a reproducible way to measure how well "memory‑aware" agents plan, act, and recover from mistakes.
Key Contributions
- User‑authored, real‑world tasks – Derived from co‑play sessions with expert Minecraft players, then distilled into parametric templates with explicit preconditions and dependencies.
- Mixed‑initiative evaluation harness – Captures a rich event log (plan previews, clarification requests, memory reads/writes, precondition checks, repair attempts) and scores agents against in‑world evidence rather than synthetic prompts; a sketch of such a log follows this list.
- Bounded‑knowledge policy – Agents are prohibited from using “out‑of‑world” shortcuts; all information must come from the agent’s own memory or the environment.
- Comprehensive validation suite – Machine‑checkable validators automatically verify each subtask’s success, enabling large‑scale, reproducible testing.
- Empirical baseline – Evaluates GPT‑4o on 216 subtasks across 8 experienced players, exposing systematic failure modes and the benefits of mixed‑initiative clarifications.
- Open‑source release – Full task definitions, validators, logs, and the harness are publicly available for the community.
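To make the logged evidence concrete, here is a minimal Python sketch of what harness events might look like. The event kinds mirror those named above, but the class names, fields, and payloads are illustrative assumptions, not the released log schema.

```python
from dataclasses import dataclass, field
from enum import Enum
import time

class EventKind(Enum):
    """Event categories named in the paper's harness description; the enum
    itself is an illustrative assumption, not the released schema."""
    PLAN_PREVIEW = "plan_preview"
    CLARIFICATION = "clarification"
    MEMORY_READ = "memory_read"
    MEMORY_WRITE = "memory_write"
    PRECONDITION_CHECK = "precondition_check"
    REPAIR_ATTEMPT = "repair_attempt"

@dataclass
class HarnessEvent:
    kind: EventKind
    subtask_id: str                               # hypothetical subtask identifier
    payload: dict = field(default_factory=dict)   # e.g., the previewed plan or question text
    timestamp: float = field(default_factory=time.time)

# Example: the agent previews a plan, then asks the player a clarifying question.
log = [
    HarnessEvent(EventKind.PLAN_PREVIEW, "craft_beacon",
                 {"steps": ["mine obsidian", "smelt glass", "craft beacon"]}),
    HarnessEvent(EventKind.CLARIFICATION, "craft_beacon",
                 {"question": "Which chest holds the nether star?"}),
]
```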
Methodology
- Task Collection – Researchers played Minecraft with expert players, recording natural quests (e.g., “craft a beacon”, “navigate to a hidden cave”).
- Template Normalization – Each quest is abstracted into a parametric template (variables for items, locations, etc.) with an explicit precondition graph that defines ordering and dependencies; see the sketch after this list.
- Agent Interface – Agents interact through a text‑based console that supports:
  - Plan previews (the agent's intended sequence of actions)
  - Clarification queries (the agent asks the human for missing info)
  - Memory ops (reads and writes to a lightweight episodic store)
- Bounded‑Knowledge Enforcement – The harness blocks any attempt to “cheat” by pulling external data; agents must rely on their internal memory or observations from the Minecraft world.
- Validation – For each subtask, a validator inspects the game state (inventory, player position, block changes) to decide success or failure, producing a numeric score (successful subtasks / attempted subtasks); the sketch after this list pairs such validators with the precondition graph.
- Human Rating – Players rated interaction quality and UI usability on Likert scales, providing qualitative feedback on the mixed‑initiative experience.
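As a concrete illustration of the template and validator machinery described above, here is a minimal Python sketch. The data structures, field names, and the scoring rule's handling of unmet preconditions are assumptions for illustration; the paper specifies only parametric templates, a precondition graph, machine‑checkable validators, and a score of successful over attempted subtasks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WorldState:
    """World-state snapshot the validator can inspect; field names are assumptions."""
    inventory: dict[str, int]
    position: tuple[int, int, int]

@dataclass
class Subtask:
    name: str
    requires: list[str]                      # subtasks that must succeed first
    validator: Callable[[WorldState], bool]  # machine-checkable success condition

@dataclass
class TaskTemplate:
    name: str
    params: dict[str, str]                   # template variables, e.g. {"item": "beacon"}
    subtasks: list[Subtask]

def score(template: TaskTemplate, state: WorldState) -> float:
    """Score = successful subtasks / attempted subtasks, honoring the
    precondition graph: a subtask counts as attempted only once its
    dependencies have passed (assumes subtasks are topologically ordered)."""
    passed: set[str] = set()
    attempted = 0
    for sub in template.subtasks:
        if not all(dep in passed for dep in sub.requires):
            continue                         # preconditions unmet: not attempted
        attempted += 1
        if sub.validator(state):
            passed.add(sub.name)
    return len(passed) / attempted if attempted else 0.0

# Example instantiation of a "craft a beacon" quest.
beacon = TaskTemplate(
    name="craft_{item}",
    params={"item": "beacon"},
    subtasks=[
        Subtask("gather_obsidian", [], lambda s: s.inventory.get("obsidian", 0) >= 3),
        Subtask("craft_beacon", ["gather_obsidian"],
                lambda s: s.inventory.get("beacon", 0) >= 1),
    ],
)
print(score(beacon, WorldState({"obsidian": 3, "beacon": 1}, (0, 64, 0))))  # 1.0
```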
Results & Findings
- Overall Performance – GPT‑4o completed ≈ 62 % of the 216 subtasks successfully.
- Common Failure Modes:
  - Code execution errors (e.g., malformed command strings)
  - Inventory mis‑management (dropping needed items, forgetting to craft intermediate tools)
  - Reference errors (confusing similarly named objects or locations)
  - Navigation glitches (getting stuck on terrain or taking sub‑optimal routes)
- Recovery via Clarifications – When the agent asked for clarification, success rates rose to ≈ 78 % on those subtasks, highlighting the value of mixed‑initiative dialogue (sketched as a control loop after this list).
- Memory Persistence Gap – Participants noted that the agent often “forgot” facts learned early in a session, leading to repeated clarification requests.
- User Experience – Interaction quality scored 4.2/5 and UI usability 4.0/5, indicating that the console‑based interface is approachable for seasoned Minecraft players.
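The clarification‑driven recovery these results point to can be pictured as a small control loop: attempt the subtask, and on a failed check ask the human one question, store the answer, and retry. The sketch below is a hypothetical rendering of that loop, reusing the Subtask type from the Methodology sketch; the Agent interface and all function names are assumptions, not the harness's actual console protocol.

```python
from typing import Optional, Protocol

class Agent(Protocol):
    """Hypothetical console-facing interface; method names are illustrative
    assumptions, not the paper's actual API."""
    def preview_plan(self, subtask, memory) -> list[str]: ...
    def execute(self, plan: list[str], state) -> None: ...
    def formulate_clarification(self, subtask, state) -> Optional[str]: ...

def run_subtask(agent: Agent, ask_human, memory: dict,
                subtask, state, max_repairs: int = 2) -> bool:
    """Attempt a subtask; after each failed validator check, ask the human at
    most one clarification question, persist the answer, and retry."""
    for _ in range(1 + max_repairs):
        plan = agent.preview_plan(subtask, memory)     # logged as a plan preview
        agent.execute(plan, state)
        if subtask.validator(state):                   # machine-checkable success
            return True
        question = agent.formulate_clarification(subtask, state)
        if question:
            answer = ask_human(question)               # mixed-initiative dialogue
            memory[question] = answer                  # persist so it is not re-asked
    return False
```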
Practical Implications
- Benchmark for Embodied AI – MineNPC-Task gives developers a concrete, reproducible yardstick for testing memory‑augmented agents before deploying them in games, simulations, or robotics.
- Designing Better Agent Memory – The observed forgetting patterns suggest that future agents need persistent, hierarchical memory structures (e.g., long‑term world models plus short‑term task buffers); one possible shape is sketched after this list.
- Mixed‑Initiative Interfaces – Incorporating clarification dialogs can dramatically improve reliability, encouraging UI designs that let agents ask “why?” or “what’s the exact block type?” in real time.
- Safety via Bounded Knowledge – Enforcing no‑cheat policies ensures agents learn to rely on perception and memory, a principle useful for safety‑critical embodied systems (e.g., warehouse robots).
- Rapid Prototyping – Because the task suite is parametric, developers can generate new quests on the fly, enabling continuous integration testing for LLM‑driven bots in sandbox environments.
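One possible shape for such a two‑tier memory is sketched below: a persistent long‑term store for durable world facts plus a bounded, recency‑ordered buffer for the current task. This is a design sketch motivated by the paper's findings, not the authors' memory system; the class, its names, and the eviction policy are all assumptions.

```python
from collections import OrderedDict

class HierarchicalMemory:
    """Two-tier memory sketch: a persistent long-term store for world facts
    plus a bounded, recency-ordered buffer for the current task. A design
    sketch motivated by the paper's findings, not the authors' system."""

    def __init__(self, buffer_size: int = 32):
        self.long_term: dict[str, str] = {}              # survives across subtasks
        self.short_term: OrderedDict[str, str] = OrderedDict()
        self.buffer_size = buffer_size

    def remember(self, key: str, value: str, durable: bool = False) -> None:
        if durable:
            self.long_term[key] = value                  # e.g., "village at (120, 64, -300)"
        else:
            self.short_term[key] = value
            self.short_term.move_to_end(key)             # mark as most recently used
            if len(self.short_term) > self.buffer_size:
                self.short_term.popitem(last=False)      # evict the oldest entry

    def recall(self, key: str) -> str | None:
        # Check the task buffer first, then fall back to the durable world model.
        return self.short_term.get(key) or self.long_term.get(key)
```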
Limitations & Future Work
- Single LLM Baseline – The study only evaluates GPT‑4o; results may differ with other model families or smaller parameter counts.
- Minecraft‑Specific Domain – While the benchmark is rich, its findings may not directly transfer to non‑voxel or non‑sandbox domains without adaptation.
- Memory Model Simplicity – The current lightweight memory store lacks hierarchical or forgetting mechanisms, which the authors identify as a key area for improvement.
- Scalability of Human Validation – Although validators are automated, the initial task authoring still relies on expert players; scaling to broader task libraries will need crowdsourced or synthetic generation pipelines.
The authors invite the community to extend the suite, plug in alternative memory architectures, and explore richer mixed‑initiative protocols, setting the stage for more capable, memory‑aware embodied agents.
Authors
- Tamil Sudaravan Mohan Doss
- Michael Xu
- Sudha Rao
- Andrew D. Wilson
- Balasaravanan Thoravi Kumaravel
Paper Information
- arXiv ID: 2601.05215v1
- Categories: cs.AI
- Published: January 8, 2026