[Paper] A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
Source: arXiv - 2602.08964v1
Overview
The paper introduces a systematic framework for judging whether a language‑model (LLM)‑based agent is truly goal‑directed. By combining classic behavioural tests with probing of the model’s internal “thoughts,” the authors show how to tell if an LLM not only acts like it’s pursuing a goal but also represents that goal internally. Their case study—an LLM navigating a 2‑D grid world—offers concrete evidence that both external performance and internal representations matter when evaluating autonomous agents.
Key Contributions
- Unified evaluation framework that blends behavioural metrics (success rate, optimality) with interpretability probes of hidden states.
- Comprehensive behavioural benchmark across grid sizes, obstacle densities, and varied goal structures, demonstrating robustness to difficulty‑preserving transformations.
- Probing methodology that decodes spatial maps and multi‑step plans from the LLM’s hidden layers, revealing how the model internally tracks position, goal location, and action intent.
- Empirical finding that the LLM builds a coarse, non‑linear spatial map and dynamically reorganises it as reasoning proceeds—from global layout cues to immediate action cues.
- Argument for introspection: purely behavioural tests can miss misaligned internal representations; probing is essential for trustworthy agent design.
Methodology
- Task Setup – An LLM is prompted to act as an agent in a deterministic 2‑D grid world. The goal is to reach a target cell while avoiding obstacles.
- Behavioural Evaluation – The agent’s trajectory is compared against an optimal policy (computed via A* search). Experiments vary:
  - Grid dimensions (e.g., 5×5 up to 30×30)
  - Obstacle density (sparse to dense)
  - Goal complexity (single target vs. multi‑step sub‑goals)
  - Difficulty‑preserving transformations (e.g., rotating the map) to test invariance.
- Representation Probing – Hidden activations from each transformer layer are extracted after every reasoning step. Linear and non‑linear probes are trained to predict:
  - Current (x, y) position of the agent
  - Goal coordinates
  - Planned action sequence (up to 3 steps ahead)
Probes are evaluated on held‑out grids to ensure they capture genuine internal coding rather than overfitting; a minimal sketch of this probing setup follows the list below.
- Analysis of Dynamics – By tracking probe accuracy over the reasoning chain, the authors observe how the model’s internal map shifts focus from global layout to immediate action cues.
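To make the probing recipe concrete, here is a minimal sketch of a linear position probe. It is illustrative rather than the authors' implementation: it assumes per‑step hidden activations from one transformer layer have already been cached as arrays, uses scikit‑learn's logistic regression as the linear probe, and encodes the (x, y) cell as a single class label; the data in the demo block is random filler standing in for real activations.

```python
# Minimal linear-probe sketch (illustrative; not the paper's code).
# `train_acts`/`test_acts`: cached hidden activations from one layer,
# shape (n_examples, hidden_dim); `train_xy`/`test_xy`: ground-truth agent
# cells. Train and test examples come from disjoint (held-out) grids.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_position_probe(train_acts, train_xy, grid_size):
    """Fit one multi-class classifier over flattened (x, y) cell indices."""
    labels = train_xy[:, 0] * grid_size + train_xy[:, 1]  # (x, y) -> cell id
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_acts, labels)
    return probe

def probe_accuracy(probe, acts, xy, grid_size):
    """Decoding accuracy; measured on held-out grids to rule out memorisation."""
    labels = xy[:, 0] * grid_size + xy[:, 1]
    return float((probe.predict(acts) == labels).mean())

if __name__ == "__main__":
    # Random placeholders standing in for cached activations and labels.
    rng = np.random.default_rng(0)
    hidden_dim, grid_size = 256, 5
    train_acts = rng.normal(size=(400, hidden_dim))
    train_xy = rng.integers(0, grid_size, size=(400, 2))
    test_acts = rng.normal(size=(100, hidden_dim))
    test_xy = rng.integers(0, grid_size, size=(100, 2))
    probe = fit_position_probe(train_acts, train_xy, grid_size)
    print("held-out decoding accuracy:", probe_accuracy(probe, test_acts, test_xy, grid_size))
```

Repeating the same fit per layer and per reasoning step yields accuracy‑over‑time curves of the kind used in the dynamics analysis, and swapping the classifier for a small MLP gives the non‑linear variant.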
Results & Findings
- Behavioural robustness: Success rates stay high (>85 %) on moderate‑size grids and degrade gracefully as difficulty rises, matching the optimal policy’s performance curve; the underlying metrics are sketched after this list. Rotations and reflections of the grid do not significantly affect outcomes, indicating learned invariance.
- Internal spatial map: Probes can recover the agent’s position and goal location with ~70 % accuracy from mid‑to‑late transformer layers, despite the map being encoded non‑linearly. Early layers contain weaker spatial signals.
- Action‑plan alignment: Next‑step actions predicted by the probes match the actions actually taken more than 80 % of the time, confirming that the hidden states actively inform decision‑making rather than being an opaque by‑product.
- Reasoning dynamics: As the model reasons, probe performance for global cues (full map) drops while local cue accuracy (next action) rises, suggesting a re‑allocation of representational bandwidth toward immediate execution.
- Goal complexity: When the goal requires multi‑step sub‑goals (e.g., collect a key before opening a door), the LLM still forms a usable map but probe accuracy for the final goal drops, hinting at limits in hierarchical planning.
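For reference, the two behavioural numbers above can be computed from logged episodes in a few lines. The sketch below is an assumption‑laden stand‑in for the paper's evaluation harness, not the authors' code: it treats the grid as deterministic with unit step costs, so breadth‑first search returns the same shortest‑path length that A* would, and the names (`grid`, `start`, `goal`, `agent_path`, `evaluate_episodes`) are hypothetical.

```python
# Illustrative behavioural metrics for a unit-cost grid world (not the paper's code).
# On a deterministic grid with unit step costs, BFS finds the same shortest-path
# length as A*, so it stands in here for the paper's A* baseline.
from collections import deque

def shortest_path_length(grid, start, goal):
    """BFS shortest-path length; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None  # goal unreachable

def evaluate_episodes(episodes):
    """episodes: dicts with 'grid', 'start', 'goal', and 'agent_path' (list of visited cells)."""
    successes, optimality_ratios = [], []
    for ep in episodes:
        reached = ep["agent_path"][-1] == ep["goal"]
        successes.append(reached)
        optimal = shortest_path_length(ep["grid"], ep["start"], ep["goal"])
        if reached and optimal:
            # Ratio of optimal length to the agent's actual number of moves (1.0 = optimal).
            optimality_ratios.append(optimal / (len(ep["agent_path"]) - 1))
    return {
        "success_rate": sum(successes) / len(successes),
        "mean_optimality": sum(optimality_ratios) / max(len(optimality_ratios), 1),
    }
```

A mean optimality of 1.0 would mean every successful episode used a path as short as the optimal one.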
Practical Implications
- Debugging LLM agents: Developers can now attach probing heads to monitor whether an agent truly “knows” where it is and where it is headed, catching hidden failures before deployment; see the monitoring sketch after this list.
- Safety & alignment: By verifying that internal representations align with intended objectives, teams can reduce the risk of goal‑drift in autonomous systems (e.g., robotics, game AI, or workflow automation).
- Design of prompting strategies: The findings suggest that prompting LLMs to explicitly request “state summaries” could reinforce internal map formation, leading to more reliable navigation or planning behaviours.
- Benchmarking standards: The combined behavioural + introspection suite can become a new baseline for evaluating any LLM‑driven agent, from virtual assistants to self‑optimising code generators.
- Transfer to other domains: The probing approach is domain‑agnostic; similar techniques could assess LLMs handling code execution, database queries, or network routing, where an internal notion of “state” is crucial.
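As a concrete version of the debugging idea, a probe trained as in the earlier sketch can be run alongside the agent and its decoded positions compared against the environment's ground truth. The helpers below are hypothetical, not tooling from the paper: `probe` is any fitted classifier over flattened cell indices, and the per‑step activations and true positions are assumed to come from the user's own logging.

```python
# Hypothetical monitoring hook (illustrative; not tooling from the paper).
# Decodes each step's hidden activation with a trained probe and reports the
# steps where the decoded cell disagrees with the environment's ground truth.
import numpy as np

def decode_positions(probe, step_activations, grid_size):
    """Map each step's hidden vector to an (x, y) cell via a fitted probe."""
    cells = probe.predict(np.asarray(step_activations))
    return [divmod(int(c), grid_size) for c in cells]  # cell id -> (x, y)

def state_tracking_report(decoded_xy, true_xy):
    """Steps where the probe's decoded position diverges from the true position."""
    return [
        {"step": t, "decoded": tuple(d), "actual": tuple(a)}
        for t, (d, a) in enumerate(zip(decoded_xy, true_xy))
        if tuple(d) != tuple(a)
    ]
```

A non‑empty report flags exactly the kind of silently drifting internal state that, per the paper's argument, behavioural metrics alone would not surface.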
Limitations & Future Work
- Scale of the environment: Experiments stop at ~30×30 grids; it remains unclear how the approach scales to larger, more complex worlds or continuous spaces.
- Probe interpretability: Probes are trained classifiers; they reveal what can be decoded but not how the model arrives at those encodings, leaving a gap in causal understanding.
- Single‑model focus: The study uses one LLM architecture (a decoder‑only transformer). Generalising to encoder‑decoder models, retrieval‑augmented systems, or multimodal agents needs further work.
- Hierarchical planning: Performance drops on multi‑step sub‑goal tasks, indicating that current LLMs may struggle with deep hierarchical reasoning—future research could integrate external planners or memory modules.
- Real‑world transfer: Moving from a simulated grid to physical robots or interactive software agents will introduce noise, partial observability, and timing constraints that the current framework does not address.
Bottom line: By marrying behavioural tests with internal‑state probing, this work gives developers a practical toolbox for certifying that LLM agents are not just acting goal‑directed, but also thinking about their goals in a way we can observe and trust.
Authors
- Raghu Arghal
- Fade Chen
- Niall Dalton
- Evgenii Kortukov
- Calum McNamara
- Angelos Nalmpantis
- Moksh Nirvaan
- Gabriele Sarti
- Mario Giulianelli
Paper Information
- arXiv ID: 2602.08964v1
- Categories: cs.LG, cs.AI, cs.CL, cs.CY
- Published: February 9, 2026