[Paper] Tool Calling is Linearly Readable and Steerable in Language Models
Source: arXiv - 2605.07990v1
Overview
The paper uncovers a surprisingly simple property of modern instruction‑tuned language models: the choice of which external tool to call is linearly readable and steerable inside the model’s hidden states. By probing and nudging internal activations, the authors can flip a model’s tool selection with near‑perfect accuracy, and even predict when a tool‑calling error is about to happen. This insight opens a new avenue for making AI agents more reliable and controllable.
Key Contributions
- Linear readout of tool identity – A single linear projection on hidden activations can recover the tool the model intends to use with 69‑82 % accuracy even in untuned base models.
- Linear steering of tool choice – Adding the mean‑difference vector between two tools’ activation centroids to the model’s hidden state forces the model to select the target tool in 77‑100 % of cases (≥ 93 % for models ≥ 4 B parameters).
- Error‑prediction signal – The gap between the top‑1 and top‑2 tool logits predicts failures: queries in the lowest gap quartile are 14‑21× more likely to produce a wrong tool call than those in the highest quartile.
- Localization of the effect – Activation‑patching isolates a handful of mid‑ and late‑layer attention heads and a single direction in the output‑layer row that drives the tool token, showing the phenomenon is not just a topic shift.
- Cross‑model and cross‑tool consistency – The linear readout works across 12 instruction‑tuned models (Gemma 3, Qwen 3/2.5, Llama 3.1) ranging from 270 M to 27 B parameters, and across 14 airline‑domain tools.
- Insight into pretraining vs. instruction‑tuning – Base models already encode the correct tool before generation (high cosine similarity), while instruction tuning aligns that latent representation with the actual output token.
Methodology
- Tool‑calling benchmark – Built a fixed‑menu, single‑turn prompt set where each query could be answered by one of several JSON‑schema tools (e.g., “search flight”, “book seat”).
- Probing – Collected hidden states (all layers) while the model processed the prompt. Trained a simple linear classifier (logistic regression) to predict the intended tool from these activations.
- Steering via activation injection – Computed the average activation vector for each tool (the “tool centroid”). To force a switch from tool A to tool B, added the difference Δ = mean_B – mean_A to the hidden state at a chosen layer and let the model continue generation.
- Activation patching – Replaced individual attention‑head outputs with those from a “correct” run and measured the impact on tool selection.
- Error‑gap analysis – Recorded the logit gap between the top two tool tokens; queries with small gaps were flagged as high‑risk.
- Cross‑model validation – Applied the same probes and steering vectors to 12 models of varying size and architecture to test robustness.
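All steps rely on standard transformer internals (hidden states, attention heads, output logits) and require no gradient updates to the model itself: the linear probe is fit separately on cached activations, and everything else needs only forward passes and simple vector arithmetic.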
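The probing and steering steps above can be sketched on synthetic data. This is a minimal illustration, not the authors' code: the activations are random toy vectors rather than real hidden states, the tool names are hypothetical, and a nearest‑centroid rule (which is also linear in the activation) stands in for the logistic‑regression probe described in the paper.

```python
import random

TOOLS = ["search_flight", "book_seat", "cancel_booking"]  # hypothetical tool names
DIM = 8  # toy hidden size (assumption; real models use thousands of dimensions)


def make_activations(rng, tool_index, n, noise=0.2):
    """Synthetic stand-in for hidden states: a tool-specific offset plus Gaussian noise."""
    rows = []
    for _ in range(n):
        h = [rng.gauss(0.0, noise) for _ in range(DIM)]
        h[tool_index] += 2.0  # each tool gets its own axis in this toy setup
        rows.append(h)
    return rows


def centroid(rows):
    """Mean activation vector for one tool (the "tool centroid")."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def readout(h, centroids):
    """Nearest-centroid readout, written as a linear score in h."""
    def score(c):
        return sum(a * b for a, b in zip(h, c)) - 0.5 * sum(b * b for b in c)
    return max(centroids, key=lambda name: score(centroids[name]))


def steer(h, centroids, source, target, alpha=1.0):
    """Add alpha * (mean_target - mean_source) to the activation h."""
    return [x + alpha * (t - s)
            for x, s, t in zip(h, centroids[source], centroids[target])]


rng = random.Random(0)
train = {name: make_activations(rng, i, 200) for i, name in enumerate(TOOLS)}
centroids = {name: centroid(rows) for name, rows in train.items()}

# Fresh "search_flight" activations, read out before and after steering.
test_rows = make_activations(rng, 0, 100)
before = [readout(h, centroids) for h in test_rows]
after = [readout(steer(h, centroids, "search_flight", "book_seat"), centroids)
         for h in test_rows]

readout_accuracy = before.count("search_flight") / len(before)
steering_success = after.count("book_seat") / len(after)
print(f"readout accuracy: {readout_accuracy:.2f}, steering success: {steering_success:.2f}")
```

In a real model the steered vector would be written back into the residual stream at the chosen layer and generation would continue from there. The toy version only checks that the readout flips after the shift.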
Results & Findings
| Model (size) | Linear readout accuracy (tool identity) | Steering success (name‑only prompt) |
|---|---|---|
| Gemma 3 12B | 71 % | 94 % |
| Gemma 3 27B | 78 % | 97 % |
| Llama 3.1 4B‑14B | 69‑82 % | 93‑100 % |
| Qwen 3 4B‑27B | 70‑84 % | 95‑100 % |
- Single‑direction control: Injecting a unit vector aligned with the output‑layer row for the target tool’s first token already achieves > 93 % steering, confirming that most of the effect lives in one direction.
- Attention‑head hotspots: Only 5‑8 mid‑to‑late‑layer heads need to be patched to reproduce the steering effect, suggesting a compact “tool‑selection circuit.”
- Error prediction: Queries where the top‑1/top‑2 logit gap falls in the lowest quartile produce wrong tool calls 14‑21× more often than those in the highest quartile.
- Base vs. tuned: Untuned base models encode the correct tool (high cosine similarity) but rarely emit it (2‑10 % generation accuracy). Instruction tuning aligns the latent representation with the output token, dramatically improving actual tool usage.
Overall, the study shows that tool selection is explicitly represented inside the model and can be read, edited, and monitored with minimal overhead.
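The error‑prediction finding can be made concrete with a small sketch of the logit‑gap computation and the lowest‑versus‑highest quartile comparison. The function names and the records below are illustrative assumptions, not data or code from the paper.

```python
def logit_gap(tool_logits):
    """Difference between the largest and second-largest tool logit."""
    if len(tool_logits) < 2:
        raise ValueError("need at least two tool logits")
    top1, top2 = sorted(tool_logits.values(), reverse=True)[:2]
    return top1 - top2


def error_rate(rows):
    """Fraction of (gap, is_wrong) rows whose tool call was wrong."""
    return sum(1 for _, wrong in rows if wrong) / len(rows)


def quartile_error_rates(records):
    """Return (error rate in lowest gap quartile, error rate in highest gap quartile)."""
    ordered = sorted(records, key=lambda r: r[0])
    q = len(ordered) // 4
    if q == 0:
        raise ValueError("need at least four records")
    return error_rate(ordered[:q]), error_rate(ordered[-q:])


# Hypothetical tool names and made-up records, for illustration only.
example_gap = logit_gap({"search_flight": 3.0, "book_seat": 2.5, "cancel_booking": 1.0})
records = [(0.1, True), (0.2, True), (0.5, False), (0.9, True),
           (1.5, False), (2.0, False), (3.0, False), (4.0, False)]
low_rate, high_rate = quartile_error_rates(records)
print(example_gap, low_rate, high_rate)
```

Dividing the two rates gives the kind of risk ratio reported above, provided the high‑gap quartile contains at least one error.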
Practical Implications
- Debug‑friendly agents – Developers can add a lightweight “watchdog” that reads the hidden state to verify the intended tool before execution, catching mismatches early and preventing costly mistakes (e.g., sending an email to the wrong recipient).
- Runtime steering – By injecting the appropriate Δ vector, a system can dynamically re‑route a request to a safer or more appropriate tool without re‑prompting or fine‑tuning the model. Useful for compliance (forcing a privacy‑preserving tool) or for A/B testing different tool implementations.
- Safety layers – The logit‑gap metric provides a cheap, on‑the‑fly confidence score for tool calls, enabling conditional fallbacks (e.g., ask the user for clarification when the gap is low).
- Model‑agnostic tooling – Since the phenomenon holds across architectures and scales, libraries can expose a generic API (read_tool, steer_tool, tool_confidence) that works with any modern LLM backend.
- Efficient fine‑tuning – Instead of full‑model RLHF for tool usage, a developer could fine‑tune only the small set of identified attention heads or add a linear adapter that directly maps the tool‑centroid vectors to the output token, saving compute and data.
- Pre‑training diagnostics – The fact that base models already encode tool identity suggests that pre‑training data quality (presence of tool‑like patterns) influences downstream tool reliability, guiding dataset curation for future LLMs.
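One way the watchdog and confidence‑fallback ideas above might fit together is sketched below. Everything here is hypothetical: `check_tool_call`, the `ToolCallCheck` type, and the `min_gap` threshold are illustrative names and values, and the probe readout and logit gap are assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ToolCallCheck:
    action: str  # "execute", "clarify", or "block"
    reason: str


def check_tool_call(probed_tool, emitted_tool, gap, min_gap=1.0):
    """Decide what to do with a proposed tool call.

    probed_tool:  tool read from hidden states by a linear probe (assumed input)
    emitted_tool: tool name the model actually generated
    gap:          top-1 minus top-2 tool logit
    min_gap:      illustrative threshold; it would need tuning per model
    """
    if probed_tool != emitted_tool:
        return ToolCallCheck(
            "block", f"probe read {probed_tool!r} but model emitted {emitted_tool!r}")
    if gap < min_gap:
        return ToolCallCheck(
            "clarify", f"logit gap {gap:.2f} is below {min_gap:.2f}")
    return ToolCallCheck("execute", "probe and output agree with a comfortable gap")


confident = check_tool_call("send_email", "send_email", gap=2.5)
uncertain = check_tool_call("send_email", "send_email", gap=0.3)
mismatch = check_tool_call("send_email", "delete_email", gap=2.5)
print(confident.action, uncertain.action, mismatch.action)
```

Checking for a mismatch before checking the gap means a disagreement between probe and output is never downgraded to a clarification request. That ordering is a design choice in this sketch, not something the paper prescribes.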
Limitations & Future Work
- Single‑turn, fixed‑menu setting – Experiments focus on one‑shot prompts with a static list of tools. Multi‑turn conversations and dynamic tool discovery remain fragile and need deeper study.
- Scope of tools – Only airline‑domain JSON tools were evaluated; it is unclear how the findings generalize to more complex or hierarchical tool suites (e.g., code execution, database queries).
- Steering side‑effects – The tool name flips cleanly, but the downstream JSON arguments adapt to the new tool only because its schema is present in the prompt. In more ambiguous cases, steering could produce malformed arguments.
- Interpretability depth – The identified attention heads are a promising entry point, but a full mechanistic model of how the “tool circuit” interacts with instruction tuning is still missing.
- Robustness to adversarial prompts – The linear steering technique could be abused to force a model into malicious tool usage; safeguards and detection mechanisms are an open research direction.
Future work may extend the probing/steering framework to multi‑turn agents, explore automated tool‑selection correction loops, and investigate how to embed similar linear control signals during pre‑training to produce inherently safer tool‑calling models.
Authors
- Zekun Wu
- Ze Wang
- Seonglae Cho
- Yufei Yang
- Adriano Koshiyama
- Sahan Bulathwela
- Maria Perez-Ortiz
Paper Information
- arXiv ID: 2605.07990v1
- Categories: cs.CL, cs.AI, cs.LG, cs.SE
- Published: May 8, 2026