[Paper] Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

Published: February 16, 2026 at 11:10 AM EST

Source: arXiv - 2602.14878v1

Overview

The paper investigates a surprisingly common problem in the emerging Model Context Protocol (MCP) ecosystem: the natural‑language descriptions that tell large language model (LLM) agents how to use external tools are often poorly written, or “smelly.” By systematically measuring these smells across hundreds of tools and testing how fixing them affects agent performance, the authors reveal a clear trade‑off between description quality, success rates, and execution cost.

Key Contributions

  • Empirical survey of MCP tool descriptions – Analyzed 856 tools from 103 MCP servers, the largest study of its kind.
  • Six‑component rubric – Defined a concrete set of description elements (purpose, inputs, outputs, constraints, examples, and error handling) and a scoring system to detect “smells” (a minimal scoring sketch follows this list).
  • Automated FM‑based scanner – Built a language‑model‑driven tool that flags missing or ambiguous components in real time.
  • Impact assessment – Demonstrated that augmenting descriptions raises overall task success by a median of 5.85 percentage points and partial goal completion by 15.12 %, but also adds ~67 % more execution steps.
  • Component ablation study – Showed that compact subsets of the six components can retain most reliability while cutting token usage and cost.
  • Open‑source artifacts – Released the scanner, the annotated dataset, and scripts for reproducibility.
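
To make the rubric concrete, here is a minimal sketch of how a description could be checked against the six components and summed into the 0–6 score described in the Methodology section below. The keyword cues and the `score_description` helper are illustrative placeholders, not the authors' detector.

```python
# Illustrative sketch of the six-component rubric as a checklist.
# Component names follow the paper's rubric; the keyword cues below are
# simplified stand-ins for whatever the actual scanner checks.

CUES = {
    "purpose":        ["use this tool to", "retrieves", "searches", "computes"],
    "inputs":         ["parameter", "argument", "accepts", "expects"],
    "outputs":        ["returns", "response contains", "output is"],
    "constraints":    ["must", "only", "maximum", "requires"],
    "examples":       ["example", "e.g.", "for instance"],
    "error_handling": ["error", "fails", "invalid", "raises"],
}

def score_description(description: str) -> dict:
    """Binary present/absent check per component, summed into a 0-6 quality score."""
    text = description.lower()
    presence = {name: any(cue in text for cue in cues) for name, cues in CUES.items()}
    return {"presence": presence, "score": sum(presence.values())}

if __name__ == "__main__":
    print(score_description("Fetches data."))  # low score: almost every component missing
```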

Methodology

  1. Data collection – Crawled public MCP servers (e.g., OpenAI Functions, LangChain, LlamaIndex) to gather tool definitions and their natural‑language descriptions.
  2. Rubric design – Synthesized prior work on API documentation and prompt engineering to identify six essential description components. Each component received a binary “present/absent” score, summed to a 0‑6 quality rating.
  3. Smell detection – Trained a small LLM (GPT‑3.5) to classify each description according to the rubric, turning the rubric into an automated scanner (a sketch of this idea follows this list).
  4. Experimental setup – Ran a suite of 30 benchmark tasks (question answering, data retrieval, code generation) using a baseline FM‑agent (GPT‑4) against the original tool set, then against an “augmented” set where missing components were added manually or via the scanner.
  5. Metrics – Measured task success (binary), partial goal completion (percentage of sub‑tasks achieved), execution steps (number of tool calls), and token cost (proxy for monetary cost).
  6. Ablation – Systematically removed individual components from the augmented descriptions to see which were most cost‑effective.
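
As a rough sketch of how an FM-based scanner in this spirit could be wired up, the snippet below prompts a chat model to label each rubric component as present or missing, assuming the OpenAI Python client; the prompt wording, output format, and model choice are guesses for illustration, not the paper's actual scanner.

```python
# Hypothetical FM-based "smell" scanner: ask a chat model to label each rubric
# component as present or missing in a tool description. Prompt, model, and
# output schema are illustrative; robust output parsing is omitted.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = ["purpose", "inputs", "outputs", "constraints", "examples", "error_handling"]

def scan_description(description: str, model: str = "gpt-3.5-turbo") -> dict:
    prompt = (
        "For the MCP tool description below, report for each rubric component ("
        + ", ".join(RUBRIC)
        + ") whether it is present. Answer only with a JSON object keyed by component "
        "name, where each value has a boolean field 'present' and a string field "
        "'note' explaining anything missing or ambiguous.\n\n"
        "Description:\n" + description
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    report = scan_description("Fetches data from the server.")
    smells = [name for name, entry in report.items() if not entry.get("present")]
    print("Flagged components:", smells)
```

In a CI setting, the same kind of call could gate the publication of new tools, in line with the scanner-in-CI suggestion under Practical Implications below.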

Results & Findings

| Aspect | Original Descriptions | Augmented Descriptions |
| --- | --- | --- |
| Smell prevalence | 97.1 % of tools had ≥1 smell | 0 % (by construction) |
| Purpose clarity | 56 % failed to state purpose | 100 % clear |
| Task success | Baseline median 71 % | Median +5.85 pp (≈77 %) |
| Partial goal completion | 58 % average | +15.12 % |
| Execution steps | Avg. 8 calls per task | +67.46 % (≈13 calls) |
| Regressed cases | — | 16.67 % of tasks performed worse |
| Token overhead | ~1.2 k tokens per interaction | Up to 2.0 k tokens (depends on components) |
| Best‑performing compact combos | — | 4‑component subsets (purpose + inputs + outputs + examples) kept >90 % of the success gain while cutting token use by ~30 % |

Takeaway: Adding missing description pieces generally helps the agent make better tool‑selection decisions, but the extra text consumes more of the LLM’s context window, leading to longer, costlier runs. Not all tasks benefit—some become slower or even less accurate, indicating context sensitivity.
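
As a back-of-envelope illustration of that trade-off, the figures from the table above can be combined into a worst-case cost multiplier. This is only a rough reading of the reported numbers, not a calculation from the paper, and it assumes every interaction carries the full augmented description.

```python
# Rough worst-case cost estimate from the reported figures (approximation only).
baseline_steps = 8               # avg. tool calls per task, original descriptions
augmented_steps = 8 * 1.6746     # +67.46 % execution steps -> ~13.4 calls
baseline_tokens = 1_200          # ~1.2k tokens per interaction, original
augmented_tokens = 2_000         # up to ~2.0k tokens per interaction, augmented

multiplier = (augmented_steps * augmented_tokens) / (baseline_steps * baseline_tokens)
print(f"Worst-case token-cost multiplier: ~{multiplier:.1f}x")  # ~2.8x
```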

Practical Implications

  • For MCP platform owners – Incorporate the provided scanner into CI pipelines to enforce description quality before publishing new tools. This can raise overall ecosystem reliability with minimal manual effort.
  • For developers building agents – When designing custom toolsets, prioritize clear purpose statements and concrete input/output schemas; these give the biggest ROI in success rate vs. token cost (an illustrative before/after description follows this list).
  • Cost‑aware prompt engineers – Use the ablation insights to trim descriptions to the most impactful components, preserving performance while staying within token budgets (critical for paid API usage).
  • Tool marketplaces – Adopt a “badge” system (e.g., “MCP‑Gold”) that signals a tool’s description passes the rubric, helping users quickly identify high‑quality integrations.
  • Automated debugging – The scanner can flag ambiguous descriptions that often cause “tool not found” or “argument mismatch” errors, reducing time spent on trial‑and‑error during agent development.
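
To make the developer-facing advice above concrete, here is a hypothetical before/after tool definition. The tool, its schema, and the wording are invented for illustration and only follow the generic MCP tool shape (name, description, inputSchema), not any specific server from the study.

```python
# Hypothetical MCP tool definition before and after augmentation.
# The tool is invented; the augmented description walks through the six rubric components.

smelly_tool = {
    "name": "get_weather",
    "description": "Gets weather.",  # purpose vague, all other components missing
    "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
}

augmented_tool = {
    "name": "get_weather",
    "description": (
        "Purpose: return the current weather for a single city. "                    # purpose
        "Inputs: 'city' is a city name such as 'Toronto'. "                          # inputs
        "Outputs: JSON with 'temperature_c' and 'conditions'. "                      # outputs
        "Constraints: one city per call; names must be in English. "                 # constraints
        "Example: city='Toronto' -> {'temperature_c': -3, 'conditions': 'snow'}. "   # examples
        "Errors: returns {'error': 'unknown_city'} if the city cannot be resolved."  # error handling
    ),
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. 'Toronto'"}},
        "required": ["city"],
    },
}
```

Per the ablation findings, the purpose, inputs, outputs, and examples pieces alone recover most of the benefit, so the constraints and error-handling sentences are the first candidates to trim under a tight token budget.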

Limitations & Future Work

  • Model dependency – Experiments used GPT‑4; results may differ with smaller or open‑source LLMs that have tighter context windows.
  • Static benchmark – The 30 tasks cover common scenarios but may not reflect domain‑specific workloads (e.g., scientific computing, robotics).
  • Manual augmentation bias – Human‑crafted augmentations could unintentionally favor certain phrasing styles; future work should explore fully automated rewriting.
  • Dynamic tool evolution – MCP servers can add or modify tools at runtime; continuous monitoring and incremental scanning are needed.
  • User studies – The paper does not assess how developers perceive the added description overhead; qualitative feedback could guide UI/UX improvements for tool authoring interfaces.

Bottom line: Clean, well‑structured tool descriptions are a low‑hanging fruit for boosting LLM‑agent effectiveness, but developers must balance the added context cost against the performance gains. The authors provide both a diagnostic scanner and actionable guidelines to help the MCP community move toward more reliable, cost‑efficient AI agents.

Authors

  • Mohammed Mehedi Hasan
  • Hao Li
  • Gopi Krishnan Rajbahadur
  • Bram Adams
  • Ahmed E. Hassan

Paper Information

  • arXiv ID: 2602.14878v1
  • Categories: cs.SE, cs.ET
  • Published: February 16, 2026
