[Paper] Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

Published: February 16, 2026 at 11:10 AM EST

Source: arXiv - 2602.14878v1

Overview

The paper investigates a surprisingly common problem in the emerging Model Context Protocol (MCP) ecosystem: the natural‑language descriptions that tell large language model (LLM) agents how to use external tools are often poorly written, or “smelly.” By systematically measuring these smells across hundreds of tools and testing how fixing them affects agent performance, the authors reveal a clear trade‑off between description quality, success rates, and execution cost.

Key Contributions

  • Empirical survey of MCP tool descriptions – Analyzed 856 tools from 103 MCP servers, the largest study of its kind.
  • Six‑component rubric – Defined a concrete set of description elements (purpose, inputs, outputs, constraints, examples, and error handling) and a scoring system to detect “smells” (a minimal scoring sketch follows this list).
  • Automated FM‑based scanner – Built a language‑model‑driven tool that flags missing or ambiguous components in real time.
  • Impact assessment – Demonstrated that augmenting descriptions raises overall task success by a median of 5.85 percentage points and partial goal completion by 15.12 %, but also adds ~67 % more execution steps.
  • Component ablation study – Showed that compact subsets of the six components can retain most reliability while cutting token usage and cost.
  • Open‑source artifacts – Released the scanner, the annotated dataset, and scripts for reproducibility.
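
To make the rubric concrete, here is a minimal sketch of how a description could be checked against the six components and summed into the 0–6 score described in the Methodology section below. The keyword cues and the `score_description` helper are illustrative placeholders, not the authors' detector.

```python
# Illustrative sketch of the six-component rubric as a checklist.
# Component names follow the paper's rubric; the keyword cues below are
# simplified stand-ins for whatever the actual scanner checks.

CUES = {
    "purpose":        ["use this tool to", "retrieves", "searches", "computes"],
    "inputs":         ["parameter", "argument", "accepts", "expects"],
    "outputs":        ["returns", "response contains", "output is"],
    "constraints":    ["must", "only", "maximum", "requires"],
    "examples":       ["example", "e.g.", "for instance"],
    "error_handling": ["error", "fails", "invalid", "raises"],
}

def score_description(description: str) -> dict:
    """Binary present/absent check per component, summed into a 0-6 quality score."""
    text = description.lower()
    presence = {name: any(cue in text for cue in cues) for name, cues in CUES.items()}
    return {"presence": presence, "score": sum(presence.values())}

if __name__ == "__main__":
    print(score_description("Fetches data."))  # low score: almost every component missing
```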

Methodology

  1. Data collection – Crawled public MCP servers (e.g., OpenAI Functions, LangChain, LlamaIndex) to gather tool definitions and their natural‑language descriptions.
  2. Rubric design – Synthesized prior work on API documentation and prompt engineering to identify six essential description components. Each component received a binary “present/absent” score, summed to a 0‑6 quality rating.
  3. Smell detection – Trained a small LLM (GPT‑3.5) to classify each description according to the rubric, turning the rubric into an automated scanner (a sketch of this idea follows this list).
  4. Experimental setup – Ran a suite of 30 benchmark tasks (question answering, data retrieval, code generation) using a baseline FM‑agent (GPT‑4) against the original tool set, then against an “augmented” set where missing components were added manually or via the scanner.
  5. Metrics – Measured task success (binary), partial goal completion (percentage of sub‑tasks achieved), execution steps (number of tool calls), and token cost (proxy for monetary cost).
  6. Ablation – Systematically removed individual components from the augmented descriptions to see which were most cost‑effective.
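
As a rough sketch of how an FM-based scanner in this spirit could be wired up, the snippet below prompts a chat model to label each rubric component as present or missing, assuming the OpenAI Python client; the prompt wording, output format, and model choice are guesses for illustration, not the paper's actual scanner.

```python
# Hypothetical FM-based "smell" scanner: ask a chat model to label each rubric
# component as present or missing in a tool description. Prompt, model, and
# output schema are illustrative; robust output parsing is omitted.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = ["purpose", "inputs", "outputs", "constraints", "examples", "error_handling"]

def scan_description(description: str, model: str = "gpt-3.5-turbo") -> dict:
    prompt = (
        "For the MCP tool description below, report for each rubric component ("
        + ", ".join(RUBRIC)
        + ") whether it is present. Answer only with a JSON object keyed by component "
        "name, where each value has a boolean field 'present' and a string field "
        "'note' explaining anything missing or ambiguous.\n\n"
        "Description:\n" + description
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    report = scan_description("Fetches data from the server.")
    smells = [name for name, entry in report.items() if not entry.get("present")]
    print("Flagged components:", smells)
```

In a CI setting, the same kind of call could gate the publication of new tools, in line with the scanner-in-CI suggestion under Practical Implications below.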

Results & Findings

| Aspect | Original Descriptions | Augmented Descriptions |
| --- | --- | --- |
| Smell prevalence | 97.1 % of tools had ≥1 smell | 0 % (by construction) |
| Purpose clarity | 56 % failed to state purpose | 100 % clear |
| Task success | Baseline median 71 % | Median +5.85 pp (≈77 %) |
| Partial goal completion | 58 % average | +15.12 % |
| Execution steps | Avg. 8 calls per task | +67.46 % (≈13 calls) |
| Regressed cases | — | 16.67 % of tasks performed worse |
| Token overhead | ~1.2 k tokens per interaction | Up to 2.0 k tokens (depends on components) |
| Best‑performing compact combos | — | 4‑component subsets (purpose + inputs + outputs + examples) kept >90 % of the success gain while cutting token use by ~30 % |

Takeaway: Adding missing description pieces generally helps the agent make better tool‑selection decisions, but the extra text consumes more of the LLM’s context window, leading to longer, costlier runs. Not all tasks benefit—some become slower or even less accurate, indicating context sensitivity.
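
As a back-of-envelope illustration of that trade-off, the figures from the table above can be combined into a worst-case cost multiplier. This is only a rough reading of the reported numbers, not a calculation from the paper, and it assumes every interaction carries the full augmented description.

```python
# Rough worst-case cost estimate from the reported figures (approximation only).
baseline_steps = 8               # avg. tool calls per task, original descriptions
augmented_steps = 8 * 1.6746     # +67.46 % execution steps -> ~13.4 calls
baseline_tokens = 1_200          # ~1.2k tokens per interaction, original
augmented_tokens = 2_000         # up to ~2.0k tokens per interaction, augmented

multiplier = (augmented_steps * augmented_tokens) / (baseline_steps * baseline_tokens)
print(f"Worst-case token-cost multiplier: ~{multiplier:.1f}x")  # ~2.8x
```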

Practical Implications

  • For MCP platform owners – Incorporate the provided scanner into CI pipelines to enforce description quality before publishing new tools. This can raise overall ecosystem reliability with minimal manual effort.
  • For developers building agents – When designing custom toolsets, prioritize clear purpose statements and concrete input/output schemas; these give the biggest ROI in success rate vs. token cost (an illustrative before/after description follows this list).
  • Cost‑aware prompt engineers – Use the ablation insights to trim descriptions to the most impactful components, preserving performance while staying within token budgets (critical for paid API usage).
  • Tool marketplaces – Adopt a “badge” system (e.g., “MCP‑Gold”) that signals a tool’s description passes the rubric, helping users quickly identify high‑quality integrations.
  • Automated debugging – The scanner can flag ambiguous descriptions that often cause “tool not found” or “argument mismatch” errors, reducing time spent on trial‑and‑error during agent development.
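
To make the developer-facing advice above concrete, here is a hypothetical before/after tool definition. The tool, its schema, and the wording are invented for illustration and only follow the generic MCP tool shape (name, description, inputSchema), not any specific server from the study.

```python
# Hypothetical MCP tool definition before and after augmentation.
# The tool is invented; the augmented description walks through the six rubric components.

smelly_tool = {
    "name": "get_weather",
    "description": "Gets weather.",  # purpose vague, all other components missing
    "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
}

augmented_tool = {
    "name": "get_weather",
    "description": (
        "Purpose: return the current weather for a single city. "                    # purpose
        "Inputs: 'city' is a city name such as 'Toronto'. "                          # inputs
        "Outputs: JSON with 'temperature_c' and 'conditions'. "                      # outputs
        "Constraints: one city per call; names must be in English. "                 # constraints
        "Example: city='Toronto' -> {'temperature_c': -3, 'conditions': 'snow'}. "   # examples
        "Errors: returns {'error': 'unknown_city'} if the city cannot be resolved."  # error handling
    ),
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. 'Toronto'"}},
        "required": ["city"],
    },
}
```

Per the ablation findings, the purpose, inputs, outputs, and examples pieces alone recover most of the benefit, so the constraints and error-handling sentences are the first candidates to trim under a tight token budget.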

Limitations & Future Work

  • Model dependency – Experiments used GPT‑4; results may differ with smaller or open‑source LLMs that have tighter context windows.
  • Static benchmark – The 30 tasks cover common scenarios but may not reflect domain‑specific workloads (e.g., scientific computing, robotics).
  • Manual augmentation bias – Human‑crafted augmentations could unintentionally favor certain phrasing styles; future work should explore fully automated rewriting.
  • Dynamic tool evolution – MCP servers can add or modify tools at runtime; continuous monitoring and incremental scanning are needed.
  • User studies – The paper does not assess how developers perceive the added description overhead; qualitative feedback could guide UI/UX improvements for tool authoring interfaces.

Bottom line: Clean, well‑structured tool descriptions are a low‑hanging fruit for boosting LLM‑agent effectiveness, but developers must balance the added context cost against the performance gains. The authors provide both a diagnostic scanner and actionable guidelines to help the MCP community move toward more reliable, cost‑efficient AI agents.

Authors

  • Mohammed Mehedi Hasan
  • Hao Li
  • Gopi Krishnan Rajbahadur
  • Bram Adams
  • Ahmed E. Hassan

Paper Information

  • arXiv ID: 2602.14878v1
  • Categories: cs.SE, cs.ET
  • Published: February 16, 2026
