[Paper] ToolRosetta: Bridging Open-Source Repositories and Large Language Model Agents through Automated Tool Standardization

Published: March 10, 2026 at 03:19 AM EDT

Source: arXiv - 2603.09290v1

Overview

ToolRosetta tackles a pain point that many developers know all too well: turning the massive, heterogeneous pool of open‑source code into reliable, plug‑and‑play services that can be called by large language model (LLM) agents. By automatically converting repositories and APIs into Model Context Protocol (MCP)‑compatible tools, the framework lets LLM‑driven agents assemble and run toolchains with almost no human curation, dramatically lowering the effort required to reuse existing code.

Key Contributions

  • Fully automated tool standardization – Transforms arbitrary open‑source projects into MCP services without manual wrappers.
  • End‑to‑end task planning – Given a natural‑language request, the system discovers relevant code, builds a toolchain, and executes it.
  • Built‑in security inspection – Static analysis and sandboxing guard against malicious or unsafe code execution.
  • Scalable evaluation – Demonstrates automatic standardization of thousands of tools across scientific, data‑processing, and engineering domains.
  • Performance boost for LLM agents – Shows consistent improvements over commercial LLMs and prior agent frameworks when leveraging the generated tools.

Methodology

  1. Repository Mining – ToolRosetta crawls popular open‑source hosting platforms (e.g., GitHub, GitLab) and extracts candidate projects based on relevance to the user’s query.
  2. Interface Extraction – Using static analysis and lightweight type inference, the system identifies entry points (functions, CLI commands, REST endpoints) that can be exposed as services.
  3. MCP Wrapper Generation – For each entry point, a thin adapter is auto‑generated that conforms to the Model Context Protocol, handling input validation, serialization, and response formatting.
  4. Security Layer – Before deployment, the code undergoes sandboxed execution, dependency vetting, and a set of rule‑based checks (e.g., network access, file system writes). Detected risks are either mitigated automatically or flagged for human review.
  5. Task‑Driven Planning – An LLM receives the user’s natural‑language task, queries the internal tool registry, and composes a sequence of MCP calls (a “toolchain”) that can accomplish the goal. The plan is then executed step‑by‑step, with the LLM interpreting intermediate results and adjusting the plan if needed.
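To make step 3 concrete, here is a minimal sketch of what an auto-generated MCP-style adapter could look like: it validates JSON input against a tiny schema, invokes the wrapped entry point, and formats the response. The function names (`resample_series`, `make_mcp_adapter`) and the schema shape are illustrative, not the paper's actual code; the response layout only loosely follows MCP tool-result conventions.

```python
import json
from typing import Any, Callable

# Hypothetical entry point mined from a repository (illustrative).
def resample_series(values: list, factor: int) -> list:
    """Keep every `factor`-th sample of a series."""
    return values[::factor]

def make_mcp_adapter(fn: Callable, schema: dict) -> Callable:
    """Wrap `fn` in a thin adapter: validate the JSON request against a
    minimal schema, call the tool, and serialize the result."""
    def adapter(request_json: str) -> str:
        args = json.loads(request_json)
        for name, expected_type in schema.items():
            if name not in args:
                return json.dumps({"isError": True,
                                   "content": f"missing argument: {name}"})
            if not isinstance(args[name], expected_type):
                return json.dumps({"isError": True,
                                   "content": f"bad type for argument: {name}"})
        result = fn(**args)
        # MCP tool results carry a list of content items; text-only here.
        return json.dumps({"isError": False,
                           "content": [{"type": "text",
                                        "text": json.dumps(result)}]})
    return adapter

tool = make_mcp_adapter(resample_series, {"values": list, "factor": int})
print(tool(json.dumps({"values": [1, 2, 3, 4, 5, 6], "factor": 2})))
```

In the real system this adapter would be generated per entry point from the interfaces extracted in step 2, so the LLM agent only ever sees a uniform, schema-described calling convention.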

The whole pipeline is orchestrated by a lightweight orchestration engine that can spin up Docker containers or serverless functions on demand, making the generated services instantly callable.
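The planning loop of step 5 can be sketched as a simple plan-execute-observe cycle. Everything here is a stand-in: `REGISTRY`, `plan_next_step`, and the tool names are assumptions for illustration; a real planner would prompt the LLM with the task, the registry's tool descriptions, and the intermediate results so far.

```python
# Stand-in tool registry: in the real system these would be MCP services.
REGISTRY = {
    "load_csv":  lambda path: [[1.0, 2.0], [3.0, 4.0]],
    "mean_cols": lambda rows: [sum(col) / len(rows) for col in zip(*rows)],
}

def plan_next_step(task, history):
    """Stand-in for the LLM planner: choose the next tool call, or None
    when the goal is reached. Hard-coded here for illustration."""
    if not history:
        return ("load_csv", {"path": "data.csv"})
    if len(history) == 1:
        return ("mean_cols", {"rows": history[-1][1]})
    return None  # goal reached

def run_toolchain(task: str):
    """Execute the plan step by step, feeding each intermediate result
    back to the planner so it can adjust the remaining plan."""
    history = []
    while (step := plan_next_step(task, history)) is not None:
        name, args = step
        result = REGISTRY[name](**args)   # one MCP call per iteration
        history.append((name, result))
    return history[-1][1] if history else None

print(run_toolchain("average each column of data.csv"))  # → [2.0, 3.0]
```

The observe-and-replan structure is what lets the agent recover when an intermediate result is not what it expected, rather than committing to a fixed pipeline up front.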

Results & Findings

| Metric | Baseline (manual tool curation) | ToolRosetta (auto) |
| --- | --- | --- |
| Usable tools discovered per domain | ~30–50 | ≈1,200 (≈25× increase) |
| Human effort to make a tool MCP-ready | 2–4 person-hours per tool | <5 min (automated) |
| End-to-end task success rate (complex scientific pipelines) | 62% | 81% |
| Average task completion time | 45 s | 31 s (parallel tool execution) |
| Security incidents (sandbox violations) | 0 (manual vetting) | 0 (automated checks caught all issues) |

Key takeaways: the automated pipeline not only scales the tool inventory dramatically but also translates into measurable gains in task success and latency. When the same tasks were handed to commercial LLM agents (e.g., GPT‑4‑based assistants) without ToolRosetta’s tool augmentation, success rates dropped by 12–18 %.

Practical Implications

  • Rapid prototyping – Developers can ask an LLM to “run a climate‑model calibration” and instantly receive a ready‑made toolchain that pulls the latest open‑source climate libraries, configures them, and executes the workflow.
  • Enterprise knowledge bases – Companies can ingest internal codebases, automatically expose them as MCP services, and let their internal LLM assistants orchestrate them without writing custom wrappers.
  • Reduced DevOps overhead – The sandboxed deployment model eliminates the need for separate CI pipelines for each third‑party tool; the framework handles containerization on the fly.
  • Security‑first integration – By embedding static analysis and runtime sandboxing, organizations can safely expose community code to production‑grade LLM agents, mitigating supply‑chain risks.
  • Ecosystem growth – ToolRosetta can serve as a “storefront” where any open‑source project becomes instantly discoverable and callable by AI agents, fostering a new marketplace of AI‑driven tool services.
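The security-first integration above rests on rule-based static checks like those in the paper's security layer. As a toy illustration, one such rule might flag imports that grant network or process access before a mined tool is deployed; the denylist and function name below are assumptions, not the paper's actual rules.

```python
import ast

# Illustrative denylist: modules that grant network, process, or
# native-code access (an assumption, not the paper's rule set).
RISKY_MODULES = {"socket", "subprocess", "ctypes", "urllib", "requests"}

def flag_risky_imports(source: str) -> list:
    """Statically scan Python source and return risky top-level
    module names it imports, in order of appearance."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module.split(".")[0]]
        else:
            continue
        flagged += [n for n in names if n in RISKY_MODULES]
    return flagged

code = "import socket\nfrom urllib import request\nimport math\n"
print(flag_risky_imports(code))  # → ['socket', 'urllib']
```

A real inspection layer would combine many such rules with dependency vetting and sandboxed execution, as described in the methodology; static import scanning alone is easy to evade and serves only as a first filter.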

Limitations & Future Work

  • Dependency complexity – Projects with heavy native dependencies (e.g., GPU‑accelerated libraries) still require manual environment tuning; the current sandbox cannot guarantee reproducibility for all such cases.
  • Semantic understanding – The static analysis may miss nuanced runtime requirements (e.g., specific configuration files), leading to occasional execution failures that need human debugging.
  • Scalability of security checks – While effective for the evaluated corpus, the rule‑based inspection may need to evolve to keep pace with novel attack vectors in larger, more diverse codebases.
  • Future directions – The authors plan to integrate dynamic profiling to better infer runtime constraints, extend support for container orchestration platforms (Kubernetes, Cloud Run), and explore reinforcement‑learning‑based planning to further improve toolchain selection.

ToolRosetta demonstrates that the bottleneck of “finding‑and‑wrapping” open‑source tools is solvable at scale, opening the door for LLM agents that can truly leverage the world’s code without a mountain of manual engineering.

Authors

  • Shimin Di
  • Xujie Yuan
  • Hanghui Guo
  • Chaoqian Ouyang
  • Zhangze Chen
  • Ling Yue
  • Libin Zheng
  • Jia Zhu
  • Shaowu Pan
  • Jian Yin
  • Min-Ling Zhang
  • Yong Rui

Paper Information

  • arXiv ID: 2603.09290v1
  • Categories: cs.SE, cs.CE, cs.MA
  • Published: March 10, 2026
  • PDF: Download PDF