[Paper] Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Source: arXiv - 2603.02176v1
Overview
The paper introduces AgentSkillOS, a systematic framework for organizing, selecting, and orchestrating thousands of LLM‑driven “agent skills” (think plug‑in functions) at the scale of an entire ecosystem. By structuring skills in a hierarchical capability tree and chaining them together with directed‑acyclic‑graph (DAG) pipelines, the authors demonstrate that large‑scale skill collections can be used far more effectively than the ad‑hoc, flat “call‑any‑skill” approach that many current agents employ.
Key Contributions
- Capability Tree: A recursive, node‑level categorization that turns an unstructured skill pool into a searchable tree, enabling fast discovery and near‑oracle retrieval.
- DAG‑Based Orchestration: A pipeline model that composes multiple skills in a directed‑acyclic graph, allowing parallelism, data flow control, and conditional branching.
- AgentSkillOS Benchmark: A new suite of 30 “artifact‑rich” tasks spanning data computation, document generation, motion video, visual design, and web interaction, together with an LLM‑powered pairwise evaluation pipeline (Bradley‑Terry aggregation).
- Scalable Experiments: Empirical validation on ecosystems ranging from 200 to 200 K skills, showing that tree retrieval approximates an oracle selector and that DAG orchestration consistently outperforms flat skill invocation.
- Open‑Source Release: Full code, benchmark data, and evaluation scripts are publicly available, encouraging reproducibility and community extensions.
Methodology
Skill Management (Stage 1)
- Each skill is annotated with a set of capability tags (e.g., image‑generation, SQL‑query, browser‑automation).
- A recursive clustering algorithm builds a capability tree where internal nodes represent broader concepts and leaves are individual skills.
- Retrieval works by traversing the tree from root to leaf, pruning branches that do not match the task’s semantic query, which yields a compact candidate set.
Task Solving (Stage 2)
- Given a user request, a lightweight LLM (the “orchestrator”) first selects a subset of relevant skills via the capability tree.
- The orchestrator then constructs a DAG pipeline: each node is a skill, edges encode data dependencies (e.g., output of a data‑cleaning skill feeds a visualization skill).
- The DAG is executed topologically, allowing parallel execution where possible and handling failures via fallback branches.
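A minimal sketch of that topological execution, using Kahn's algorithm: each skill runs once all of its upstream dependencies have produced output. The function names, the `deps` encoding, and the toy pipeline are assumptions for illustration; a real orchestrator would additionally run ready nodes in parallel and route failures to fallback branches.

```python
from collections import deque

def run_dag(skills: dict, deps: dict) -> dict:
    """Execute a skill DAG in topological order (Kahn's algorithm).

    `skills` maps a name to a callable taking its dependencies' outputs;
    `deps` maps a name to the list of upstream skill names it consumes.
    """
    indegree = {name: len(deps.get(name, [])) for name in skills}
    downstream = {name: [] for name in skills}
    for name, ups in deps.items():
        for up in ups:
            downstream[up].append(name)

    ready = deque(n for n, d in indegree.items() if d == 0)
    outputs = {}
    while ready:
        name = ready.popleft()
        inputs = [outputs[u] for u in deps.get(name, [])]
        outputs[name] = skills[name](*inputs)
        for down in downstream[name]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)   # all dependencies satisfied
    return outputs

# Toy pipeline mirroring "cleaning feeds visualization": clean -> (stats, chart).
outputs = run_dag(
    skills={
        "clean": lambda: [3, 1, 2],
        "stats": lambda rows: sum(rows) / len(rows),
        "chart": lambda rows: f"chart({sorted(rows)})",
    },
    deps={"stats": ["clean"], "chart": ["clean"]},
)
print(outputs["stats"])  # → 2.0
print(outputs["chart"])  # → chart([1, 2, 3])
```

Because `stats` and `chart` both become ready after `clean` finishes, they are exactly the nodes a parallel executor could dispatch concurrently.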
Benchmark & Evaluation
- 30 tasks were curated to require multiple, heterogeneous artifacts (tables, images, videos, web pages).
- For each task, three systems were compared: (a) Oracle (perfect skill selection), (b) Tree‑retrieval + DAG, and (c) Flat invocation (no structure).
- Outputs were judged pairwise by a strong LLM (GPT‑4‑Turbo) and scores were aggregated with a Bradley‑Terry model to produce a single quality metric per system.
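The Bradley-Terry aggregation step can be sketched as follows: from a matrix of pairwise win counts, iteratively fit a latent strength per system (here via the standard MM/Zermelo update). The win counts below are invented for illustration, and the paper's exact fitting procedure is not specified here.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times system i beat system j.
    Uses the MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins for system i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]               # normalize (identifiability)
    return p

# Hypothetical judgments: Oracle > Tree-retrieval + DAG > Flat invocation.
wins = [
    [0, 6, 9],   # Oracle
    [4, 0, 8],   # Tree-retrieval + DAG
    [1, 2, 0],   # Flat invocation
]
scores = bradley_terry(wins)
assert scores[0] > scores[1] > scores[2]
```

The normalized strengths give the single per-system quality metric the comparison table reports.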
Results & Findings
| Ecosystem Size | Oracle vs. Tree Retrieval | Flat vs. DAG (same skill set) |
|---|---|---|
| 200 skills | 92 % of oracle quality | +18 % quality gain |
| 2 K skills | 89 % of oracle quality | +22 % quality gain |
| 200 K skills | 85 % of oracle quality | +27 % quality gain |
- Tree Retrieval consistently finds near‑optimal skill subsets, even as the catalog grows by three orders of magnitude.
- DAG Orchestration yields substantially higher output quality than flat, sequential skill calls, confirming that structured composition unlocks latent capabilities.
- The performance gap widens with larger skill pools, indicating that naive flat invocation becomes increasingly brittle at scale.
Practical Implications
- Developer Tooling: Building a plug‑in marketplace (e.g., for Claude, ChatGPT, or internal LLM assistants) can adopt the capability‑tree index to provide instant, context‑aware skill suggestions.
- Workflow Automation: Enterprises can define complex pipelines (data ETL → report → dashboard) as DAGs, letting the LLM automatically wire the right skills together without manual scripting.
- Scalable AI Assistants: Products that aim to support “anything‑as‑a‑skill” (e.g., AI‑powered IDEs, customer‑support bots) can maintain performance as the skill catalog balloons, avoiding the degradation that flat search‑and‑call invocation suffers at scale.
- Benchmarking Standards: The AgentSkillOS benchmark offers a reusable yardstick for future research on multi‑skill orchestration, encouraging more realistic, artifact‑centric evaluations.
Limitations & Future Work
- Skill Metadata Quality: The tree’s effectiveness hinges on accurate capability tags; noisy or missing annotations can degrade retrieval.
- Orchestrator LLM Size: Experiments used a strong LLM for DAG construction; lighter models may struggle with complex dependency reasoning.
- Dynamic Skills: The current framework assumes a relatively static skill set; handling frequent additions/removals in real‑time remains an open challenge.
- User‑Feedback Loop: Future work could incorporate reinforcement signals from end‑users to refine both the tree structure and DAG generation over time.
AgentSkillOS offers a concrete blueprint for turning a sprawling sea of LLM plug‑ins into a navigable, composable ecosystem—an essential step toward truly scalable AI assistants.
Authors
- Hao Li
- Chunjiang Mu
- Jianhao Chen
- Siyue Ren
- Zhiyao Cui
- Yiqun Zhang
- Lei Bai
- Shuyue Hu
Paper Information
- arXiv ID: 2603.02176v1
- Categories: cs.CL
- Published: March 2, 2026