[Paper] Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Source: arXiv - 2603.02176v1
Overview
The paper introduces AgentSkillOS, a systematic framework for organizing, selecting, and orchestrating thousands of LLM‑driven “agent skills” (think plug‑in functions) at the scale of an entire ecosystem. By structuring skills in a hierarchical capability tree and chaining them together with directed‑acyclic‑graph (DAG) pipelines, the authors demonstrate that large‑scale skill collections can be used far more effectively than the ad‑hoc, flat “call‑any‑skill” approach that many current agents employ.
Key Contributions
- Capability Tree: A recursive, node‑level categorization that turns an unstructured skill pool into a searchable tree, enabling fast discovery and near‑oracle retrieval.
- DAG‑Based Orchestration: A pipeline model that composes multiple skills in a directed‑acyclic graph, allowing parallelism, data flow control, and conditional branching.
- AgentSkillOS Benchmark: A new suite of 30 “artifact‑rich” tasks spanning data computation, document generation, motion video, visual design, and web interaction, together with an LLM‑powered pairwise evaluation pipeline (Bradley‑Terry aggregation).
- Scalable Experiments: Empirical validation on ecosystems ranging from 200 to 200 K skills, showing that tree retrieval approximates an oracle selector and that DAG orchestration consistently outperforms flat skill invocation.
- Open‑Source Release: Full code, benchmark data, and evaluation scripts are publicly available, encouraging reproducibility and community extensions.
Methodology
Skill Management (Stage 1)
- Each skill is annotated with a set of capability tags (e.g., image‑generation, SQL‑query, browser‑automation).
- A recursive clustering algorithm builds a capability tree where internal nodes represent broader concepts and leaves are individual skills.
- Retrieval works by traversing the tree from root to leaf, pruning branches that do not match the task’s semantic query, which yields a compact candidate set.
Task Solving (Stage 2)
- Given a user request, a lightweight LLM (the “orchestrator”) first selects a subset of relevant skills via the capability tree.
- The orchestrator then constructs a DAG pipeline: each node is a skill, edges encode data dependencies (e.g., output of a data‑cleaning skill feeds a visualization skill).
- The DAG is executed topologically, allowing parallel execution where possible and handling failures via fallback branches.
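A minimal sketch of that topological execution, using Kahn's algorithm: each skill runs once all of its upstream dependencies have produced output. The function names, the `deps` encoding, and the toy pipeline are assumptions for illustration; a real orchestrator would additionally run ready nodes in parallel and route failures to fallback branches.

```python
from collections import deque

def run_dag(skills: dict, deps: dict) -> dict:
    """Execute a skill DAG in topological order (Kahn's algorithm).

    `skills` maps a name to a callable taking its dependencies' outputs;
    `deps` maps a name to the list of upstream skill names it consumes.
    """
    indegree = {name: len(deps.get(name, [])) for name in skills}
    downstream = {name: [] for name in skills}
    for name, ups in deps.items():
        for up in ups:
            downstream[up].append(name)

    ready = deque(n for n, d in indegree.items() if d == 0)
    outputs = {}
    while ready:
        name = ready.popleft()
        inputs = [outputs[u] for u in deps.get(name, [])]
        outputs[name] = skills[name](*inputs)
        for down in downstream[name]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)   # all dependencies satisfied
    return outputs

# Toy pipeline mirroring "cleaning feeds visualization": clean -> (stats, chart).
outputs = run_dag(
    skills={
        "clean": lambda: [3, 1, 2],
        "stats": lambda rows: sum(rows) / len(rows),
        "chart": lambda rows: f"chart({sorted(rows)})",
    },
    deps={"stats": ["clean"], "chart": ["clean"]},
)
print(outputs["stats"])  # → 2.0
print(outputs["chart"])  # → chart([1, 2, 3])
```

Because `stats` and `chart` both become ready after `clean` finishes, they are exactly the nodes a parallel executor could dispatch concurrently.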
Benchmark & Evaluation
- 30 tasks were curated to require multiple, heterogeneous artifacts (tables, images, videos, web pages).
- For each task, three systems were compared: (a) Oracle (perfect skill selection), (b) Tree‑retrieval + DAG, and (c) Flat invocation (no structure).
- Outputs were judged pairwise by a strong LLM (GPT‑4‑Turbo) and scores were aggregated with a Bradley‑Terry model to produce a single quality metric per system.
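The Bradley-Terry aggregation step can be sketched as follows: from a matrix of pairwise win counts, iteratively fit a latent strength per system (here via the standard MM/Zermelo update). The win counts below are invented for illustration, and the paper's exact fitting procedure is not specified here.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times system i beat system j.
    Uses the MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins for system i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]               # normalize (identifiability)
    return p

# Hypothetical judgments: Oracle > Tree-retrieval + DAG > Flat invocation.
wins = [
    [0, 6, 9],   # Oracle
    [4, 0, 8],   # Tree-retrieval + DAG
    [1, 2, 0],   # Flat invocation
]
scores = bradley_terry(wins)
assert scores[0] > scores[1] > scores[2]
```

The normalized strengths give the single per-system quality metric the comparison table reports.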
Results & Findings
| Ecosystem Size | Oracle vs. Tree Retrieval | Flat vs. DAG (same skill set) |
|---|---|---|
| 200 skills | 92 % of oracle quality | +18 % quality gain |
| 2 K skills | 89 % of oracle quality | +22 % quality gain |
| 200 K skills | 85 % of oracle quality | +27 % quality gain |
- Tree Retrieval consistently finds near‑optimal skill subsets, even as the catalog grows by three orders of magnitude.
- DAG Orchestration yields substantially higher output quality than flat, sequential skill calls, confirming that structured composition unlocks latent capabilities.
- The performance gap widens with larger skill pools, indicating that naive flat invocation becomes increasingly brittle at scale.
Practical Implications
- Developer Tooling: Building a plug‑in marketplace (e.g., for Claude, ChatGPT, or internal LLM assistants) can adopt the capability‑tree index to provide instant, context‑aware skill suggestions.
- Workflow Automation: Enterprises can define complex pipelines (data ETL → report → dashboard) as DAGs, letting the LLM automatically wire the right skills together without manual scripting.
- Scalable AI Assistants: Products that aim to support “anything‑as‑a‑skill” (e.g., AI‑powered IDEs, customer‑support bots) can maintain performance as the skill catalog balloons, avoiding the degradation that flat search‑and‑call invocation suffers at scale.
- Benchmarking Standards: The AgentSkillOS benchmark offers a reusable yardstick for future research on multi‑skill orchestration, encouraging more realistic, artifact‑centric evaluations.
Limitations & Future Work
- Skill Metadata Quality: The tree’s effectiveness hinges on accurate capability tags; noisy or missing annotations can degrade retrieval.
- Orchestrator LLM Size: Experiments used a strong LLM for DAG construction; lighter models may struggle with complex dependency reasoning.
- Dynamic Skills: The current framework assumes a relatively static skill set; handling frequent additions/removals in real‑time remains an open challenge.
- User‑Feedback Loop: Future work could incorporate reinforcement signals from end‑users to refine both the tree structure and DAG generation over time.
AgentSkillOS offers a concrete blueprint for turning a sprawling sea of LLM plug‑ins into a navigable, composable ecosystem—an essential step toward truly scalable AI assistants.
Authors
- Hao Li
- Chunjiang Mu
- Jianhao Chen
- Siyue Ren
- Zhiyao Cui
- Yiqun Zhang
- Lei Bai
- Shuyue Hu
Paper Information
- arXiv ID: 2603.02176v1
- Categories: cs.CL
- Published: March 2, 2026