[Paper] Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments
Source: arXiv - 2602.16653v1
Overview
The paper “Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments” investigates whether the “Agent Skill” paradigm—already popular in large‑model ecosystems such as GitHub Copilot, LangChain, and OpenAI—can deliver the same gains when applied to small language models (SLMs).
This question is critical for enterprises that must:
- Keep data on‑premise
- Control costs
- Avoid vendor lock‑in
Key Contributions
- Formal definition of the Agent Skill process, turning the loosely‑described “skill selection + execution” loop into a mathematically grounded framework.
- Systematic benchmarking of models ranging from ~1 B to ~80 B parameters across three realistic tasks: two open‑source benchmarks and a proprietary insurance‑claims dataset.
- Empirical evidence that mid‑sized SLMs (≈12 B–30 B) gain the most from Agent Skills, while tiny models (<2 B) struggle with reliable skill selection.
- Demonstration that code‑specialized 80 B models match closed‑source baselines (e.g., GPT‑4) while using less GPU memory and offering better throughput.
- Actionable deployment guidelines for industry teams looking to replace costly API calls with on‑premise SLMs powered by the Agent Skill framework.
Methodology
Agent Skill Formalism – The authors model an agent as a tuple ⟨S, π, R⟩ where:
- S – set of reusable skills (e.g., “search database”, “run Python code”).
- π – policy (implemented by the language model) that selects a skill given the current context.
- R – reward signal (task success, reduced hallucination, latency).
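The ⟨S, π, R⟩ loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the skill names, the `finish` sentinel, and the toy policy are all assumptions introduced here to make the select-execute-observe cycle concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical skill signature: a skill takes the current context
# and returns an observation string.
Skill = Callable[[str], str]

@dataclass
class AgentSkillLoop:
    skills: Dict[str, Skill]                    # S: reusable skills, keyed by name
    policy: Callable[[str, List[str]], str]     # pi: picks a skill name from context

    def run(self, task: str, max_steps: int = 5) -> str:
        context = task
        for _ in range(max_steps):
            name = self.policy(context, list(self.skills))
            if name == "finish":                # sentinel: policy decides the task is done
                break
            observation = self.skills[name](context)
            context = f"{context}\n[{name}] {observation}"
        return context

# Toy skills and a hand-written policy, purely for illustration.
skills = {
    "search_database": lambda ctx: "3 matching claims found",
    "run_python": lambda ctx: "validation passed",
}

def toy_policy(context: str, available: List[str]) -> str:
    if "search_database" not in context:
        return "search_database"
    if "run_python" not in context:
        return "run_python"
    return "finish"

agent = AgentSkillLoop(skills=skills, policy=toy_policy)
trace = agent.run("Route claim #42")
```

In the paper's setting, π is implemented by the language model itself (prompted with skill descriptions), and R scores the resulting trace by task success, hallucination rate, and latency.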
Model Spectrum – Six open‑source models were evaluated:
- Tiny (1.3 B)
- Small (2.7 B)
- Mid (12 B, 30 B)
- Large code‑specialized (80 B)
Both general‑purpose and code‑oriented variants were included to test specialization effects.
Task Suite
- Open‑source Task 1: Multi‑step reasoning over a knowledge base (e.g., “retrieve‑then‑summarize”).
- Open‑source Task 2: Code generation with iterative debugging.
- Industrial Task: End‑to‑end processing of an insurance‑claims dataset (extract, validate, and route claims).
Evaluation Metrics – Accuracy, hallucination rate, skill‑selection latency, GPU memory consumption, and overall throughput (claims processed per hour).
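The throughput and cost metrics are straightforward ratios. The helpers below show how they compose; the sample numbers are illustrative assumptions, not figures from the paper.

```python
def throughput_per_hour(claims_processed: int, elapsed_seconds: float) -> float:
    """Overall throughput: claims processed per hour."""
    return claims_processed / (elapsed_seconds / 3600.0)

def gpu_efficiency(claims_processed: int, gpu_hours: float,
                   dollars_per_gpu_hour: float) -> float:
    """GPU efficiency as defined in the results table footnote:
    claims processed per dollar of GPU time."""
    return claims_processed / (gpu_hours * dollars_per_gpu_hour)

# Illustrative numbers only.
tput = throughput_per_hour(1200, elapsed_seconds=1800)               # -> 2400.0 claims/hour
eff = gpu_efficiency(1200, gpu_hours=0.5, dollars_per_gpu_hour=2.0)  # -> 1200.0 claims/$
```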
Baseline Comparisons –
- Closed‑source agents (GPT‑4, Claude) accessed via API served as performance upper bounds.
- A naïve “single‑prompt” approach served as a lower bound.
Results & Findings
| Model Size | Skill‑Selection Success | End‑Task Accuracy | GPU Efficiency* |
|---|---|---|---|
| 1.3 B (tiny) | ~45 % (frequent mis‑selection) | 58 % | High (fits on 8 GB) |
| 2.7 B (small) | ~62 % | 66 % | Moderate |
| 12 B (mid) | 84 % | 78 % | Good |
| 30 B (mid‑large) | 88 % | 81 % | Good |
| 80 B (code‑specialized) | 92 % | 84 % (≈ closed‑source) | Better (≈30 % less memory than GPT‑4) |
*GPU efficiency measured as “claims processed per dollar of GPU time”.
- Tiny models often fail to pick the right skill, causing cascading errors.
- Mid‑sized SLMs gain a 15‑20 % boost in task accuracy over a single‑prompt baseline, with hallucinations dropping by ~30 %.
- Code‑specialized 80 B models reach parity with proprietary agents while using ~30 % less GPU memory, making them attractive for on‑prem deployments.
- Across all sizes, the Agent Skill loop reduces the number of required LLM calls per task by 2–3×, directly cutting latency and cost.
Practical Implications
Cost‑Effective Automation – Companies can replace expensive API calls with on‑premise SLMs (12 B–30 B) while retaining high accuracy, especially for multi‑step workflows such as claim triage, ticket routing, or data enrichment.
Security & Compliance – Keeping data in‑house eliminates the need to transmit sensitive information to external services, a key requirement for finance, healthcare, and insurance.
GPU Utilization – The demonstrated memory savings allow a single 80 B code model to run on one A100 GPU, whereas GPT‑4 would require multiple GPUs or a hosted service. This enables edge or on‑premise deployments in regulated environments.
Skill Library Reuse – The formal skill abstraction encourages teams to build reusable modules (e.g., “SQL query generator”, “PDF parser”) that any compatible SLM can invoke, accelerating internal tooling development.
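A reusable skill library can be as simple as a registry that pairs each callable with a natural-language description the model sees in its prompt. The sketch below is a hypothetical design under that assumption; the registry API and the `sql_query_generator` skill are invented for illustration.

```python
from typing import Callable, Dict

class SkillRegistry:
    """Minimal registry so any compatible SLM can discover and invoke skills."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[str], str]] = {}
        self._docs: Dict[str, str] = {}

    def register(self, name: str, doc: str):
        """Decorator that adds a skill plus its prompt-facing description."""
        def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
            self._skills[name] = fn
            self._docs[name] = doc
            return fn
        return decorator

    def describe(self) -> str:
        # Rendered into the model prompt so the policy can choose a skill.
        return "\n".join(f"- {name}: {doc}" for name, doc in self._docs.items())

    def invoke(self, name: str, payload: str) -> str:
        return self._skills[name](payload)

registry = SkillRegistry()

@registry.register("sql_query_generator", "Turn a question into a SQL query")
def sql_query_generator(question: str) -> str:
    # Toy implementation; a real skill would call an SLM or a query builder.
    return f"SELECT * FROM claims WHERE note LIKE '%{question}%'"

menu = registry.describe()
result = registry.invoke("sql_query_generator", "water damage")
```

Because the skill interface is just strings in and strings out, the same registry can back a 2.7 B model today and a 30 B model tomorrow without changing the modules themselves.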
Hybrid Stack – Organizations can adopt a tiered approach:
- Tiny models for low‑risk, high‑throughput tasks.
- Mid‑size SLMs for critical decision‑making.
- Large code‑specialized models for complex code generation or data‑transformation pipelines.
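The tiering above amounts to a small routing function. A minimal sketch, assuming hypothetical tier names and a two-signal routing rule (task risk and whether code generation is needed) that the paper does not prescribe:

```python
def route_to_tier(task_risk: str, needs_code: bool) -> str:
    """Pick a model tier per the hybrid-stack guideline.
    Tier names and thresholds are illustrative assumptions."""
    if needs_code:
        return "large-code-80b"      # complex code gen / data-transformation pipelines
    if task_risk == "high":
        return "mid-slm-12b-30b"     # critical decision-making
    return "tiny-1b-3b"              # low-risk, high-throughput tasks

tier = route_to_tier("low", needs_code=False)   # -> "tiny-1b-3b"
```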
Limitations & Future Work
- Skill‑selection bottleneck – Even the best SLMs occasionally pick the wrong skill, especially when the context is ambiguous. A more robust external controller (e.g., an RL‑based policy) could improve reliability.
- Domain generalization – The insurance‑claims dataset is representative but still limited; results may differ for domains with richer multimodal data (images, audio).
- Evaluation scope – Only three tasks were examined; broader benchmarks (e.g., multimodal reasoning, real‑time chat) are needed to fully map the performance landscape.
- Hardware constraints – While 80 B models fit on a single A100, they still demand high‑end infrastructure; future work could explore quantization or distillation to bring comparable performance to smaller GPUs.
Overall, the study provides a clear roadmap for enterprises eager to harness the Agent‑Skill paradigm without relying on costly, closed‑source APIs, highlighting where small models shine and where they still need help.
Authors
- Yangjie Xu
- Lujun Li
- Lama Sleem
- Niccolò Gentile
- Yewei Song
- Yiqun Wang
- Siming Ji
- Wenbo Wu
- Radu State
Paper Information
| Field | Details |
|---|---|
| arXiv ID | 2602.16653v1 |
| Categories | cs.AI |
| Published | February 18, 2026 |