Why I Wouldn't Act on SkillsBench

Published: February 25, 2026 at 05:37 AM EST
7 min read
Source: Dev.to

Overview

I came across SkillsBench (paper, Feb 2026) while watching Theo, and was genuinely excited. It asks two critical questions:

  1. Do curated procedural documents (“Skills”) actually help coding agents?
  2. Which coding agent utilizes them best?

The headline number – +16.2 pp from curated Skills – felt immediately actionable.

First Impressions

Then I started pulling at the methodology, and things unraveled.

  • Scope – 84 tasks, 11 domains, 7 coding agents, 7,308 trajectories.
  • Conditions – each task is evaluated under three settings:
    • No Skills
    • Curated (expert‑written) Skills
    • Self‑generated Skills
  • Skill package – every task ships with a fixed Skill package (markdown instructions, sometimes with scripts or templates) that is provided to the agent alongside the task.

Leaderboard Findings

As with any benchmark, the headline artifact is the leaderboard.

Finding 2 (§4.1.1) – Best Raw Performance

| Agent | Model | Pass rate |
| --- | --- | --- |
| Gemini CLI | Flash | 48.7 % |
| Claude Code | Opus 4.5 | — |

Uplift: +23.3 pp (largest)

This is a legitimate result — though Flash beating Opus 4.5/4.6 is a bit surprising.

What the Leaderboard Actually Measures

The leaderboard shows which agent performed the best, but it does not tell us whether the Skills mechanism made any difference, or whether the same result would have been achieved by placing the content directly in the prompt.

Missing Experiment

Compare two conditions: inject the same Skill content directly into the prompt (baseline), versus letting the harness load the Skills through its native discovery mechanism.

This experiment is absent from the paper, yet it is the one that would justify a benchmark titled “SkillsBench.”
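
To make the comparison concrete, here is a minimal sketch of that missing A/B experiment. Everything here is hypothetical scaffolding — `run_agent` is a stand-in for whatever harness you use, and none of these names come from the paper:

```python
# Hypothetical sketch: the same Skill content is delivered two ways and
# pass rates compared. `run_agent` is a placeholder for a real harness call.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trial:
    task_id: str
    condition: str  # "prompt_injected" or "native_skills"
    passed: bool

def run_condition(tasks, skill_text, run_agent: Callable, condition: str):
    trials = []
    for task in tasks:
        if condition == "prompt_injected":
            # Baseline: paste the Skill content directly into the task prompt.
            passed = run_agent(prompt=task["prompt"] + "\n\n" + skill_text,
                               skills=None)
        else:
            # Treatment: hand the Skill to the harness's own loading mechanism.
            passed = run_agent(prompt=task["prompt"], skills=[skill_text])
        trials.append(Trial(task["id"], condition, passed))
    return trials

def pass_rate(trials):
    return sum(t.passed for t in trials) / len(trials)
```

If the two conditions produce equivalent pass rates, Skills are just a packaging format for prompt content; if native loading wins, the architecture is doing real work.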

Design‑Oriented Findings

Two of the paper’s design‑oriented findings sound practical:

  1. Number of Skills – 2–3 Skills are optimal (+18.6 pp); 4+ Skills show diminishing returns (+5.9 pp). (Finding 5, §4.2.1)
  2. Skill Length – Moderate‑length Skills outperform comprehensive ones — detailed (+18.8 pp) and compact (+17.1 pp) beat comprehensive (–2.9 pp). (Finding 6, §4.2.2)

Why These Findings Are Problematic

  • Each task ships with a fixed Skill package, so “Skill count” and “Skill complexity” are properties of the task, not independent variables.
  • Consequently, the experiment cannot isolate the effect of number of Skills from the effect of which task is being solved.
  • The paper stratifies post‑hoc by Skill count and draws causal language (“optimal,” “diminishing returns”), but the design does not support that inference.

The same issue applies to complexity: the N = 140 “comprehensive” bucket that shows –2.9 pp could simply contain harder tasks. Without controlling for task difficulty—or better, varying Skill count/complexity within a task—these are merely correlational observations dressed as design guidelines.

Domain Breakdown (Table 4)

| Domain | Δ Pass rate |
| --- | --- |
| Healthcare | +51.9 pp |
| Manufacturing | +41.9 pp |
| Software Engineering | +4.5 pp |

These numbers anchor the paper’s claim that domains with knowledge “under‑represented in model pretraining” benefit most from Skills (Finding 4, §4.1.3).

Issues with the Domain Analysis

  • Healthcare – only 2 tasks.
  • Manufacturing – only 3 tasks.

A single outlier task—and several individual tasks swing by 70–85 pp—can dominate an entire domain’s aggregate. With such tiny N, you are not measuring a domain effect; you are measuring a handful of tasks. The paper reports these figures without confidence intervals at the domain level and without flagging the sample‑size limitation.

In contrast, Software Engineering (N = 16) shows a much more defensible estimate, albeit a far less exciting one.
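
A quick way to see how fragile a two-task aggregate is: bootstrap a confidence interval over per-task uplifts. The per-task numbers below are invented for illustration (chosen so the two-task mean matches the paper's +51.9 pp Healthcare aggregate) — they are not from the paper:

```python
# Bootstrap CI over per-task pass-rate deltas, stdlib only.
import random

def bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(deltas, k=len(deltas))) / len(deltas)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented per-task uplifts: one huge swing plus one modest task averages
# to the same headline number as a stable, consistent domain would.
healthcare = [85.0, 18.8]  # 2 tasks, mean +51.9 pp
software = [5.0, 3.2, 6.1, 4.0, 2.8, 5.5, 4.4, 3.9,
            6.0, 4.8, 3.1, 5.2, 4.6, 3.7, 5.9, 4.1]  # 16 tasks
```

With N = 2 the interval spans tens of percentage points — the aggregate tells you almost nothing — while the 16-task domain yields a tight, defensible estimate.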

Re‑framing the Findings as Prompting Results

Since Skills are lazily loaded prompt pieces, replace “Skills” with “prompt” in the remaining findings:

| Original finding | Re‑framed as a prompting result |
| --- | --- |
| Finding 1 (§4.1.1): curated Skills improve performance | Curated, expert‑written prompts improve performance. |
| Finding 7 (§4.2.3): a smaller model + Skills can exceed a larger model without Skills | A smaller model with a good prompt can outperform a larger model with a mediocre prompt. |
| Finding 3 (§4.1.1): self‑generated Skills provide no benefit | Performance doesn’t improve when the model provides its own context. |

These re‑framings are not surprising; the prompting literature has already established both points.

Self‑Generated Skills

Finding 3 – self‑generated Skills provide no benefit – is slightly more interesting because meta‑prompting (using a model to generate its own prompts) does work in some settings.

A plausible explanation:

  • For tasks where the model lacks domain knowledge, it can’t write effective Skills: the missing knowledge is exactly what the Skill would need to supply.
  • For tasks where the model already has the domain knowledge, the marginal contribution of a Skill is minimal.

Either way, performance doesn’t improve when the model writes its own Skills, which translates to “performance doesn’t improve when the model provides its own context”—again, unsurprising.

What’s Missing & Suggested Experiments

The paper asks the right questions but doesn’t yet have the experiments to answer them. Below is a consolidated list of needed investigations.

1. Isolate the Mechanism

  • Goal: Determine whether the Skills machinery matters, not just whether the Skills content helps.
  • Experiment: Take the same Skill content and inject it directly into the prompt (baseline) vs. let the harness load it through its native discovery mechanism.
    • If native loading wins → the Skills architecture is doing real work.
    • If results are equivalent → Skills are merely a packaging format for prompt content.

2. Isolate the Content

  • Goal: Test whether procedural structure specifically drives gains.
  • Experiment: Inject the same token count of topically relevant non‑procedural text (e.g., API docs, reference material) and compare performance.

3. Vary Skills Within Tasks

  • Goal: Decouple Skill count/complexity from task difficulty.
  • Design: For a given task, create multiple Skill packages that differ only in number or length, keeping the underlying task constant.
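
One way to generate that design systematically is to enumerate, for a single fixed task, every Skill subset of each target size. This is an illustrative sketch, not the paper's procedure:

```python
# Within-task design sketch: Skill count is manipulated directly, instead
# of being inherited from whichever task happens to ship how many Skills.
from itertools import combinations

def skill_count_variants(skills, counts=(1, 2, 3, 4)):
    """For one fixed task, yield every subset of the full Skill set at each
    target count. Running the task under each subset isolates the count
    effect from task identity."""
    for k in counts:
        if k <= len(skills):
            for subset in combinations(skills, k):
                yield k, list(subset)
```

Averaging pass rates per count, within the same task, would turn the paper's correlational "2–3 Skills are optimal" into an actual causal estimate.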

4. Domain‑Level Robustness

  • Goal: Obtain reliable domain‑level estimates.
  • Design: Increase the number of tasks per domain (especially low‑N domains like Healthcare and Manufacturing) and report confidence intervals.

5. Baseline Prompt Quality

  • Goal: Ensure that any observed uplift isn’t simply due to better prompting.
  • Design: Compare curated Skills against expert‑crafted prompts of equivalent length and token budget.

Take‑away

  • SkillsBench raises important questions about the utility of curated procedural documents for coding agents.
  • The current methodology conflates content with mechanism, and it mixes correlational observations with causal language.
  • Rigorous, controlled experiments—especially those that directly compare the Skills loading mechanism to a plain‑prompt baseline—are needed before we can claim that “Skills” as a distinct benchmark component provide unique value.

Recommendations for Skill‑Based Agent Design

1. Separate Skill Design Findings from Task Identity

  • Run controlled experiments:
    • Execute the same task multiple times, each time varying the number of Skills available.
    • Measure the performance delta for each variation.
  • Vary Skill complexity:
    • Provide the agent with a compact Skill set versus an exhaustive Skill set for the identical task.
    • Observe how complexity influences outcomes.
  • Goal: Transform correlational observations into concrete design guidance.

2. Test with a Fixed Skill Library

  • Current setup limitation: Each task receives a hand‑picked Skill package, guaranteeing a perfect match.
  • Proposed experiment:
    • Create a static library of ~20–30 Skills that remains constant across all tasks.
    • Allow the agent to discover and apply the appropriate Skills on its own.
  • Why it matters:
    • This evaluates Skill selection (the harder, more realistic problem) rather than merely Skill consumption.
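
As a toy illustration of what the selection problem looks like, here is a trivial keyword-overlap retriever standing in for an agent choosing from a static library. All names and Skill texts are invented:

```python
# Toy Skill-selection sketch: given a static library, pick the Skills whose
# text overlaps most with the task description. A real agent would do this
# discovery itself; the point is that selection accuracy becomes measurable.
def select_skills(task_description, library, top_k=2):
    task_words = set(task_description.lower().split())
    scored = sorted(
        library.items(),
        key=lambda kv: len(task_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]
```

Scoring how often the right Skills are retrieved (against a known gold mapping) evaluates selection; the current benchmark only ever evaluates consumption.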

3. Practical Recommendation

  • Do not act on the paper’s findings directly.
  • If you are investing in Skills for your agents today, calibrate that investment based on your own trial‑and‑error rather than relying on the study’s conclusions.