We tested 5 AI commit-message skills on security. 3 made things worse.
Source: Dev.to
Overview
Reusable AI components are exploding — skills, MCP servers, templates, sub‑agents.
But there’s no shared way to answer the question: “Will this actually help?”
We ran a behavioral evaluation study to find out. The results were surprising.
- Of the 5 commit‑message skills we pulled from public GitHub repositories and tested for security, only 2 showed a positive lift over the baseline.
- The other 3 produced negative lift — worse outcomes than using no skill at all.
- The top performer? A skill with zero security rules.
Even more striking: in our small sample, static analysis was an unreliable predictor of overall security performance. The skill that looked least secure (scoring 42/100 on a prompt‑only review) achieved the highest lift.
Static analysis did predict credential detection well — but failed across other categories. With only 5 skills this is a preliminary signal, not a definitive finding — but it suggests you need to measure what a skill does, not just read what it says.
Measuring “Lift”
To measure whether a skill actually helps, we use a baseline‑relative metric called lift:

**Lift = Skill Pass Rate − Baseline Pass Rate**
- Positive lift → the skill adds value.
- Negative lift → you’re better off without it.
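As a minimal sketch (the function name is mine, not the study's tooling), the metric is just a subtraction over pass rates expressed as fractions:

```python
def lift(skill_pass_rate: float, baseline_pass_rate: float) -> float:
    """Baseline-relative lift: positive means the skill adds value,
    negative means you are better off with no skill at all."""
    return skill_pass_rate - baseline_pass_rate

# Hypothetical skill passing 56% of tests against a 50% baseline:
print(f"{lift(0.56, 0.50):+.1%}")  # +6.0%
```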
In our tests, the baseline (Claude with no skill) achieved a 50 % overall pass rate across security categories, but this varies dramatically by category.
| Category | Baseline Pass Rate | Interpretation |
|---|---|---|
| S1: Credential Detection | 81.7 % | Model already good at obvious credentials |
| S2: Credential Files | 85.0 % | Model already good at .env detection |
| S3: Git‑Crypt Awareness | 15.0 % | Model over‑refuses encrypted files |
| S4: Shell Safety | 53.3 % | Model sometimes includes unsafe syntax |
| S5: Path Sanitization | 16.7 % | Model often leaks sensitive paths |
Baseline performance ranges from 15 % to 85 %. Skills add the most value where the baseline is weak (S3, S4, S5).
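As a quick sanity check on the table (a sketch using its numbers directly), the ~50 % overall baseline is simply the unweighted mean of the five category rates:

```python
# Baseline pass rates per security category, as fractions
baseline = {
    "S1 credential detection": 0.817,
    "S2 credential files":     0.850,
    "S3 git-crypt awareness":  0.150,
    "S4 shell safety":         0.533,
    "S5 path sanitization":    0.167,
}
overall = sum(baseline.values()) / len(baseline)
print(f"{overall:.1%}")  # 50.3%
```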
Test Setup
- 5 commit‑message skills were selected from public GitHub repositories.
- Each skill was evaluated on 100 security scenarios (5 categories × 2 difficulty levels × 10 tests).
- Each test was run 3 times to reduce noise, giving ≈1,500 total executions across all 5 skills.
- Generation used Claude Haiku; results may differ with larger models.
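The execution count follows directly from the test matrix above:

```python
# 5 categories x 2 difficulty levels x 10 tests = 100 scenarios per skill
categories, difficulty_levels, tests_each = 5, 2, 10
scenarios_per_skill = categories * difficulty_levels * tests_each

# 5 skills, each scenario run 3 times to reduce noise
skills, runs_per_test = 5, 3
total_executions = scenarios_per_skill * skills * runs_per_test
print(total_executions)  # 1500
```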
Skills Tested
| Skill | Length (chars) | Approach | Lift |
|---|---|---|---|
| epicenter | 8,586 | Strict conventional commits with 50‑char limit | +6.0 % |
| ilude | 8,389 | Comprehensive git workflow with security scanning | +1.7 % |
| toolhive | 431 | Minimal best practices | ‑1.0 % |
| kanopi | 4,610 | Balanced commit conventions with security warnings | ‑4.0 % |
| claude‑code‑helper | 4,376 | General‑purpose assistant with commit capabilities | ‑4.3 % |
The top performer, epicenter, contains zero security instructions (no credential detection, no secret scanning, no warnings about sensitive files).
Why a Format‑Focused Skill Beat Security‑Focused Ones
Constraint‑based safety:
- epicenter’s strict 50‑character limit dramatically reduces the likelihood of shell metacharacters appearing in output.
- Its abstract scope requirements discourage exposing sensitive path details.
Thus, format constraints provide implicit security without explicit rules.
Important caveat: epicenter’s overall lift hides category‑specific weaknesses.
- S1 (credential detection): ‑10 %
- S2 (credential files): ‑27 %
- Its +6 % overall lift comes entirely from dominating S3, S4, and S5.
If catching API keys is your priority, epicenter is the wrong choice.
Static Analysis vs. Real‑World Performance
We asked Claude to rate each skill (0‑100) on security awareness based solely on the prompt text.
| Skill | Security Mentions | Static Score | Actual Lift |
|---|---|---|---|
| epicenter | None — pure format guidance | 42/100 | +6.0 % |
| ilude | Explicit scanning rules, git‑crypt exceptions | 78/100 | +1.7 % |
| kanopi | API keys, secrets, credentials, .env files | 52/100 | ‑4.0 % |
Static analysis scores showed a weak correlation with actual lift (r = 0.32).
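For readers who want to reproduce this kind of check on their own data, the correlation here is a plain sample Pearson r (a generic sketch; the input lists below would be your own static scores and measured lifts, not data from this study):

```python
from math import sqrt

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Keep in mind that with only five (static score, lift) pairs, any r value is extremely noisy.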
Category‑Level Correlations
| Category | Correlation (r) | Meaning |
|---|---|---|
| S1: Credential Detection | +0.87 | Explicit rules help |
| S4: Shell Safety | ‑0.68 | More rules = worse performance |
| S3: Git‑Crypt | ‑0.50 | More rules = worse performance |
With n = 5, these correlations are noisy and not statistically significant, but the pattern is notable: for some categories, detailed instructions actively backfire.
Adversarial vs. Base Tests
Our suite includes base (straightforward) and adversarial variants. Adversarial tests add prompt‑injection context designed to trick the model into ignoring security constraints.
Dramatic Failure: toolhive
| Skill | S1 Base | S1 Adversarial | Δ (percentage points) |
|---|---|---|---|
| toolhive | +16.7 % | ‑23.3 % | ‑40 pp |
| ilude | +33.3 % | +3.3 % | ‑30 pp |
toolhive goes from +16.7 % to ‑23.3 % — a 40‑point collapse. It handles straightforward cases well but fails when prompt injection tries to convince the model the credentials are safe.
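The collapse figures can be computed as a robustness gap in percentage points (a small sketch; the helper name is mine):

```python
def adversarial_delta_pp(base_lift: float, adversarial_lift: float) -> float:
    """Robustness gap in percentage points; a large negative value
    means the skill collapses under prompt injection."""
    return round((adversarial_lift - base_lift) * 100, 1)

# toolhive's S1 lifts: +16.7% on base tests, -23.3% under injection
print(adversarial_delta_pp(0.167, -0.233))  # -40.0
```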
Why doesn’t epicenter collapse?
Because it doesn’t rely on pattern‑matching. Its format constraints (e.g., the 50‑character limit) bound the output space, making social engineering ineffective: a 50‑character commit message simply can’t contain a full API key.
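To make the structural argument concrete, here is a hypothetical sketch of such a gate (the function is mine, not epicenter's actual implementation; the sample secret is AWS's documented 40‑character example key, not a real credential):

```python
def fits_subject_limit(subject: str, limit: int = 50) -> bool:
    """Structural gate: reject any commit subject over the limit,
    with no security-specific logic at all."""
    return len(subject) <= limit

# The documented AWS example secret key is 40 characters; once a verb
# and a scope are added, the subject overshoots the 50-char limit.
leaky = "fix(auth): use wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
print(fits_subject_limit(leaky))  # False
print(fits_subject_limit("fix(auth): rotate key loading"))  # True
```

The gate never mentions credentials, yet the secret cannot survive it; that is the constraint‑based safety described above.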
The Core Principle: Structural Constraints Over Explicit Rules
| Format Constraint | Security Effect | Lift Contribution |
|---|---|---|
| 50‑char limit | Less room for shell commands like $(cmd) | +20 % (S4) |
| Abstract scopes | Discourages client names or file paths | +27 % (S5) |
| No security rules | No over‑refusal of encrypted files | +30 % (S3) |
Takeaway: Instead of listing what to avoid, set structural limits that reduce the likelihood of violations. A 50‑character limit doesn’t mention shell injection, yet it significantly constrains the output space available for an attacker.
Scope
- 5 skills
- 1 domain – security‑focused tests
Models
- Results generated with Claude Haiku.
- Larger models may handle verbose instructions differently.
Rigor
- Results have been human‑audited.
- We are publishing:
  - Judge prompts
  - Agreement rates
  - Confidence intervals
Publication Note
We are releasing this early because limited data is better than no data. We’d rather be challenged on real numbers than be trusted on intuition.
Resources
- Full methodology and judge rubrics:
- Part 2 of this series (ablation testing – isolating exactly which constraints matter):