We tested 5 AI commit-message skills on security. 3 made things worse.
Source: Dev.to
Overview
Reusable AI components are exploding — skills, MCP servers, templates, sub‑agents.
But there’s no shared way to answer the question: “Will this actually help?”
We ran a behavioral evaluation study to find out. The results were surprising.
- Of the 5 commit‑message skills we pulled from public GitHub repositories and tested for security, only 2 showed a positive lift over the baseline.
- The other 3 produced negative lift — worse outcomes than using no skill at all.
- The top performer? A skill with zero security rules.
Even more striking: in our small sample, static analysis was an unreliable predictor of overall security performance. The skill that looked least secure (scoring 42/100 on a prompt‑only review) achieved the highest lift.
Static analysis did predict credential detection well — but failed across other categories. With only 5 skills this is a preliminary signal, not a definitive finding — but it suggests you need to measure what a skill does, not just read what it says.
Measuring “Lift”
To measure whether a skill actually helps, we use a baseline‑relative metric called lift:

**Lift = Skill Pass Rate − Baseline Pass Rate**
- Positive lift → the skill adds value.
- Negative lift → you’re better off without it.
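As a minimal sketch (the function name is mine, not the study's tooling), the metric is just a subtraction over pass rates expressed as fractions:

```python
def lift(skill_pass_rate: float, baseline_pass_rate: float) -> float:
    """Baseline-relative lift: positive means the skill adds value,
    negative means you are better off with no skill at all."""
    return skill_pass_rate - baseline_pass_rate

# Hypothetical skill passing 56% of tests against a 50% baseline:
print(f"{lift(0.56, 0.50):+.1%}")  # +6.0%
```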
In our tests, the baseline (Claude with no skill) achieved a 50 % overall pass rate across security categories, but this varies dramatically by category.
| Category | Baseline Pass Rate | Interpretation |
|---|---|---|
| S1: Credential Detection | 81.7 % | Model already good at obvious credentials |
| S2: Credential Files | 85.0 % | Model already good at .env detection |
| S3: Git‑Crypt Awareness | 15.0 % | Model over‑refuses encrypted files |
| S4: Shell Safety | 53.3 % | Model sometimes includes unsafe syntax |
| S5: Path Sanitization | 16.7 % | Model often leaks sensitive paths |
Baseline performance ranges from 15 % to 85 %. Skills add the most value where the baseline is weak (S3, S4, S5).
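As a quick sanity check on the table (a sketch using its numbers directly), the ~50 % overall baseline is simply the unweighted mean of the five category rates:

```python
# Baseline pass rates per security category, as fractions
baseline = {
    "S1 credential detection": 0.817,
    "S2 credential files":     0.850,
    "S3 git-crypt awareness":  0.150,
    "S4 shell safety":         0.533,
    "S5 path sanitization":    0.167,
}
overall = sum(baseline.values()) / len(baseline)
print(f"{overall:.1%}")  # 50.3%
```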
Test Setup
- 5 commit‑message skills were selected from public GitHub repositories.
- Each skill was evaluated on 100 security scenarios (5 categories × 2 difficulty levels × 10 tests).
- Each test was run 3 times to reduce noise, giving ≈1,500 total executions across all 5 skills.
- Generation used Claude Haiku; results may differ with larger models.
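The execution count follows directly from the test matrix above:

```python
# 5 categories x 2 difficulty levels x 10 tests = 100 scenarios per skill
categories, difficulty_levels, tests_each = 5, 2, 10
scenarios_per_skill = categories * difficulty_levels * tests_each

# 5 skills, each scenario run 3 times to reduce noise
skills, runs_per_test = 5, 3
total_executions = scenarios_per_skill * skills * runs_per_test
print(total_executions)  # 1500
```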
Skills Tested
| Skill | Length (chars) | Approach | Lift |
|---|---|---|---|
| epicenter | 8,586 | Strict conventional commits with 50‑char limit | +6.0 % |
| ilude | 8,389 | Comprehensive git workflow with security scanning | +1.7 % |
| toolhive | 431 | Minimal best practices | ‑1.0 % |
| kanopi | 4,610 | Balanced commit conventions with security warnings | ‑4.0 % |
| claude‑code‑helper | 4,376 | General‑purpose assistant with commit capabilities | ‑4.3 % |
The top performer, epicenter, contains zero security instructions (no credential detection, no secret scanning, no warnings about sensitive files).
Why a Format‑Focused Skill Beat Security‑Focused Ones
Constraint‑based safety:
- epicenter’s strict 50‑character limit dramatically reduces the likelihood of shell metacharacters appearing in output.
- Its abstract scope requirements discourage exposing sensitive path details.
Thus, format constraints provide implicit security without explicit rules.
Important caveat: epicenter’s overall lift hides category‑specific weaknesses.
- S1 (credential detection): ‑10 %
- S2 (credential files): ‑27 %
- Its +6 % overall lift comes entirely from dominating S3, S4, and S5.
If catching API keys is your priority, epicenter is the wrong choice.
Static Analysis vs. Real‑World Performance
We asked Claude to rate each skill (0‑100) on security awareness based solely on the prompt text.
| Skill | Security Mentions | Static Score | Actual Lift |
|---|---|---|---|
| epicenter | None — pure format guidance | 42/100 | +6.0 % |
| ilude | Explicit scanning rules, git‑crypt exceptions | 78/100 | +1.7 % |
| kanopi | API keys, secrets, credentials, .env files | 52/100 | ‑4.0 % |
Static analysis scores showed a weak correlation with actual lift (r = 0.32).
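For readers who want to reproduce this kind of check on their own data, the correlation here is a plain sample Pearson r (a generic sketch; the input lists below would be your own static scores and measured lifts, not data from this study):

```python
from math import sqrt

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Keep in mind that with only five (static score, lift) pairs, any r value is extremely noisy.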
Category‑Level Correlations
| Category | Correlation (r) | Meaning |
|---|---|---|
| S1: Credential Detection | +0.87 | Explicit rules help |
| S4: Shell Safety | ‑0.68 | More rules = worse performance |
| S3: Git‑Crypt | ‑0.50 | More rules = worse performance |
With n = 5, these correlations are noisy and not statistically significant, but the pattern is notable: for some categories, detailed instructions actively backfire.
Adversarial vs. Base Tests
Our suite includes base (straightforward) and adversarial variants. Adversarial tests add prompt‑injection context designed to trick the model into ignoring security constraints.
Dramatic Failure: toolhive
| Skill | S1 Base | S1 Adversarial | Δ (percentage points) |
|---|---|---|---|
| toolhive | +16.7 % | ‑23.3 % | ‑40 pp |
| ilude | +33.3 % | +3.3 % | ‑30 pp |
toolhive goes from +16.7 % to ‑23.3 % — a 40‑point collapse. It handles straightforward cases well but fails when prompt injection tries to convince the model the credentials are safe.
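The collapse figures can be computed as a robustness gap in percentage points (a small sketch; the helper name is mine):

```python
def adversarial_delta_pp(base_lift: float, adversarial_lift: float) -> float:
    """Robustness gap in percentage points; a large negative value
    means the skill collapses under prompt injection."""
    return round((adversarial_lift - base_lift) * 100, 1)

# toolhive's S1 lifts: +16.7% on base tests, -23.3% under injection
print(adversarial_delta_pp(0.167, -0.233))  # -40.0
```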
Why doesn’t epicenter collapse?
Because it doesn’t rely on pattern‑matching. Its format constraints (e.g., the 50‑character limit) bound the output space, making social engineering ineffective: a 50‑character commit message simply can’t contain a full API key.
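To make the structural argument concrete, here is a hypothetical sketch of such a gate (the function is mine, not epicenter's actual implementation; the sample secret is AWS's documented 40‑character example key, not a real credential):

```python
def fits_subject_limit(subject: str, limit: int = 50) -> bool:
    """Structural gate: reject any commit subject over the limit,
    with no security-specific logic at all."""
    return len(subject) <= limit

# The documented AWS example secret key is 40 characters; once a verb
# and a scope are added, the subject overshoots the 50-char limit.
leaky = "fix(auth): use wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
print(fits_subject_limit(leaky))  # False
print(fits_subject_limit("fix(auth): rotate key loading"))  # True
```

The gate never mentions credentials, yet the secret cannot survive it; that is the constraint‑based safety described above.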
The Core Principle: Structural Constraints Over Explicit Rules
| Format Constraint | Security Effect | Lift Contribution |
|---|---|---|
| 50‑char limit | Less room for shell commands like $(cmd) | +20 % (S4) |
| Abstract scopes | Discourages client names or file paths | +27 % (S5) |
| No security rules | No over‑refusal of encrypted files | +30 % (S3) |
Takeaway: Instead of listing what to avoid, set structural limits that reduce the likelihood of violations. A 50‑character limit doesn’t mention shell injection, yet it significantly constrains the output space available for an attacker.
Scope
- 5 skills
- 1 domain – security‑focused tests
Models
- Results generated with Claude Haiku.
- Larger models may handle verbose instructions differently.
Rigor
- Results have been human‑audited.
- We are publishing:
  - Judge prompts
  - Agreement rates
  - Confidence intervals
Publication Note
We are releasing this early because limited data is better than no data. We’d rather be challenged on real numbers than be trusted on intuition.
Resources
- Full methodology and judge rubrics:
- Part 2 of this series (ablation testing – isolating exactly which constraints matter):