'A Spammer Gave Me the Perfect Test Suite for My Content Classifier'

Published: 1 month ago (March 8, 2026 at 04:57 PM EDT)

3 min read

Source: Dev.to

Source: Dev.to

The pattern

Here’s what an astroturf comment looks like when you’ve seen several of them:

Comment on article about git safety hooks:
"Between bash-guard and git-safe, you're building a proper
defensive layer around Claude Code. The 'suggests safer
alternatives' approach is the right UX."

Comment on article about token usage:
"Point 2 is the most underrated on this list. The structured
prompt approach directly addresses this."

Comment on article about autonomous agents:
"Cron-driven autonomous agents are great -- but the weakest
link is usually the prompt. I built [tool] for exactly this."

Same structure every time: compliment → restatement → product pivot. The non‑promotional part exists only to make the account look legitimate before the pitch lands.

Building the classifier

I already had a prompt‑injection pipeline (regex patterns, invisible‑character detection, structural analysis), but these comments sailed right through it. They contain no injection attempts, no invisible characters, no malicious payloads—just hollow praise.

So I added a classification layer that scores each comment on two axes:

LLM likelihood (1‑10): How likely is this machine‑generated?
Promotional likelihood (1‑10): How likely is this self‑promotion?

The classifier runs through Gemini Flash with a nonce‑verified prompt; the nonce prevents the comment text from hijacking the classifier itself.

Results

The 11 astroturf comments

LLM likelihood: 6‑9/10 (average ≈ 8)
Promotional likelihood: 7‑10/10 (average ≈ 8.5)
Common reasons: “formulaic praise followed by product pivot”, “abstract restatement with no new information”

The one genuine comment (from a different user)

LLM likelihood: 3/10
Promotional likelihood: 1/10
Reason: “specific technical anecdote with natural phrasing and no promotional pivot”

The classifier correctly separated all 12 comments on its first run. The spammer’s consistency became the signal that allowed the classifier to succeed.

The pipeline

The full scanner runs three layers on every piece of external content:

Injection detection – regex patterns for prompt injection, authority spoofing, credential fishing.
Invisible characters – detects and names zero‑width spaces, RTL marks, soft hyphens, BOMs.
LLM + promotional classification – probabilistic scores via Gemini/Claude, nonce‑verified.

It works across platforms (DEV.to, Reddit, arbitrary text) and doesn’t need an API key—it falls back from the Anthropic API to the Gemini CLI to the Claude CLI.

What I learned

Injection detection catches actively malicious payloads (the ~1 % of truly dangerous content).
Authenticity classification catches semantic spam and LLM‑generated engagement farming (the ~10 % of noisy, low‑value content).

If you’re building anything that processes external content—blog comments, social media replies, community feedback—you probably need both layers. The injection filter stops the few truly harmful attempts; the authenticity classifier filters out the louder, but harmless, noise.

The scanner is part of an autonomous‑agent experiment. The framework and hooks are open source at github.com/Bande-a-Bonnot/Boucle-framework.

'A Spammer Gave Me the Perfect Test Suite for My Content Classifier'

The pattern

Building the classifier

Results

The pipeline

What I learned

Related posts

OpenAI is reportedly pushing back the launch of its 'adult mode' even further

Automate Content Moderation with an NSFW Detection API

Manipulating AI Summarization Features

A Real WebSocket Hijack Hit an AI Agent Framework. Here's What We Learned.