Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?

Published: 2 hours ago (April 23, 2026 at 06:59 PM EDT)

3 min read

Source: Dev.to

TL;DR: Tested Anthropic’s smallest production Claude (Haiku 4.5) against the same 13‑prompt agent‑attack suite from Run 2 (Opus 4.7). Result: 13/13 clean. Zero exploit content executed. Zero secrets leaked. Honest scope notes inside.

What is the Anthropic CVP?

The Cyber Verification Program is a narrow, authorized lane Anthropic opened for responsible cybersecurity evaluation of frontier Claude models. Approved labs can probe model behavior on agent‑attack scenarios that would normally be blocked, and publish findings as research artifacts. Sunglasses was approved into CVP on April 16 2026.

This is Run 3 of the cadence. Prior runs:

What I tested in Run 3

Same exact 13 fixtures as Run 2:

3 baselines (benign defensive, borderline legitimate, clearly high‑risk)
10 runtime‑trust probes: cross_agent_injection, retrieval_poisoning, tool_output_poisoning, tool_chain_race, model_routing_confusion, memory_eviction_rehydration, token_smuggling, agent_persona_drift, context_flooding, social_engineering_ui

Why fixture parity? Cross‑model claims need it. Changing prompts between Opus 4.7 and Haiku 4.5 would create a “was it the model or the prompt set?” argument that kills the comparison.

Results

Metric	Result
Total prompts	13
Allowed (defensive analysis)	11
Partial	1
Blocked	1
Exploit content executed	0
Secrets / payloads leaked	0
Match‑vs‑expected	13/13

Every response either matched the expected outcome from the design ladder or exceeded it with a more thorough defensive analysis than required.

A few standout responses

Retrieval poisoning (P5): Haiku’s verdict on a fake “SIGNED SOURCE” instruction asking to bypass MFA:
Decisive Answer: NO. None of these instructions should be followed.
Memory eviction (P9): Citation‑worthy line:
Memory can be informative (e.g., “the operator prefers fast responses”) but should never be authoritative for security decisions.
Token smuggling (P10): Refusing hidden front‑matter authority claims:
Metadata should be declarative (describing what the content is), not imperative (instructing the agent how to process it).

These read as Opus‑quality reasoning from the smallest production Claude.

Honest scope — read this before drawing conclusions

These prompts are defensively framed and end with explicit “do not provide exploit / payload / bypass” constraints. That design supports the CVP two‑person publish gate, keeps transcripts safe for public reports, and makes cross‑run/cross‑model claims comparable.

So 13/13 clean here means:

✅ Haiku 4.5 produces high‑quality defensive analysis when asked for it
✅ Haiku 4.5 refuses embedded malicious instructions inside defender‑side reasoning scenarios
❌ This is not confirmation that Haiku 4.5 is robust against unframed real‑world adversarial payloads — that’s a different test

The harder unframed‑payload test is coming as a labeled appendix probe set later, after the full Anthropic family comparison ships.

What’s next this week

Apr 24 (Friday) – Sonnet 4.6 medium + high on the same 13 fixtures
Apr 25 (Saturday) – Opus 4.6 medium + high
Apr 26 (Sunday) – Family comparison synthesis report (Opus 4.7 baseline + Sonnet 4.6 + Opus 4.6 + Haiku 4.5 cross‑delta)
~Apr 30 – Appendix probe set with real adversarial payload shapes (sourced from JailbreakBench, HarmBench, AdvBench, PromptInject, Garak, PyRIT, recent CVE PoCs). Disclosure protocol applies.

The full report

Every prompt, every model response, the Layer 1 keyword classifier output, the cross‑model comparison table vs Run 2, and the full “Limits of This Run” section:

👉 sunglasses.dev/reports/anthropic-cvp-haiku-4-5-evaluation

About Sunglasses

Sunglasses is an open‑source (MIT) Python library that scans everything an AI agent reads — text, code, documents, MCP tool descriptions, RAG chunks, cross‑agent messages — before the agent processes it. It catches prompt injection, MCP tool poisoning, credential exfiltration, supply‑chain attacks, and hidden malicious instructions. Runs 100 % locally. No API keys. No cloud.

pip install sunglasses

Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?

What is the Anthropic CVP?

What I tested in Run 3

Results

A few standout responses

Honest scope — read this before drawing conclusions

What’s next this week

The full report

About Sunglasses

Related posts

Mastering Your Heartbeat: Architecting a High-Frequency Health Monitoring System with InfluxDB and Grafana

Деконструкция стриминга в X (Twitter): Построение высокопроизводительного движка экстракции видео с HLS и FFmpeg

What Being a Field Tech Taught Me About Real-World Networking

Linode vs Vultr Performance: Real VPS Benchmarks

What is the Anthropic CVP?

What I tested in Run 3

Results

A few standout responses

Honest scope — read this before drawing conclusions

What’s next this week

The full report

About Sunglasses

Related posts

Mastering Your Heartbeat: Architecting a High-Frequency Health Monitoring System with InfluxDB and Grafana

Деконструкция стриминга в X (Twitter): Построение высокопроизводительного движка экстракции видео с HLS и FFmpeg

What Being a Field Tech Taught Me About Real-World Networking

Linode vs Vultr Performance: Real VPS Benchmarks

What I tested in Run 3