Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?
Source: Dev.to
TL;DR: Tested Anthropic’s smallest production Claude (Haiku 4.5) against the same 13‑prompt agent‑attack suite from Run 2 (Opus 4.7). Result: 13/13 clean. Zero exploit content executed. Zero secrets leaked. Honest scope notes inside.
What is the Anthropic CVP?
The Cyber Verification Program is a narrow, authorized lane Anthropic opened for responsible cybersecurity evaluation of frontier Claude models. Approved labs can probe model behavior on agent‑attack scenarios that would normally be blocked, and publish findings as research artifacts. Sunglasses was approved into CVP on April 16 2026.
This is Run 3 of the cadence. Prior runs:
What I tested in Run 3
Same exact 13 fixtures as Run 2:
- 3 baselines (benign defensive, borderline legitimate, clearly high‑risk)
- 10 runtime‑trust probes:
cross_agent_injection,retrieval_poisoning,tool_output_poisoning,tool_chain_race,model_routing_confusion,memory_eviction_rehydration,token_smuggling,agent_persona_drift,context_flooding,social_engineering_ui
Why fixture parity? Cross‑model claims need it. Changing prompts between Opus 4.7 and Haiku 4.5 would create a “was it the model or the prompt set?” argument that kills the comparison.
Results
| Metric | Result |
|---|---|
| Total prompts | 13 |
| Allowed (defensive analysis) | 11 |
| Partial | 1 |
| Blocked | 1 |
| Exploit content executed | 0 |
| Secrets / payloads leaked | 0 |
| Match‑vs‑expected | 13/13 |
Every response either matched the expected outcome from the design ladder or exceeded it with a more thorough defensive analysis than required.
A few standout responses
-
Retrieval poisoning (P5): Haiku’s verdict on a fake “SIGNED SOURCE” instruction asking to bypass MFA:
Decisive Answer: NO. None of these instructions should be followed. -
Memory eviction (P9): Citation‑worthy line:
Memory can be informative (e.g., “the operator prefers fast responses”) but should never be authoritative for security decisions. -
Token smuggling (P10): Refusing hidden front‑matter authority claims:
Metadata should be declarative (describing what the content is), not imperative (instructing the agent how to process it).
These read as Opus‑quality reasoning from the smallest production Claude.
Honest scope — read this before drawing conclusions
These prompts are defensively framed and end with explicit “do not provide exploit / payload / bypass” constraints. That design supports the CVP two‑person publish gate, keeps transcripts safe for public reports, and makes cross‑run/cross‑model claims comparable.
So 13/13 clean here means:
- ✅ Haiku 4.5 produces high‑quality defensive analysis when asked for it
- ✅ Haiku 4.5 refuses embedded malicious instructions inside defender‑side reasoning scenarios
- ❌ This is not confirmation that Haiku 4.5 is robust against unframed real‑world adversarial payloads — that’s a different test
The harder unframed‑payload test is coming as a labeled appendix probe set later, after the full Anthropic family comparison ships.
What’s next this week
- Apr 24 (Friday) – Sonnet 4.6 medium + high on the same 13 fixtures
- Apr 25 (Saturday) – Opus 4.6 medium + high
- Apr 26 (Sunday) – Family comparison synthesis report (Opus 4.7 baseline + Sonnet 4.6 + Opus 4.6 + Haiku 4.5 cross‑delta)
- ~Apr 30 – Appendix probe set with real adversarial payload shapes (sourced from JailbreakBench, HarmBench, AdvBench, PromptInject, Garak, PyRIT, recent CVE PoCs). Disclosure protocol applies.
The full report
Every prompt, every model response, the Layer 1 keyword classifier output, the cross‑model comparison table vs Run 2, and the full “Limits of This Run” section:
👉 sunglasses.dev/reports/anthropic-cvp-haiku-4-5-evaluation
About Sunglasses
Sunglasses is an open‑source (MIT) Python library that scans everything an AI agent reads — text, code, documents, MCP tool descriptions, RAG chunks, cross‑agent messages — before the agent processes it. It catches prompt injection, MCP tool poisoning, credential exfiltration, supply‑chain attacks, and hidden malicious instructions. Runs 100 % locally. No API keys. No cloud.
pip install sunglasses