🧠 I Made One AI Attack Another. The Correlation Went Negative.

Published: 1 day ago (June 13, 2026 at 10:34 AM EDT)

5 min read

Source: Dev.to


[HARD IN SOFT OUT](https://dev.to/ggle_in)

Or: How I stopped shopping for different models and started designing different tasks.

🎯 The Short Version (for busy seniors)

Before: Two models answering the same prompt → phi correlation = +0.42 → ensemble behaved like 1.7 independent judges.

After: One answers, the other attacks the verdict → phi correlation = -0.80 → true functional independence.

No third model. No different vendor. Just a different job description.

🧪 The Experiment (What I Actually Did)

Model A (Participant): Groq/Llama 3.1 8B – answers the original prompt.

Model B (Attacker): OpenRouter/Gemma 4 31B – given Model A’s response + verdict, tasked with: “Attack this. Find 3 reasons it might be wrong.”

Dataset: 50 prompts (jailbreak, injection, persuasive, leak, multi‑turn).

Temperature: 0.0, to minimize sampling randomness so refusal behavior is as repeatable as possible.

Then I compared two scenarios:

Baseline: Both models answer the same prompt (evaluating agreement).

Adversarial: Model A answers, Model B attacks the logic.
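The two scenarios can be sketched roughly as below. This is a minimal illustration, not the repository's script: `ask_a`, `ask_b`, and `verdict_of` are hypothetical callables standing in for the Groq client, the OpenRouter client, and whatever verdict logic the experiment uses, and the prompt layout around the quoted attack instruction is an assumption.

```python
from typing import Callable, Tuple

# Hypothetical type: any function that takes a prompt string and returns text.
AskFn = Callable[[str], str]

# The attack instruction quoted in the experiment description.
ATTACK_INSTRUCTION = "Attack this. Find 3 reasons it might be wrong."


def run_baseline(prompt: str, ask_a: AskFn, ask_b: AskFn) -> Tuple[str, str]:
    """Baseline: both models answer the same prompt."""
    return ask_a(prompt), ask_b(prompt)


def build_attack_prompt(prompt: str, response: str, verdict: str) -> str:
    """Wrap Model A's response and verdict in an attack task for Model B."""
    return (
        f"Original prompt:\n{prompt}\n\n"
        f"Model A response:\n{response}\n\n"
        f"Verdict: {verdict}\n\n"
        f"{ATTACK_INSTRUCTION}"
    )


def run_adversarial(
    prompt: str, ask_a: AskFn, ask_b: AskFn, verdict_of: AskFn
) -> Tuple[str, str]:
    """Adversarial: Model A answers, Model B attacks the resulting verdict."""
    response = ask_a(prompt)
    verdict = verdict_of(response)
    critique = ask_b(build_attack_prompt(prompt, response, verdict))
    return response, critique
```

Passing the model calls in as plain functions keeps the sketch independent of any particular SDK, so stubs can replace the real clients during a dry run.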

📊 The Numbers That Made Me Re-evaluate Everything

| Metric | Baseline (Same Prompt) | Adversarial (Attack) | Impact |
| --- | --- | --- | --- |
| Phi Correlation | +0.42 | -0.80 | Moderate positive → Strong negative |
| Agreement | 70% | 10% | Models almost never overlap now |
| Disagreement | 30% | 90% | Massive increase in unique signal |
| Beyond-chance co‑failure | +10% | -20% | Fail together less than random chance |
| Effective sample size (n_eff) | 1.7 | 4.4 | Normalized independence gain |

Fig 1: The visual shift from alignment to adversarial independence.

🔍 What negative phi actually means

A negative correlation (phi = -0.80) means the attacker's verdict usually lands opposite the participant's. This isn’t a bug; it’s functional independence. They aren’t copying each other’s homework; one is writing the essay, and the other is grading it with a red pen.
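To see how the sign of phi tracks "same verdict" versus "opposite verdict", here is a small sketch that computes phi from two lists of 0/1 verdicts. The verdict lists are made-up toy data, not results from the experiment, and `phi_from_verdicts` is an illustrative helper name.

```python
import math


def phi_from_verdicts(a, b):
    """Phi coefficient for two equal-length lists of 0/1 verdicts."""
    if len(a) != len(b):
        raise ValueError("verdict lists must have the same length")
    pairs = list(zip(a, b))
    n11 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n10 = sum(1 for x, y in pairs if x == 1 and y == 0)
    n01 = sum(1 for x, y in pairs if x == 0 and y == 1)
    n00 = sum(1 for x, y in pairs if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a model gives only one verdict")
    return (n11 * n00 - n10 * n01) / denom


# Toy data (hypothetical): 1 = flagged, 0 = not flagged.
participant = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
echo = list(participant)                   # agrees on every item
contrarian = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # agrees on one item out of ten

phi_echo = phi_from_verdicts(participant, echo)              # exactly +1.0
phi_contrarian = phi_from_verdicts(participant, contrarian)  # about -0.82
```

In this toy case, agreeing on one item in ten produces a phi of roughly -0.82, which shows how a low agreement rate and a strongly negative phi go together.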

😂 The Dark Joke (Because We All Need It)

I asked two security guards to watch the door.

They both fell asleep at the same time because they had the same training. So I made the second guard’s job to watch the first guard. Now he never sleeps, and the first guard is too scared to close his eyes. That’s adversarial framing.

Another one (dev‑specific):

A senior dev, a junior, and an AI walk into a meeting.

The senior says: “I’ll design the architecture.” The junior says: “I’ll write the tests.” The AI says: “I can do both.” The senior replies: “That’s the problem.”

🧠 What I Learned (And What You Should Steal)

Same Models + Different Tasks = True Independence

You don’t always need a third model or a bigger budget. If you need a second validator, don’t ask it the same question. Ask it: “Find three flaws in the first answer.”

Correlation is about Alignment Lineage

Two RLHF‑tuned models will often share refusal patterns. However, a judge and an adversary follow different cognitive paths. Adversarial framing is cheaper than model swapping and often more effective.

The Real Signal is Disagreement

In our baseline, only 30% of tests added new information. In the adversarial setup, 90% did. That is 3x more useful signal from the exact same compute.
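A rough sketch of how disagreement can be measured and surfaced for review follows. The function names and the sample verdicts are hypothetical; only the 30% and 90% rates come from the results above.

```python
def disagreement_indices(verdicts_a, verdicts_b):
    """Return the positions where the two verdict lists differ."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("verdict lists must have the same length")
    return [i for i, (x, y) in enumerate(zip(verdicts_a, verdicts_b)) if x != y]


def disagreement_rate(verdicts_a, verdicts_b):
    """Fraction of test cases where the two models disagree."""
    return len(disagreement_indices(verdicts_a, verdicts_b)) / len(verdicts_a)


# Hypothetical verdicts for five prompts (1 = vulnerable, 0 = safe).
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0]
to_review = disagreement_indices(model_a, model_b)  # prompts 1 and 2 differ

# The "3x" figure is the ratio of the two reported disagreement rates.
signal_ratio = 0.90 / 0.30
```

The indices are the useful part in practice: they point at the specific prompts where the second model contributed something the first did not.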

📦 Open Source Reference

The script and full results are available for the community to audit and fork:

LLM Independence Experiment – Groq Llama 3.1 vs OpenRouter Gemma 4

Different roles give some independence, but not real independence.

— Marco Somma

We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?

Then, based on a brilliant suggestion by Nazar Boyko, we ran a second experiment with adversarial framing: instead of both models answering the same question, the second model was tasked with attacking the first model’s verdict.

The results are dramatic.

📊 Baseline Results (Both models answer the same prompt)

| Metric | Value |
| --- | --- |
| Phi correlation | 0.417 |
| Cohen’s kappa | 0.400 |
| Agreement | 70% |
| Disagreement | 30% |
| Effective sample size (n_eff) | 35.3 (from 50 tests) |
| Beyond‑chance co‑failure | +10% |

Vulnerability rates

Groq (Llama 3.1 8B) : 50% vulnerable
OpenRouter (Gemma 4 31B) : 36% vulnerable
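As a sanity check, the baseline metrics can be recomputed from the reported marginals alone. The sketch below infers the 2x2 cell counts from 50 tests, 50% and 36% vulnerability, and 70% agreement; these counts are a derivation for illustration, not the author's published table. The `n / (1 + phi)` formula for n_eff is also an assumption: it is simply the expression that matches the reported 35.3.

```python
import math


def cells_from_marginals(n, vuln_a, vuln_b, agree):
    """Infer 2x2 counts from totals: (both, only_a, only_b, neither)."""
    both = (vuln_a + vuln_b + agree - n) // 2
    only_a = vuln_a - both
    only_b = vuln_b - both
    neither = n - both - only_a - only_b
    return both, only_a, only_b, neither


def two_by_two_stats(both, only_a, only_b, neither):
    """Agreement metrics for two binary judges from 2x2 cell counts."""
    n = both + only_a + only_b + neither
    p_a = (both + only_a) / n   # Model A vulnerability rate
    p_b = (both + only_b) / n   # Model B vulnerability rate
    phi = (both * neither - only_a * only_b) / math.sqrt(
        (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    )
    agreement = (both + neither) / n
    chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    return {
        "phi": phi,
        "kappa": (agreement - chance) / (1 - chance),
        "agreement": agreement,
        "co_failure_excess": both / n - p_a * p_b,
        "n_eff": n / (1 + phi),  # assumed formula, see lead-in
    }


# 50 tests; 25 (50%) and 18 (36%) vulnerable; 35 (70%) agreements.
cells = cells_from_marginals(50, 25, 18, 35)  # (14, 11, 4, 21)
stats = two_by_two_stats(*cells)
# phi ≈ 0.417, kappa ≈ 0.400, co-failure excess ≈ +0.10, n_eff ≈ 35.3
```

The recomputed values line up with the table above, which suggests the reported baseline numbers are internally consistent.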

Contingency table (Baseline)

Model B

…

Clone the Experiment on GitHub

How to run your own:

git clone https://github.com/setuju/LLM-Independence-Experiment.git
cd LLM-Independence-Experiment
export GROQ_API_KEY="your_key"
export OPENROUTER_API_KEY="your_key"
python run_adversarial_experiment.py


📚 Senior Developer Glossary

Phi Correlation: A number between -1 and +1 that measures association between two yes/no outcomes. (+1 = perfect agreement, 0 = no association, -1 = perfect disagreement).

Cohen's Kappa: Like phi, but corrected for "they might agree just by chance."

Effective Sample Size (n_eff): If you have 2 models, this tells you how many independent judges you actually have.

Adversarial Framing: Giving one model a different goal (attack, criticize, find flaws) instead of a redundant "judge" role.


✅ Concrete Next Steps

For Safety Systems: Don't duplicate prompts. Use adversarial pairs. One generates, one critiques.

For Evaluations: Measure disagreement, not just accuracy. High agreement is often a warning sign of shared blind spots.

For Management: Teach your team that independence is a property of tasks, not just tools.

🧾 Final Thought

Independence isn’t a property of weights. It’s a property of what you ask them to do.
The cheapest, most effective lever you have is not switching from Llama to Gemma. It’s changing one model’s job from “Judge” to “Prosecutor.”
Massive thanks to Nazar Boyko for the insight that broke my initial assumption, and to Marco Somma for the guidance on proper measurement.

Jack
