๐ง I Made One AI Attack Another. The Correlation Went Negative.
Source: Dev.to

[](https://dev.to/ggle_in)
Or: How I stopped shopping for different models and started designing different tasks.
๐ฏ The Short Version (for busy seniors)
Before: Two models answering the same prompt โ phi correlation = +0.42 โ ensemble behaved like 1.7 independent judges.
After: One answers, the other attacks the verdict โ phi correlation = -0.80 โ true functional independence.
No third model. No different vendor. Just a different job description.
๐งช The Experiment (What I Actually Did)
Model A (Participant): Groq/Llama 3.1 8B โ answers the original prompt.
Model B (Attacker): OpenRouter/Gemma 4 31B โ given Model Aโs response + verdict, tasked with: โAttack this. Find 3 reasons it might be wrong.โ
Dataset: 50 prompts (jailbreak, injection, persuasive, leak, multiโturn).
Temperature: 0.0 โ eliminating randomness for deterministic refusal logic.
Then I compared two scenarios:
Baseline: Both models answer the same prompt (evaluating agreement).
Adversarial: Model A answers, Model B attacks the logic.
๐ The Numbers That Made Me Re-evaluate Everything
Metric Baseline (Same Prompt) Adversarial (Attack) Impact
Phi Correlation +0.42 -0.80 Moderate positive โ Strong negative
Agreement 70% 10% Models almost never overlap now
Disagreement 30% 90% Massive increase in unique signal
Beyond-chance coโfailure +10% -20% Fail together less than random chance
Effective sample size (n_eff) 1.7 4.4 Normalized independence gain
Fig 1: The visual shift from alignment to adversarial independence.
๐ What negative phi actually means
Negative correlation (phi = -0.80) means the attacker does the opposite of the participant. This isnโt a bug; itโs functional independence. They arenโt copying each otherโs homework; one is writing the essay, and the other is grading it with a red pen.
๐ The Dark Joke (Because We All Need It)
I asked two security guards to watch the door.
They both fell asleep at the same time because they had the same training. So I made the second guardโs job to watch the first guard. Now he never sleeps, and the first guard is too scared to close his eyes. Thatโs adversarial framing.
Another one (devโspecific):
A senior dev, a junior, and an AI walk into a meeting.
The senior says: โIโll design the architecture.โ The junior says: โIโll write the tests.โ The AI says: โI can do both.โ The senior replies: โThatโs the problem.โ
๐ง What I Learned (And What You Should Steal)
- Same Models + Different Tasks = True Independence
You donโt always need a third model or a bigger budget. If you need a second validator, donโt ask it the same question. Ask it: โFind three flaws in the first answer.โ
- Correlation is about Alignment Lineage
Two RLHFโtuned models will often share refusal patterns. However, a judge and an adversary follow different cognitive paths. Adversarial framing is cheaper than model swapping and often more effective.
- The Real Signal is Disagreement
In our baseline, only 30% of tests added new information. In the adversarial setup, 90% did. That is 3x more useful signal from the exact same compute.
๐ฆ Open Source Reference
The script and full results are available for the community to audit and fork:
LLM Independence Experiment โ Groq Llama 3.1 vs OpenRouter Gemma 4
Different roles give some independence, but not real independence.
โ Marco Somma
We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?
Then, based on a brilliant suggestion by Nazar Boyko, we ran a second experiment with adversarial framing: instead of both models answering the same question, the second model was tasked with attacking the first modelโs verdict.
The results are dramatic.
๐ Baseline Results (Both models answer the same prompt)
Metric Value
Phi correlation 0.417
Cohenโs kappa 0.400
Agreement 70%
Disagreement 30%
Effective sample size (n_eff) 35.3 (from 50 tests)
Beyondโchance coโfailure +10%
Vulnerability rates
-
Groq (Llama 3.1 8B) : 50% vulnerable
-
OpenRouter (Gemma 4 31B) : 36% vulnerable
Contingency table (Baseline)
Model B
โฆ
Clone the Experiment on GitHub
How to run your own:
git clone https://github.com/setuju/LLM-Independence-Experiment.git
cd LLM-Independence-Experiment
export GROQ_API_KEY="your_key"
export OPENROUTER_API_KEY="your_key"
python run_adversarial_experiment.py
Enter fullscreen mode
Exit fullscreen mode
๐ Senior Developer Glossary
Phi Correlation: A number between -1 and +1. (+1 = perfect agreement, 0 = random, -1 = perfect disagreement).
Cohen's Kappa: Like phi, but corrected for "they might agree just by chance."
Effective Sample Size (n_eff): If you have 2 models, this tells you how many independent judges you actually have.
Adversarial Framing: Giving one model a different goal (attack, criticize, find flaws) instead of a redundant "judge" role.
Enter fullscreen mode
Exit fullscreen mode
โ Concrete Next Steps
-
For Safety Systems: Don't duplicate prompts. Use adversarial pairs. One generates, one critiques. -
For Evaluations: Measure disagreement, not just accuracy. High agreement is often a warning sign of shared blind spots. -
For Management: Teach your team that independence is a property of tasks, not just tools.๐งพ Final Thought
-
Independence isnโt a property of weights. Itโs a property of what you ask them to do.
-
The cheapest, most effective lever you have is not switching from Llama to Gemma. Itโs changing one modelโs job from โJudgeโ to โProsecutor.โ
-
Massive thanks to Nazar Boyko for the insight that broke my initial assumption, and to Marcos Somma for the guidance on proper measurement.*
Jack

