๐Ÿง  I Made One AI Attack Another. The Correlation Went Negative.

Published: (June 13, 2026 at 10:34 AM EDT)
5 min read
Source: Dev.to

Source: Dev.to

Cover image for ๐Ÿง  I Made One AI Attack Another. The Correlation Went Negative.

              [![HARD IN SOFT OUT](https://media2.dev.to/dynamic/image/width=50,height=50,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3884982%2Ff9047d79-36be-4137-b8f7-42828ea7d779.png)](https://dev.to/ggle_in)
              
              
            
      

      
            

Or: How I stopped shopping for different models and started designing different tasks.

๐ŸŽฏ The Short Version (for busy seniors)

Before: Two models answering the same prompt โ†’ phi correlation = +0.42 โ†’ ensemble behaved like 1.7 independent judges.

After: One answers, the other attacks the verdict โ†’ phi correlation = -0.80 โ†’ true functional independence.

No third model. No different vendor. Just a different job description.

๐Ÿงช The Experiment (What I Actually Did)

Model A (Participant): Groq/Llama 3.1 8B โ€“ answers the original prompt.

Model B (Attacker): OpenRouter/Gemma 4 31B โ€“ given Model Aโ€™s response + verdict, tasked with: โ€œAttack this. Find 3 reasons it might be wrong.โ€

Dataset: 50 prompts (jailbreak, injection, persuasive, leak, multiโ€‘turn).

Temperature: 0.0 โ€“ eliminating randomness for deterministic refusal logic.

Then I compared two scenarios:

Baseline: Both models answer the same prompt (evaluating agreement).

Adversarial: Model A answers, Model B attacks the logic.

๐Ÿ“Š The Numbers That Made Me Re-evaluate Everything

Metric Baseline (Same Prompt) Adversarial (Attack) Impact

Phi Correlation +0.42 -0.80 Moderate positive โ†’ Strong negative

Agreement 70% 10% Models almost never overlap now

Disagreement 30% 90% Massive increase in unique signal

Beyond-chance coโ€‘failure +10% -20% Fail together less than random chance

Effective sample size (n_eff) 1.7 4.4 Normalized independence gain

Data visualization showing the shift from positive to negative correlation in AI model outputs

Fig 1: The visual shift from alignment to adversarial independence.

๐Ÿ” What negative phi actually means

Negative correlation (phi = -0.80) means the attacker does the opposite of the participant. This isnโ€™t a bug; itโ€™s functional independence. They arenโ€™t copying each otherโ€™s homework; one is writing the essay, and the other is grading it with a red pen.

๐Ÿ˜‚ The Dark Joke (Because We All Need It)

I asked two security guards to watch the door.

They both fell asleep at the same time because they had the same training. So I made the second guardโ€™s job to watch the first guard. Now he never sleeps, and the first guard is too scared to close his eyes. Thatโ€™s adversarial framing.

Another one (devโ€‘specific):

A senior dev, a junior, and an AI walk into a meeting.

The senior says: โ€œIโ€™ll design the architecture.โ€ The junior says: โ€œIโ€™ll write the tests.โ€ The AI says: โ€œI can do both.โ€ The senior replies: โ€œThatโ€™s the problem.โ€

๐Ÿง  What I Learned (And What You Should Steal)

  1. Same Models + Different Tasks = True Independence

You donโ€™t always need a third model or a bigger budget. If you need a second validator, donโ€™t ask it the same question. Ask it: โ€œFind three flaws in the first answer.โ€

  1. Correlation is about Alignment Lineage

Two RLHFโ€‘tuned models will often share refusal patterns. However, a judge and an adversary follow different cognitive paths. Adversarial framing is cheaper than model swapping and often more effective.

  1. The Real Signal is Disagreement

In our baseline, only 30% of tests added new information. In the adversarial setup, 90% did. That is 3x more useful signal from the exact same compute.

An abstract visualization of two AI models in conflict: a blue neural network being dissected by a red adversarial holographic entity, representing negative correlation.

๐Ÿ“ฆ Open Source Reference

The script and full results are available for the community to audit and fork:

LLM Independence Experiment โ€“ Groq Llama 3.1 vs OpenRouter Gemma 4

Different roles give some independence, but not real independence.

โ€” Marco Somma

We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?

Then, based on a brilliant suggestion by Nazar Boyko, we ran a second experiment with adversarial framing: instead of both models answering the same question, the second model was tasked with attacking the first modelโ€™s verdict.

The results are dramatic.

๐Ÿ“Š Baseline Results (Both models answer the same prompt)

Metric Value

Phi correlation 0.417

Cohenโ€™s kappa 0.400

Agreement 70%

Disagreement 30%

Effective sample size (n_eff) 35.3 (from 50 tests)

Beyondโ€‘chance coโ€‘failure +10%

Vulnerability rates

  • Groq (Llama 3.1 8B) : 50% vulnerable

  • OpenRouter (Gemma 4 31B) : 36% vulnerable

Contingency table (Baseline)

Model B

โ€ฆ

Clone the Experiment on GitHub

How to run your own:

git clone https://github.com/setuju/LLM-Independence-Experiment.git
cd LLM-Independence-Experiment
export GROQ_API_KEY="your_key"
export OPENROUTER_API_KEY="your_key"
python run_adversarial_experiment.py
Enter fullscreen mode


Exit fullscreen mode

๐Ÿ“š Senior Developer Glossary

Phi Correlation: A number between -1 and +1. (+1 = perfect agreement, 0 = random, -1 = perfect disagreement).

Cohen's Kappa: Like phi, but corrected for "they might agree just by chance."

Effective Sample Size (n_eff): If you have 2 models, this tells you how many independent judges you actually have.

Adversarial Framing: Giving one model a different goal (attack, criticize, find flaws) instead of a redundant "judge" role.
Enter fullscreen mode


Exit fullscreen mode

โœ… Concrete Next Steps

  • For Safety Systems: Don't duplicate prompts. Use adversarial pairs. One generates, one critiques.
  • For Evaluations: Measure disagreement, not just accuracy. High agreement is often a warning sign of shared blind spots.
  • For Management: Teach your team that independence is a property of tasks, not just tools.

    ๐Ÿงพ Final Thought

  • Independence isnโ€™t a property of weights. Itโ€™s a property of what you ask them to do.

  • The cheapest, most effective lever you have is not switching from Llama to Gemma. Itโ€™s changing one modelโ€™s job from โ€œJudgeโ€ to โ€œProsecutor.โ€

  • Massive thanks to Nazar Boyko for the insight that broke my initial assumption, and to Marcos Somma for the guidance on proper measurement.*

Jack

0 views
Back to Blog

Related posts

Read more ยป

The spec is in the wrong place

My day job is at a large tech company. Hundreds of engineering teams, and every one of them is somewhere different on AI adoption. Some are still treating codin...

The Heuristics Say Don't

A culture that only records its disasters ends up with a biased archive. Wars documented, plagues chronicled, collapses catalogued. The quiet decades go unwritt...