How I Use 19 AI Agents to Design Physics Engines (Tournament Architecture)

Published: February 14, 2026 at 03:30 PM EST
6 min read
Source: Dev.to

I’m building an engine simulator called PISTON.
It predicts horsepower and torque from first principles — real thermodynamics, no curve‑fitting, no fudge factors. Currently it sits at 8.08 % HP error across 22 validated engines, from a Honda Beat kei car to a Chevrolet LT4 supercharged V8.

The interesting part isn’t the physics. It’s how I build it.

Every major feature goes through a tournament:

8 planners → 8 reviewers → 3 judges

Nineteen AI agents, each working independently, compete to produce the best implementation.

The Problem with Single‑Agent Development

When a single AI agent designs and implements a complex feature, you get:

  • Anchoring bias – the first approach it thinks of dominates.
  • Blind spots – no one challenges its assumptions.
  • Local optima – it optimises within its initial framing instead of exploring alternatives.
  • Groupthink with itself – the same biases compound across design → implementation → testing.

For something like a predictive combustion model (where a wrong burn‑rate equation can add 30 % error), one agent isn’t enough.

The Tournament Structure

Phase 1: Planning (8 Agents)

Eight independent planners each receive an identical brief:

  • Feature description (e.g., “Exhaust Tuning Model”)
  • Technical requirements (e.g., “Method of Characteristics wave propagation”)
  • Integration constraints (how it fits the existing codebase)
  • Validation targets (expected accuracy improvement)

Each planner produces a complete design document covering data structures, algorithms, equations, file organisation, and test strategy. Planners work in isolation – no planner sees another’s output.

Why 8?
Eight gives genuine diversity of approach. With fewer you get variations on a theme; with eight you reliably see 3‑4 fundamentally different architectures.
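As a sketch, the planning fan-out needs nothing more than firing the same brief at N isolated calls. A minimal Python version (the `plan` function is a stand-in for a real model call, and the brief text is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# One brief, verbatim, for every planner (feature text is illustrative).
BRIEF = """Feature: Exhaust Tuning Model
Requirements: Method of Characteristics wave propagation
Constraints: integrate with the existing simulation loop
Validation: measurable accuracy improvement on the validated engine set"""

def plan(agent_id: int, brief: str) -> dict:
    # Stand-in for one isolated LLM call. A real planner receives the brief
    # as its only context -- it never sees another planner's output.
    return {"agent": agent_id, "design": f"design doc {agent_id}: {brief.splitlines()[0]}"}

def run_planning_phase(n_planners: int = 8) -> list[dict]:
    # Isolation falls out of the structure: every call gets the identical
    # brief and nothing else, so no cross-contamination is possible.
    with ThreadPoolExecutor(max_workers=n_planners) as pool:
        futures = [pool.submit(plan, i, BRIEF) for i in range(n_planners)]
        return [f.result() for f in futures]
```

The parallelism is incidental; what matters is that no planner's output ever appears in another planner's input.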

Phase 2: Review (8 Agents)

Eight independent reviewers each receive all eight plans. Their job is to:

  1. Score each plan on five dimensions:
    • Physics accuracy
    • Code quality
    • Performance
    • Maintainability
    • Integration risk
  2. Identify the strongest elements across all plans.
  3. Recommend a hybrid that combines the best pieces.
  4. Flag physics errors or misconceptions.

Reviews are brutal. Typical findings include:

  • “Plan C uses adiabatic flame temperature without dissociation corrections — this will over‑predict NOₓ by 40 %.”
  • “Plan F’s data structure requires O(n²) traversal per crank‑angle step — unacceptable at 720 steps per cycle.”
  • “Plans A, D, and G all use the same Woschni correlation but with different coefficient conventions — only D’s is correct.”
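The scoring half of the review phase reduces to a small aggregation. A toy sketch, assuming a 1–10 scale and equal weighting across the five dimensions and across reviewers (the article does not specify the actual weighting):

```python
from statistics import mean

# The five rubric dimensions from the review brief.
DIMENSIONS = ("physics_accuracy", "code_quality", "performance",
              "maintainability", "integration_risk")

def aggregate_scores(reviews):
    """Average each plan's rubric scores across dimensions, then across reviewers.

    `reviews` is a list (one entry per reviewer) of {plan_id: {dimension: score}}.
    """
    plan_ids = reviews[0].keys()
    return {
        pid: mean(mean(review[pid][d] for d in DIMENSIONS) for review in reviews)
        for pid in plan_ids
    }

# Toy example: two reviewers scoring two plans (1-10 scale assumed).
reviews = [
    {"A": dict.fromkeys(DIMENSIONS, 7.0), "B": dict.fromkeys(DIMENSIONS, 9.0)},
    {"A": dict.fromkeys(DIMENSIONS, 6.0), "B": dict.fromkeys(DIMENSIONS, 8.0)},
]
scores = aggregate_scores(reviews)  # plan B outscores plan A
```

The numeric scores only rank the field; the hybrid recommendations and flagged physics errors carry most of the value downstream.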

Phase 3: Judging (3 Agents)

Three judges receive all eight plans and all eight reviews. Each judge independently:

  • Selects a winner (or recommends a hybrid of specific elements).
  • Writes a detailed justification.
  • Provides concrete implementation guidance.

Decision logic

| Judges' Agreement | Outcome |
| --- | --- |
| All 3 agree | Adopt that plan. |
| 2 out of 3 agree | Adopt the majority pick, note the dissent. |
| No agreement | Run a second round with clarified criteria. |

Real Example: Predictive Combustion

The combustion‑model tournament was the most consequential. It replaced our Wiebe curve‑fitting (essentially a lookup table) with a physics‑based burn‑rate prediction.

Plans Produced

| Plans | Approach |
| --- | --- |
| 2 | Tabaczynski entrainment‑burnup (the eventual winner) |
| 2 | Fractal flame models |
| 1 | Quasi‑dimensional with PDF |
| 1 | Blizard‑Keck |
| 1 | Eddy‑burnup with k‑ε turbulence |
| 1 | Hybrid approach |

Key Reviewer Findings

  • Tabaczynski + Zimont turbulent flame speed offered the strongest physics foundation.
  • Fractal approaches were elegant but ≈ 3× more complex to implement.
  • Two plans contained laminar flame‑speed errors (Metghalchi‑Keck vs. Gülder – reviewers caught that Gülder needed different curve‑fit coefficients).

Judges’ Decision

All three judges unanimously selected the Tabaczynski entrainment‑burnup plan, with the following specifics:

  • Zimont turbulent flame speed (calibration coefficient A_z = 0.56)
  • k‑K turbulence model (tumble/swirl‑aware, C_K = 0.50)
  • Metghalchi‑Keck laminar flame speed
  • Sensitivity tests: spark timing, compression ratio, cam timing

Two independent calibration runs later converged to A_z = 0.52 and 0.56. The final model predicts combustion solely from engine geometry — no per‑engine tuning required.
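The article doesn't print the correlation itself, but the commonly cited form of the Zimont turbulent flame speed is straightforward to sketch (variable names, units, and sample inputs are mine):

```python
def zimont_flame_speed(u_prime: float, s_l: float, alpha_u: float,
                       l_t: float, a_z: float = 0.56) -> float:
    """Zimont correlation in its commonly cited form:

        S_T = A_z * u'^(3/4) * S_L^(1/2) * alpha_u^(-1/4) * l_t^(1/4)

    u_prime : turbulence intensity u' [m/s]
    s_l     : laminar flame speed, e.g. from Metghalchi-Keck [m/s]
    alpha_u : unburned-gas thermal diffusivity [m^2/s]
    l_t     : integral turbulence length scale [m]
    a_z     : calibration coefficient (0.52-0.56 in the runs above)
    """
    return a_z * u_prime**0.75 * s_l**0.5 * alpha_u**-0.25 * l_t**0.25
```

Note the sensitivity the exponents imply: doubling the turbulence intensity scales S_T by 2^(3/4), which is exactly why the calibration coefficient was worth two independent runs.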

Result: 8.3 % HP MAPE, within 1 % of the previous curve‑fitted approach, but now it generalises to engines it hasn’t seen.
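For reference, the MAPE figure quoted above is just the mean absolute percentage error over the validated data points; a minimal sketch (the function name and sample numbers are illustrative):

```python
def hp_mape(predicted, measured):
    """Mean absolute percentage error -- the headline accuracy metric.

    Each (predicted, measured) pair is one validated dyno data point.
    """
    pairs = list(zip(predicted, measured))
    return 100.0 * sum(abs(p - m) / m for p, m in pairs) / len(pairs)

# Two toy data points: +8 % and -5 % error average to 6.5 % MAPE.
example = hp_mape([108.0, 95.0], [100.0, 100.0])
```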

Why This Works

  1. Genuine Diversity – Eight agents independently tackling the same problem produce fundamentally different solutions, not just “eight slight variations of GPT’s first instinct.”
  2. Adversarial Review – Reviewers have every incentive to find flaws; they never review their own work.
  3. Synthesis Over Selection – Hybrids (“Take Plan C’s data structures, Plan A’s core algorithm, and Plan F’s error handling”) often outperform any single plan.
  4. Documented Reasoning – Each tournament yields ~100 pages of technical documents, preserving why a particular approach was chosen, complete with citations and quantitative comparisons.

The Numbers

Across 12 tournaments (combustion, knock, forced induction, VE/Helmholtz, exhaust tuning, heat transfer, friction, emiss…


Statistics Overview

  • Average plans per tournament: 8
  • Average reviews per tournament: 8
  • Judge agreement rate: 83 % unanimous, 17 % 2‑1 majority
  • Zero second‑round judging required (all resolved on first pass)
  • Physics errors caught by reviewers: 34 across all tournaments
  • Overall engine count validated: 22 engines, 44 data points (HP + TQ each)

When NOT to Use This

This approach is overkill for:

  • Simple features (e.g., add a CLI flag, fix a typo)
  • Well‑understood problems with clear best practices
  • Time‑critical fixes

Use it for

  • Features where wrong physics = wrong results
  • Architecture decisions that are expensive to reverse
  • Anything where “good enough” isn’t good enough

Try It Yourself

The approach works with any AI capable of technical writing. The key ingredients are:

  • Identical briefs – every planner gets the same information
  • True isolation – planners don’t see each other’s work
  • Cross‑review – reviewers see all plans, not just one
  • Independent judging – judges don’t consult each other
  • Preserved artifacts – keep everything for future reference
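Those five ingredients fit in a few lines of orchestration. A hypothetical skeleton, with `planner`, `reviewer`, and `judge` as stand-ins for whatever model calls you use:

```python
def run_tournament(brief, planner, reviewer, judge,
                   n_planners=8, n_reviewers=8, n_judges=3):
    """Minimal skeleton of the plan -> review -> judge flow.

    The skeleton only enforces the ingredients: identical briefs,
    isolated planning, cross-review, and independent judging.
    """
    plans = [planner(brief) for _ in range(n_planners)]       # same brief, no shared state
    reviews = [reviewer(plans) for _ in range(n_reviewers)]   # every reviewer sees all plans
    votes = [judge(plans, reviews) for _ in range(n_judges)]  # judges never consult each other
    winner = max(set(votes), key=votes.count)                 # unanimous or 2-1 majority pick
    return {"plans": plans, "reviews": reviews, "votes": votes, "winner": winner}
```

Everything the dict returns is a preserved artifact: write it all to disk, because the reviews and dissents are the documentation.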

The PISTON codebase is at .

  • 1,141 tests
  • 22 validated engines
  • All built through tournaments.
