How I Use 19 AI Agents to Design Physics Engines (Tournament Architecture)

Published: February 14, 2026 at 03:30 PM EST
6 min read
Source: Dev.to

I’m building an engine simulator called PISTON.
It predicts horsepower and torque from first principles — real thermodynamics, no curve‑fitting, no fudge factors. Currently it sits at 8.08 % HP error across 22 validated engines, from a Honda Beat kei car to a Chevrolet LT4 supercharged V8.

The interesting part isn’t the physics. It’s how I build it.

Every major feature goes through a tournament:

8 planners → 8 reviewers → 3 judges

Nineteen AI agents, each working independently, compete to produce the best implementation.

The Problem with Single‑Agent Development

When a single AI agent designs and implements a complex feature, you get:

  • Anchoring bias – the first approach it thinks of dominates.
  • Blind spots – no one challenges its assumptions.
  • Local optima – it optimises within its initial framing instead of exploring alternatives.
  • Groupthink with itself – the same biases compound across design → implementation → testing.

For something like a predictive combustion model (where a wrong burn‑rate equation can add 30 % error), one agent isn’t enough.

The Tournament Structure

Phase 1: Planning (8 Agents)

Eight independent planners each receive an identical brief:

  • Feature description (e.g., “Exhaust Tuning Model”)
  • Technical requirements (e.g., “Method of Characteristics wave propagation”)
  • Integration constraints (how it fits the existing codebase)
  • Validation targets (expected accuracy improvement)

Each planner produces a complete design document covering data structures, algorithms, equations, file organisation, and test strategy. Planners work in isolation – no planner sees another’s output.

Why 8?
Eight gives genuine diversity of approach. With fewer you get variations on a theme; with eight you reliably see 3‑4 fundamentally different architectures.
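As a sketch, the planning fan-out needs nothing more than firing the same brief at N isolated calls. A minimal Python version (the `plan` function is a stand-in for a real model call, and the brief text is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# One brief, verbatim, for every planner (feature text is illustrative).
BRIEF = """Feature: Exhaust Tuning Model
Requirements: Method of Characteristics wave propagation
Constraints: integrate with the existing simulation loop
Validation: measurable accuracy improvement on the validated engine set"""

def plan(agent_id: int, brief: str) -> dict:
    # Stand-in for one isolated LLM call. A real planner receives the brief
    # as its only context -- it never sees another planner's output.
    return {"agent": agent_id, "design": f"design doc {agent_id}: {brief.splitlines()[0]}"}

def run_planning_phase(n_planners: int = 8) -> list[dict]:
    # Isolation falls out of the structure: every call gets the identical
    # brief and nothing else, so no cross-contamination is possible.
    with ThreadPoolExecutor(max_workers=n_planners) as pool:
        futures = [pool.submit(plan, i, BRIEF) for i in range(n_planners)]
        return [f.result() for f in futures]
```

The parallelism is incidental; what matters is that no planner's output ever appears in another planner's input.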

Phase 2: Review (8 Agents)

Eight independent reviewers each receive all eight plans. Their job is to:

  1. Score each plan on five dimensions:
    • Physics accuracy
    • Code quality
    • Performance
    • Maintainability
    • Integration risk
  2. Identify the strongest elements across all plans.
  3. Recommend a hybrid that combines the best pieces.
  4. Flag physics errors or misconceptions.

Reviews are brutal. Typical findings include:

  • “Plan C uses adiabatic flame temperature without dissociation corrections — this will over‑predict NOₓ by 40 %.”
  • “Plan F’s data structure requires O(n²) traversal per crank‑angle step — unacceptable at 720 steps per cycle.”
  • “Plans A, D, and G all use the same Woschni correlation but with different coefficient conventions — only D’s is correct.”
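The scoring half of the review phase reduces to a small aggregation. A toy sketch, assuming a 1–10 scale and equal weighting across the five dimensions and across reviewers (the article does not specify the actual weighting):

```python
from statistics import mean

# The five rubric dimensions from the review brief.
DIMENSIONS = ("physics_accuracy", "code_quality", "performance",
              "maintainability", "integration_risk")

def aggregate_scores(reviews):
    """Average each plan's rubric scores across dimensions, then across reviewers.

    `reviews` is a list (one entry per reviewer) of {plan_id: {dimension: score}}.
    """
    plan_ids = reviews[0].keys()
    return {
        pid: mean(mean(review[pid][d] for d in DIMENSIONS) for review in reviews)
        for pid in plan_ids
    }

# Toy example: two reviewers scoring two plans (1-10 scale assumed).
reviews = [
    {"A": dict.fromkeys(DIMENSIONS, 7.0), "B": dict.fromkeys(DIMENSIONS, 9.0)},
    {"A": dict.fromkeys(DIMENSIONS, 6.0), "B": dict.fromkeys(DIMENSIONS, 8.0)},
]
scores = aggregate_scores(reviews)  # plan B outscores plan A
```

The numeric scores only rank the field; the hybrid recommendations and flagged physics errors carry most of the value downstream.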

Phase 3: Judging (3 Agents)

Three judges receive all eight plans and all eight reviews. Each judge independently:

  • Selects a winner (or recommends a hybrid of specific elements).
  • Writes a detailed justification.
  • Provides concrete implementation guidance.

Decision logic

| Judges' Agreement | Outcome |
| --- | --- |
| All 3 agree | Adopt that plan. |
| 2 out of 3 agree | Adopt the majority pick, note the dissent. |
| No agreement | Run a second round with clarified criteria. |

Real Example: Predictive Combustion

The combustion‑model tournament was the most consequential. It replaced our Wiebe curve‑fitting (essentially a lookup table) with a physics‑based burn‑rate prediction.

Plans Produced

| Plans | Approach |
| --- | --- |
| 2 | Tabaczynski entrainment‑burnup (the eventual winner) |
| 2 | Fractal flame models |
| 1 | Quasi‑dimensional with PDF |
| 1 | Blizard‑Keck |
| 1 | Eddy‑burnup with k‑ε turbulence |
| 1 | Hybrid approach |

Key Reviewer Findings

  • Tabaczynski + Zimont turbulent flame speed offered the strongest physics foundation.
  • Fractal approaches were elegant but ≈ 3× more complex to implement.
  • Two plans contained laminar flame‑speed errors (Metghalchi‑Keck vs. Gülder – reviewers caught that Gülder needed different curve‑fit coefficients).

Judges’ Decision

All three judges unanimously selected the Tabaczynski entrainment‑burnup plan, with the following specifics:

  • Zimont turbulent flame speed (calibration coefficient A_z = 0.56)
  • k‑K turbulence model (tumble/swirl‑aware, C_K = 0.50)
  • Metghalchi‑Keck laminar flame speed
  • Sensitivity tests: spark timing, compression ratio, cam timing

Two independent calibration runs later converged to A_z = 0.52 and 0.56. The final model predicts combustion solely from engine geometry — no per‑engine tuning required.
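The article doesn't print the correlation itself, but the commonly cited form of the Zimont turbulent flame speed is straightforward to sketch (variable names, units, and sample inputs are mine):

```python
def zimont_flame_speed(u_prime: float, s_l: float, alpha_u: float,
                       l_t: float, a_z: float = 0.56) -> float:
    """Zimont correlation in its commonly cited form:

        S_T = A_z * u'^(3/4) * S_L^(1/2) * alpha_u^(-1/4) * l_t^(1/4)

    u_prime : turbulence intensity u' [m/s]
    s_l     : laminar flame speed, e.g. from Metghalchi-Keck [m/s]
    alpha_u : unburned-gas thermal diffusivity [m^2/s]
    l_t     : integral turbulence length scale [m]
    a_z     : calibration coefficient (0.52-0.56 in the runs above)
    """
    return a_z * u_prime**0.75 * s_l**0.5 * alpha_u**-0.25 * l_t**0.25
```

Note the sensitivity the exponents imply: doubling the turbulence intensity scales S_T by 2^(3/4), which is exactly why the calibration coefficient was worth two independent runs.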

Result: 8.3 % HP MAPE, within 1 % of the previous curve‑fitted approach, but now it generalises to engines it hasn’t seen.
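For reference, the MAPE figure quoted above is just the mean absolute percentage error over the validated data points; a minimal sketch (the function name and sample numbers are illustrative):

```python
def hp_mape(predicted, measured):
    """Mean absolute percentage error -- the headline accuracy metric.

    Each (predicted, measured) pair is one validated dyno data point.
    """
    pairs = list(zip(predicted, measured))
    return 100.0 * sum(abs(p - m) / m for p, m in pairs) / len(pairs)

# Two toy data points: +8 % and -5 % error average to 6.5 % MAPE.
example = hp_mape([108.0, 95.0], [100.0, 100.0])
```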

Why This Works

  1. Genuine Diversity – Eight agents independently tackling the same problem produce fundamentally different solutions, not just “eight slight variations of GPT’s first instinct.”
  2. Adversarial Review – Reviewers have every incentive to find flaws; they never review their own work.
  3. Synthesis Over Selection – Hybrids (“Take Plan C’s data structures, Plan A’s core algorithm, and Plan F’s error handling”) often outperform any single plan.
  4. Documented Reasoning – Each tournament yields ~100 pages of technical documents, preserving why a particular approach was chosen, complete with citations and quantitative comparisons.

The Numbers

Across 12 tournaments (combustion, knock, forced induction, VE/Helmholtz, exhaust tuning, heat transfer, friction, emiss…


Statistics Overview

  • Average plans per tournament: 8
  • Average reviews per tournament: 8
  • Judge agreement rate: 83 % unanimous, 17 % 2‑1 majority
  • Zero second‑round judging required (all resolved on first pass)
  • Physics errors caught by reviewers: 34 across all tournaments
  • Overall engine count validated: 22 engines, 44 data points (HP + TQ each)

When NOT to Use This

This approach is overkill for:

  • Simple features (e.g., add a CLI flag, fix a typo)
  • Well‑understood problems with clear best practices
  • Time‑critical fixes

Use it for

  • Features where wrong physics = wrong results
  • Architecture decisions that are expensive to reverse
  • Anything where “good enough” isn’t good enough

Try It Yourself

The approach works with any AI capable of technical writing. The key ingredients are:

  • Identical briefs – every planner gets the same information
  • True isolation – planners don’t see each other’s work
  • Cross‑review – reviewers see all plans, not just one
  • Independent judging – judges don’t consult each other
  • Preserved artifacts – keep everything for future reference
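Those five ingredients fit in a few lines of orchestration. A hypothetical skeleton, with `planner`, `reviewer`, and `judge` as stand-ins for whatever model calls you use:

```python
def run_tournament(brief, planner, reviewer, judge,
                   n_planners=8, n_reviewers=8, n_judges=3):
    """Minimal skeleton of the plan -> review -> judge flow.

    The skeleton only enforces the ingredients: identical briefs,
    isolated planning, cross-review, and independent judging.
    """
    plans = [planner(brief) for _ in range(n_planners)]       # same brief, no shared state
    reviews = [reviewer(plans) for _ in range(n_reviewers)]   # every reviewer sees all plans
    votes = [judge(plans, reviews) for _ in range(n_judges)]  # judges never consult each other
    winner = max(set(votes), key=votes.count)                 # unanimous or 2-1 majority pick
    return {"plans": plans, "reviews": reviews, "votes": votes, "winner": winner}
```

Everything the dict returns is a preserved artifact: write it all to disk, because the reviews and dissents are the documentation.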

The PISTON codebase is at .

  • 1,141 tests
  • 22 validated engines
  • All built through tournaments.
