How I Use 19 AI Agents to Design Physics Engines (Tournament Architecture)
Source: Dev.to
I’m building an engine simulator called PISTON.
It predicts horsepower and torque from first principles — real thermodynamics, no curve‑fitting, no fudge factors. Currently it sits at 8.08 % HP error across 22 validated engines, from a Honda Beat kei car to a Chevrolet LT4 supercharged V8.
The interesting part isn’t the physics. It’s how I build it.
Every major feature goes through a tournament:
8 planners → 8 reviewers → 3 judges
Nineteen AI agents, each working independently, compete to produce the best implementation.
The Problem with Single‑Agent Development
When a single AI agent designs and implements a complex feature, you get:
- Anchoring bias – the first approach it thinks of dominates.
- Blind spots – no one challenges its assumptions.
- Local optima – it optimises within its initial framing instead of exploring alternatives.
- Groupthink with itself – the same biases compound across design → implementation → testing.
For something like a predictive combustion model (where a wrong burn‑rate equation can add 30 % error), one agent isn’t enough.
The Tournament Structure
Phase 1: Planning (8 Agents)
Eight independent planners each receive an identical brief:
- Feature description (e.g., “Exhaust Tuning Model”)
- Technical requirements (e.g., “Method of Characteristics wave propagation”)
- Integration constraints (how it fits the existing codebase)
- Validation targets (expected accuracy improvement)
Each planner produces a complete design document covering data structures, algorithms, equations, file organisation, and test strategy. Planners work in isolation – no planner sees another’s output.
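The identical-brief idea can be sketched as a small data structure; the field names and schema below are illustrative assumptions, not PISTON's actual format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlannerBrief:
    """Identical brief handed to every planner. Field names are
    illustrative, not PISTON's actual schema."""
    feature: str                    # e.g. "Exhaust Tuning Model"
    requirements: tuple             # e.g. ("Method of Characteristics wave propagation",)
    integration_constraints: tuple  # how it fits the existing codebase
    validation_targets: dict        # metric -> target, e.g. {"hp_mape_pct": 8.0}

brief = PlannerBrief(
    feature="Exhaust Tuning Model",
    requirements=("Method of Characteristics wave propagation",),
    integration_constraints=("Plug into the existing per-crank-angle solver loop",),
    validation_targets={"hp_mape_pct": 8.0},
)
```

Freezing the dataclass makes the "identical brief" guarantee explicit: no planner can mutate the brief the others receive.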
Why 8?
Eight gives genuine diversity of approach. With fewer you get variations on a theme; with eight you reliably see 3‑4 fundamentally different architectures.
Phase 2: Review (8 Agents)
Eight independent reviewers each receive all eight plans. Their job is to:
- Score each plan on five dimensions:
  - Physics accuracy
  - Code quality
  - Performance
  - Maintainability
  - Integration risk
- Identify the strongest elements across all plans.
- Recommend a hybrid that combines the best pieces.
- Flag physics errors or misconceptions.
Reviews are brutal. Typical findings include:
- “Plan C uses adiabatic flame temperature without dissociation corrections — this will over‑predict NOₓ by 40 %.”
- “Plan F’s data structure requires O(n²) traversal per crank‑angle step — unacceptable at 720 steps per cycle.”
- “Plans A, D, and G all use the same Woschni correlation but with different coefficient conventions — only D’s is correct.”
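One way to fold the reviewers' five-dimension scores into a per-plan ranking is a simple mean across reviewers and dimensions; this is a minimal sketch under my own equal-weighting assumption, not PISTON's actual aggregation:

```python
from statistics import mean

# The five dimensions from the review phase.
DIMENSIONS = ("physics_accuracy", "code_quality", "performance",
              "maintainability", "integration_risk")

def rank_plans(reviews):
    """reviews: {plan_id: [per-reviewer dicts of dimension -> score]}.
    Returns plan ids sorted best-first by mean score across all
    reviewers and dimensions (equal weights assumed)."""
    totals = {
        plan_id: mean(r[d] for r in plan_reviews for d in DIMENSIONS)
        for plan_id, plan_reviews in reviews.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

reviews = {
    "A": [{d: 7 for d in DIMENSIONS}, {d: 6 for d in DIMENSIONS}],
    "B": [{d: 9 for d in DIMENSIONS}, {d: 8 for d in DIMENSIONS}],
}
print(rank_plans(reviews))  # → ['B', 'A']
```

In practice you might weight physics accuracy more heavily than the rest; the point is that the ranking is computed from all reviewers, not picked by any one of them.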
Phase 3: Judging (3 Agents)
Three judges receive all eight plans and all eight reviews. Each judge independently:
- Selects a winner (or recommends a hybrid of specific elements).
- Writes a detailed justification.
- Provides concrete implementation guidance.
Decision logic
| Judges’ Agreement | Outcome |
|---|---|
| All 3 agree | Adopt that plan. |
| 2 out of 3 agree | Adopt the majority, note the dissent. |
| No agreement | Run a second round with clarified criteria. |
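The decision table above is mechanical enough to express as a small function; a sketch (the return labels are my naming, not PISTON's):

```python
from collections import Counter

def adjudicate(votes):
    """votes: the three judges' picks (plan ids or hybrid labels).
    Unanimous -> adopt; 2-1 -> adopt majority and note the dissent;
    no agreement -> second round with clarified criteria."""
    (winner, count), *_ = Counter(votes).most_common()
    if count == 3:
        return ("adopt", winner, None)
    if count == 2:
        dissent = next(v for v in votes if v != winner)
        return ("adopt_with_dissent", winner, dissent)
    return ("second_round", None, None)

print(adjudicate(["B", "B", "B"]))  # → ('adopt', 'B', None)
print(adjudicate(["B", "B", "C"]))  # → ('adopt_with_dissent', 'B', 'C')
print(adjudicate(["A", "B", "C"]))  # → ('second_round', None, None)
```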
Real Example: Predictive Combustion
The combustion‑model tournament was the most consequential. It replaced our Wiebe curve‑fitting (essentially a lookup table) with a physics‑based burn‑rate prediction.
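For context, the Wiebe function being replaced is a fitted S-curve for mass fraction burned versus crank angle. A minimal sketch, with typical textbook coefficients (a ≈ 5, m ≈ 2 are common defaults, not PISTON's values):

```python
import math

def wiebe_mfb(theta, theta_soc, burn_duration, a=5.0, m=2.0):
    """Mass fraction burned vs crank angle theta (degrees):
        x_b = 1 - exp(-a * ((theta - theta_soc) / burn_duration)^(m+1))
    theta_soc is start of combustion. a and m are curve-fit
    parameters -- exactly the per-engine tuning a predictive
    burn-rate model eliminates."""
    if theta <= theta_soc:
        return 0.0
    frac = min((theta - theta_soc) / burn_duration, 1.0)
    return 1.0 - math.exp(-a * frac ** (m + 1))
```

The catch is visible in the signature: `a` and `m` must be fitted per engine, which is why the article calls it "essentially a lookup table."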
Plans Produced
| Count | Approach |
|---|---|
| 2 | Tabaczynski entrainment‑burnup (the eventual winner) |
| 2 | Fractal flame models |
| 1 | Quasi‑dimensional with PDF |
| 1 | Blizard‑Keck |
| 1 | Eddy‑burnup with k‑ε turbulence |
| 1 | Hybrid approach |
Key Reviewer Findings
- Tabaczynski + Zimont turbulent flame speed offered the strongest physics foundation.
- Fractal approaches were elegant but ≈ 3× more complex to implement.
- Two plans contained laminar flame‑speed errors: they mixed up the Metghalchi‑Keck and Gülder correlations, and reviewers caught that Gülder's form needs its own curve‑fit coefficients.
Judges’ Decision
All three judges unanimously selected the Tabaczynski entrainment‑burnup plan, with the following specifics:
- Zimont turbulent flame speed (calibration coefficient A_z = 0.56)
- k‑K turbulence model (tumble/swirl‑aware, C_K = 0.50)
- Metghalchi‑Keck laminar flame speed
- Sensitivity tests: spark timing, compression ratio, cam timing
Two independent calibration runs later converged to A_z = 0.52 and 0.56. The final model predicts combustion solely from engine geometry — no per‑engine tuning required.
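In schematic form, the winning combination looks like the standard two-equation entrainment-burnup model from the literature. This is a hedged sketch of that textbook structure; PISTON's actual formulation, turbulence inputs, and constants will differ:

```python
import math

def zimont_turbulent_flame_speed(u_prime, s_laminar, alpha_u, l_turb, A_z=0.56):
    """Zimont correlation:
        S_t = A_z * u'^0.75 * S_L^0.5 * alpha_u^-0.25 * l_t^0.25
    u_prime: turbulence intensity [m/s], s_laminar: laminar flame
    speed [m/s], alpha_u: unburned-gas thermal diffusivity [m^2/s],
    l_turb: integral length scale [m]. A_z = 0.56 matches the
    judges' calibration coefficient quoted above."""
    return A_z * u_prime**0.75 * math.sqrt(s_laminar) * alpha_u**-0.25 * l_turb**0.25

def entrainment_burnup_step(m_e, m_b, rho_u, a_flame, s_t, tau_b, dt):
    """One explicit-Euler step of the entrainment-burnup equations:
        dm_e/dt = rho_u * A_f * S_t      (fresh charge entrained by the flame)
        dm_b/dt = (m_e - m_b) / tau_b    (entrained mass burns on timescale tau_b)
    Returns updated (entrained mass, burned mass)."""
    m_e += rho_u * a_flame * s_t * dt
    m_b += (m_e - m_b) / tau_b * dt
    return m_e, m_b
```

In the Tabaczynski family, tau_b is typically the Taylor microscale divided by the laminar flame speed; every input here comes from geometry and operating state, which is what makes the model predictive rather than fitted.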
Result: 8.3 % HP MAPE, within 1 % of the previous curve‑fitted approach, but now it generalises to engines it hasn’t seen.
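MAPE here is the standard mean absolute percentage error over the validated engines; a one-liner sketch, assuming paired predicted/measured HP values:

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

print(mape([205, 98], [200, 100]))  # → 2.25
```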
Why This Works
- Genuine Diversity – Eight agents independently tackling the same problem produce fundamentally different solutions, not just “eight slight variations of GPT’s first instinct.”
- Adversarial Review – Reviewers have every incentive to find flaws; they never review their own work.
- Synthesis Over Selection – Hybrids (“Take Plan C’s data structures, Plan A’s core algorithm, and Plan F’s error handling”) often outperform any single plan.
- Documented Reasoning – Each tournament yields ~100 pages of technical documents, preserving why a particular approach was chosen, complete with citations and quantitative comparisons.
The Numbers
Across 12 tournaments (combustion, knock, forced induction, VE/Helmholtz, exhaust tuning, heat transfer, friction, emiss…
Statistics Overview
- Average plans per tournament: 8
- Average reviews per tournament: 8
- Judge agreement rate: 83 % unanimous, 17 % 2‑1 majority
- Zero second‑round judging required (all resolved on first pass)
- Physics errors caught by reviewers: 34 across all tournaments
- Overall engine count validated: 22 engines, 44 data points (HP + TQ each)
When NOT to Use This
This approach is overkill for:
- Simple features (e.g., add a CLI flag, fix a typo)
- Well‑understood problems with clear best practices
- Time‑critical fixes
Use it for
- Features where wrong physics = wrong results
- Architecture decisions that are expensive to reverse
- Anything where “good enough” isn’t good enough
Try It Yourself
The approach works with any AI capable of technical writing. The key ingredients are:
- Identical briefs – every planner gets the same information
- True isolation – planners don’t see each other’s work
- Cross‑review – reviewers see all plans, not just one
- Independent judging – judges don’t consult each other
- Preserved artifacts – keep everything for future reference
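Putting the five ingredients together, the whole tournament is a short pipeline. A minimal sketch; `ask_agent` is a purely hypothetical stand-in for whatever model call you use:

```python
def run_tournament(brief, ask_agent, n_planners=8, n_reviewers=8, n_judges=3):
    """Planners work in isolation, reviewers see all plans,
    judges see plans plus reviews and vote independently."""
    # Phase 1: identical brief, no cross-talk between planners.
    plans = [ask_agent(f"planner-{i}", brief) for i in range(n_planners)]
    # Phase 2: every reviewer scores *all* plans, not just one.
    reviews = [ask_agent(f"reviewer-{i}", {"brief": brief, "plans": plans})
               for i in range(n_reviewers)]
    # Phase 3: judges never consult each other.
    votes = [ask_agent(f"judge-{i}", {"plans": plans, "reviews": reviews})
             for i in range(n_judges)]
    # Preserve everything -- the artifacts are part of the output.
    return {"plans": plans, "reviews": reviews, "votes": votes}

# Toy stand-in agent so the sketch runs end to end.
result = run_tournament("Exhaust Tuning Model", lambda name, payload: f"{name}: ok")
print(len(result["plans"]), len(result["reviews"]), len(result["votes"]))  # → 8 8 3
```

Because each phase only consumes the previous phase's artifacts, isolation falls out of the structure: a planner literally has no channel to another planner's output.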
The PISTON codebase is at .
- 1,141 tests
- 22 validated engines
- All built through tournaments.
⚡