Level 0–3 Physics: From Serial Prototypes to Parallel Manifolds and GPU Constraint Solvers

Published: December 24, 2025 at 08:32 PM EST
6 min read
Source: Dev.to

TL;DR: Over the last week we advanced the physics stack for Bad Cat: Void Frontier from simple, single‑threaded prototypes to a staged, highly parallel pipeline. The stack now includes:

  • Level 1 – CPU fallback running on the Job System
  • Level 2 – Warm‑started iterative solvers with cached manifolds
  • Level 3 – Parallel manifold generation + GPU‑based constraint solve

Why a staged physics roadmap? 💡

Game physics is a wide design space. We adopted a progressive‑level approach to get practical results quickly while enabling future scale:

| Level | Description |
| --- | --- |
| Level 0 (Demo / Baseline) | Simple scene (level_0) to validate transforms, collisions, and demo assets. |
| Level 1 (CPU fallback + Job System) | Deterministic fixed‑timestep simulation with decoupled pipeline stages and parallel narrow‑phase. |
| Level 2 (Iterative constraint solver + Warm‑starting) | Cached manifolds and warm‑start impulses for faster convergence and stability. |
| Level 3 (Parallel manifolds + GPU solver) | Compute‑shader‑driven constraint solving for very high‑contact workloads. |

This staged approach allowed rapid iteration, robust testing, and clear performance goals at each step.

Quick architecture overview 🔧

Key stages

  1. Broadphase – Spatial grid produces candidate pairs.
  2. Parallel Narrowphase – Job System partitions candidate pairs; each job generates local manifolds and appends them in bulk.
  3. Manifold Cache / Warm‑Start (Level 2) – Match new manifolds against cached ones and apply warm‑start impulses.
  4. Constraint Solver
    • Levels 1/2 use an iterative (sequential‑impulse) solver.
    • Level 3 offloads contact processing to a deterministic compute shader.
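
To make this concrete, here is a rough sketch of how a single physics step could walk the four stages above. Every name in the snippet is a placeholder for illustration, not the engine's actual API.

#include <utility>
#include <vector>

// Placeholder types and functions for illustration only; the real
// PhysicsSystem API in engine/systems/physics/ differs.
struct CollisionManifold { /* contact points, normal, cached impulses, ... */ };
using EntityPair = std::pair<int, int>;

std::vector<EntityPair> broadphase_query_pairs();
std::vector<CollisionManifold> run_narrowphase_jobs(const std::vector<EntityPair>& pairs);
void warm_start_from_cache(std::vector<CollisionManifold>& manifolds);
void solve_contacts_cpu(std::vector<CollisionManifold>& manifolds, float dt);
void solve_contacts_gpu(std::vector<CollisionManifold>& manifolds, float dt);

// One fixed 1/60 s physics step, walking the four stages in order.
void physics_step(float dt, bool use_gpu_solver) {
    auto pairs     = broadphase_query_pairs();     // 1. Broadphase (spatial grid)
    auto manifolds = run_narrowphase_jobs(pairs);  // 2. Parallel narrowphase
    warm_start_from_cache(manifolds);              // 3. Manifold cache + warm-start (Level 2)
    if (use_gpu_solver)
        solve_contacts_gpu(manifolds, dt);         // 4a. GPU compute-shader solve (Level 3)
    else
        solve_contacts_cpu(manifolds, dt);         // 4b. Iterative CPU solve (Levels 1/2)
}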

Level 1 — CPU fallback & Job System 🔁

Goals: deterministic fixed‑timestep physics and a parallel narrow‑phase that scales on CPU.
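
Deterministic fixed‑timestep simulation usually means accumulating frame time and stepping the physics in whole 1/60 s increments. A generic sketch of that driver follows; the names are illustrative and the engine's TimingSystem interface may differ.

// Generic fixed-timestep driver (illustrative; not the engine's TimingSystem).
constexpr double kPhysicsStep = 1.0 / 60.0;

void run_one_physics_step(float dt);  // stands in for the staged pipeline above

void advance_physics(double frame_dt, double& accumulator) {
    accumulator += frame_dt;
    while (accumulator >= kPhysicsStep) {
        // Every step uses the same dt, which keeps the simulation deterministic.
        run_one_physics_step(static_cast<float>(kPhysicsStep));
        accumulator -= kPhysicsStep;
    }
    // Leftover time stays in the accumulator; rendering can interpolate across it.
}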

What we implemented

  • Fixed‑timestep integration (TimingSystem supplies a 1/60 s physics step).
  • Broadphase spatial grid to limit pair counts.
  • Parallel narrowphase implemented as a Job (physics_job.cpp): each worker processes a slice of pairs, builds a local std::vector, and appends it to the shared manifolds_ under a mutex.

Snippet (conceptual)

// Worker-local: gather manifolds (reserve heuristically to reduce reallocations)
std::vector<CollisionManifold> local_manifolds;
local_manifolds.reserve(slice.size() / 8 + 4);

for (auto& pair : slice) {
    CollisionManifold m;
    if (check_collision(pair, m))
        local_manifolds.push_back(m);
}

// Bulk append under lock (manifold_mutex_ in PhysicsSystem)
{
    std::lock_guard<std::mutex> lock(manifold_mutex_);
    manifolds_.insert(manifolds_.end(),
                      local_manifolds.begin(),
                      local_manifolds.end());
}

Why this works

  • Local accumulation avoids frequent synchronization and allocation churn (we reserve heuristically).
  • Bulk merge keeps lock contention low; the job records manifolds_generated for diagnostics.
  • The shared vector and mutex are exposed via PhysicsJobContext (see physics_job.cpp).
  • In our implementation ctx.manifolds and ctx.manifold_mutex are passed to each job to perform a safe bulk merge (atomic ops avoided in the hot path).
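
Putting those pieces together, the job-side contract looks roughly like the struct below. This is a simplified stand-in for the real PhysicsJobContext in physics_job.cpp, showing only the two members discussed here.

#include <mutex>
#include <vector>

struct CollisionManifold { /* contact data produced by the narrow phase */ };

// Simplified stand-in for PhysicsJobContext (see physics_job.cpp).
struct PhysicsJobContext {
    std::vector<CollisionManifold>* manifolds;      // shared output (manifolds_)
    std::mutex*                     manifold_mutex; // guards the bulk merge
};

// Each worker publishes its worker-local results exactly once per job.
void bulk_merge(PhysicsJobContext& ctx,
                const std::vector<CollisionManifold>& local_manifolds) {
    if (local_manifolds.empty())
        return;  // nothing to publish, skip the lock entirely
    std::lock_guard<std::mutex> lock(*ctx.manifold_mutex);
    ctx.manifolds->insert(ctx.manifolds->end(),
                          local_manifolds.begin(),
                          local_manifolds.end());
}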

Level 2 — Cached manifolds & iterative solvers (warm‑starting) ♻️

Level 2 focuses on contact stability and solver efficiency.

Main features

| Feature | Description |
| --- | --- |
| CachedManifold | Fixed‑size container (MAX_CONTACTS_PER_MANIFOLD = 4) stored in a ManifoldCache keyed by EntityPairKey. |
| Warm‑starting | Reuses impulse history from previous frames and pre‑applies scaled impulses to speed convergence. Implemented in warm_start_manifold(); controlled by warm_start_factor_ (default 0.8, clamped 0.0–1.0). |
| Iterative solver | Velocity‑level sequential‑impulse loop runs for solver_iterations_ (default 8, clamped 1–16), with velocity_iterations_ (default 4) and position_iterations_ (default 2) phases. |
| Pruning & stats | Stale manifolds are pruned after 3 frames (prune_stale_manifolds(3)). Warm‑start reuse is tracked via warm_start_hits_ / warm_start_misses_; timing is recorded in stage_timings_accum_.manifold_cache_us and stage_timings_accum_.warm_start_us. |

These defaults are documented in docs/specs/engine/systems/physics/constraint_solver.md. The choices balance stability and CPU cost, delivering better resting‑contact behavior and faster convergence for stacked objects and complex scenes.
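
To make the warm‑start step concrete, the sketch below pre‑applies last frame's cached impulses, scaled by the warm‑start factor, before the iterative solve begins. Types and names are simplified stand‑ins; the engine's warm_start_manifold() and CachedManifold carry more state.

#include <algorithm>

struct Vec3 { float x, y, z; };

Vec3 operator*(const Vec3& v, float s) { return {v.x * s, v.y * s, v.z * s}; }
Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(const Vec3& v) { return {-v.x, -v.y, -v.z}; }

// Simplified contact: impulse history from the previous frame plus the
// directions those impulses act along.
struct Contact {
    Vec3  normal;           // contact normal (points from body A to body B)
    Vec3  tangent;          // friction direction
    float normal_impulse;   // accumulated impulse cached from last frame
    float tangent_impulse;
};

struct Body {
    Vec3  velocity;
    float inv_mass;
    void apply_impulse(const Vec3& j) { velocity = velocity + j * inv_mass; }
};

// Pre-apply last frame's impulses, scaled by the warm-start factor (default
// 0.8, clamped to [0, 1]). The iterative solver then only corrects the
// difference, which is why stacked, resting scenes converge in fewer passes.
void warm_start_contact(Contact& c, Body& a, Body& b, float warm_start_factor) {
    const float f = std::clamp(warm_start_factor, 0.0f, 1.0f);
    c.normal_impulse  *= f;  // the solver's accumulators start from these values
    c.tangent_impulse *= f;
    const Vec3 j = c.normal * c.normal_impulse + c.tangent * c.tangent_impulse;
    a.apply_impulse(-j);
    b.apply_impulse(j);
}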

Level 3 — Parallel manifolds & GPU constraint solve ⚡️

For very high‑contact scenarios (destructible piles, crowded scenes) the CPU solver becomes a bottleneck. Level 3 targets that by parallelizing constraint processing and optionally moving the solver to the GPU.

Two complementary approaches

  1. Parallel constraint processing on CPU

    • Partition manifolds and run independent contact solves in parallel where possible.
    • Use spatial/ownership heuristics to reduce body‑write conflicts, or atomic updates for low‑contention cases (see the coloring sketch after this list).
  2. GPU compute‑shader solver

    • Pack contacts into an SSBO and run a deterministic fixed‑point compute shader that computes impulses and applies them via atomic updates on body accumulators.
    • The M6 research notes contain a prototype compute shader and discuss deterministic atomic accumulation and fixed‑point methods (docs/research/M6_COMPREHENSIVE_RESEARCH.md).
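
For the CPU path, "reducing body‑write conflicts" boils down to never letting two workers touch the same rigid body in the same batch. One standard way to do that is a greedy coloring, sketched below: manifolds sharing a body get different colors, each color's batch is solved in parallel, and batches run one after another. This illustrates the general technique, not the engine's actual partitioning code.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Each manifold is reduced to the pair of bodies it touches.
struct ManifoldRef {
    std::uint32_t body_a;
    std::uint32_t body_b;
};

// Greedy coloring: manifolds sharing a body always land in different batches,
// so every batch can be solved in parallel with no two workers writing to the
// same body. Batches are then processed one after another.
std::vector<std::vector<std::size_t>> color_manifolds(
    const std::vector<ManifoldRef>& manifolds) {
    std::unordered_map<std::uint32_t, std::uint32_t> next_color; // per body
    std::vector<std::vector<std::size_t>> batches;

    for (std::size_t i = 0; i < manifolds.size(); ++i) {
        const ManifoldRef& m = manifolds[i];
        const std::uint32_t color = std::max(next_color[m.body_a],
                                             next_color[m.body_b]);
        next_color[m.body_a] = color + 1;
        next_color[m.body_b] = color + 1;

        if (color >= batches.size())
            batches.resize(color + 1);
        batches[color].push_back(i); // index into the manifold array
    }
    return batches; // solve batches[0], then batches[1], ... each one in parallel
}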

Example GLSL snippet (conceptual; helper names and the fixed‑point accumulator layout are illustrative)

// Per-contact work item (fixed-point arithmetic for determinism)
Contact c = contacts[gid];
int rel_vel = compute_relative_velocity_fixed(c);
int impulse = compute_impulse_fixed(c, rel_vel);

// Scale the fixed-point impulse along the contact normal, then apply it
// atomically; GLSL atomicAdd takes scalar ints, so each component of the
// per-body accumulator is updated separately
ivec3 j = scale_by_normal_fixed(c, impulse);
atomicAdd(body_impulses[c.bodyA].linear_x,  j.x);
atomicAdd(body_impulses[c.bodyA].linear_y,  j.y);
atomicAdd(body_impulses[c.bodyA].linear_z,  j.z);
atomicAdd(body_impulses[c.bodyB].linear_x, -j.x);
atomicAdd(body_impulses[c.bodyB].linear_y, -j.y);
atomicAdd(body_impulses[c.bodyB].linear_z, -j.z);

The GPU path yields a 2–4× speed‑up on workloads with > 10 k contacts, while the CPU‑parallel path provides a smoother fallback on hardware without a capable GPU.

Lessons learned & next steps

| Lesson | Takeaway |
| --- | --- |
| Local batching beats per‑item locking | Reserving space and bulk‑merging reduces mutex contention dramatically. |
| Warm‑starting is essential for stability | Even a modest warm‑start factor (0.8) cuts solver iterations by ~30 % on resting piles. |
| Determinism vs. performance trade‑off | Fixed‑point arithmetic and deterministic atomics keep GPU results reproducible across frames and hardware. |
| Cache locality matters | Storing manifolds in a contiguous cache (vector of structs) improves narrow‑phase throughput on modern CPUs. |

Next steps

  • Refine conflict‑resolution heuristics for the CPU‑parallel solver.
  • Add profiling hooks to automatically switch between CPU and GPU paths based on contact count.
  • Extend the GPU solver to handle friction and restitution in a single pass.

All code referenced is available in the engine/physics/ directory of the repository. Feel free to open a PR or issue for any questions or contributions!

Returning to the GPU solver, each per‑contact work item boils down to two steps (pseudocode):

// Compute the contact impulse in fixed-point arithmetic
int impulse = compute_impulse_fixed(c, rel_vel);

// Deterministic atomic addition into per-body accumulators
apply_impulse_atomic(c.bodyA,  impulse);
apply_impulse_atomic(c.bodyB, -impulse);

Note: The M6 research draft contains details on layout packing, atomic accumulation, and deterministic considerations for replay and cross‑platform validation.

Benefits

  • Massive parallelism for thousands of contacts.
  • Deterministic fixed‑point arithmetic ensures consistent replays.

Trade‑offs & safeguards

  • Atomic updates on body accumulators must be deterministic and bounded to preserve stability.
  • Warm‑starting and per‑manifold pre‑filtering are still used to reduce redundant contact work sent to the GPU.
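
One practical ingredient behind "deterministic fixed‑point arithmetic" is converting velocities and impulses to scaled integers before accumulation: integer atomic adds are order‑independent, while floating‑point adds are not associative and so depend on thread ordering. A minimal conversion sketch follows; the 16‑bit fractional scale is an arbitrary illustrative choice, not the engine's actual one.

#include <cmath>
#include <cstdint>

// Fixed-point helpers: integer accumulators give identical results regardless
// of the order in which GPU threads perform atomicAdd. The 16-bit fractional
// scale below is illustrative only.
constexpr std::int32_t kFixedScale = 1 << 16;

inline std::int32_t to_fixed(float v) {
    return static_cast<std::int32_t>(std::lround(v * kFixedScale));
}

inline float from_fixed(std::int32_t v) {
    return static_cast<float>(v) / kFixedScale;
}

// Accumulators stay in fixed point on the GPU; conversion back to float happens
// once per step when velocities are integrated, so rounding occurs in one place.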

Performance — targets & results 📊

Target: a 50 % reduction in solver work for static stacked scenes; our runs show a typical 30–60 % reduction in iterations and wall time, depending on the scene.

  • GPU offload: moving the constraint solve to the GPU can give a > 5× speed‑up in high‑contact scenes, provided atomic‑accumulation semantics and fixed‑point scaling are tuned for deterministic behavior.

How to tune (config keys)

| Key | Description | Default | Range |
| --- | --- | --- | --- |
| physics.solver.iterations | Overall solver iterations | 8 | 1 – 16 |
| physics.solver.velocity_iterations | Velocity‑level iterations | 4 | 1 – 16 |
| physics.solver.position_iterations | Position‑correction iterations | 2 | 0 – 8 |
| physics.solver.warm_start_factor | Warm‑start scale | 0.8 | 0.0 – 1.0 |

These keys are read by PhysicsSystem::init() (see physics_system.cpp) and clamped to safe ranges during initialization. Use the debug UI to monitor the Manifolds:, WarmHits:, and WarmMiss: counters while tuning.
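
A sketch of what reading and clamping these keys can look like is shown below. The ConfigView accessors are placeholders, not the engine's real configuration API; defaults and ranges mirror the table above (see PhysicsSystem::init() for the real code).

#include <algorithm>

// Illustrative only: ConfigView and its getters are placeholders.
struct ConfigView {
    int   get_int(const char* key, int fallback) const;
    float get_float(const char* key, float fallback) const;
};

struct SolverSettings {
    int   solver_iterations;
    int   velocity_iterations;
    int   position_iterations;
    float warm_start_factor;
};

SolverSettings load_solver_settings(const ConfigView& config) {
    SolverSettings s;
    s.solver_iterations   = std::clamp(config.get_int("physics.solver.iterations", 8), 1, 16);
    s.velocity_iterations = std::clamp(config.get_int("physics.solver.velocity_iterations", 4), 1, 16);
    s.position_iterations = std::clamp(config.get_int("physics.solver.position_iterations", 2), 0, 8);
    s.warm_start_factor   = std::clamp(config.get_float("physics.solver.warm_start_factor", 0.8f), 0.0f, 1.0f);
    return s;
}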

Lessons learned & best practices ✅

  • Stage your physics design: build correctness in Level 1 first, then add warm‑starting and caching, and finally parallel/GPU paths.
  • Keep narrow‑phase parallelism worker‑local and minimize synchronization with bulk merges.
  • Use fixed‑point math for GPU solvers to make behavior reproducible across platforms.
  • Warm‑starting pays off strongly in stacked/stable scenarios.
  • Instrument manifolds and solver stats aggressively: we surface manifold counts in the debug UI and log warm‑start hits/misses. Physics timing uses SDL_GetPerformanceCounter() with small helpers (e.g., sdl_elapsed_us) and accumulates stage timings in stage_timings_accum_.manifold_cache_us and stage_timings_accum_.warm_start_us for profiling (a generic timing sketch follows this list).
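
The timing helpers mentioned above reduce to the standard SDL performance‑counter pattern. Here is a generic version; the engine's sdl_elapsed_us wrapper may differ in name and signature.

#include <SDL.h>
#include <cstdint>

// Microseconds elapsed since `start_ticks`, using SDL's high-resolution
// performance counter. A stage is timed by sampling the counter before the
// stage runs and accumulating the result into a field such as
// stage_timings_accum_.warm_start_us.
inline std::uint64_t sdl_elapsed_us_since(std::uint64_t start_ticks) {
    const std::uint64_t now = SDL_GetPerformanceCounter();
    return (now - start_ticks) * 1000000ull / SDL_GetPerformanceFrequency();
}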

Verified code pointers 🔎

The article statements were verified against these code locations and docs:

  • Parallel narrow‑phase / job logic: engine/systems/physics/physics_job.cpp (process_pair_and_append, local_manifolds, bulk merge under manifold_mutex_).
  • Manifold cache & warm‑start: engine/systems/physics/physics_system.cpp (update_manifold_cache(), warm_start_manifolds(), prune_stale_manifolds()).
  • Solver loop and iteration clamping: engine/systems/physics/physics_system.cpp (solver iterations loop, solver_iterations_, velocity_iterations_, position_iterations_ and clamping logic).
  • Config keys read in PhysicsSystem::init(): physics.solver.iterations, physics.solver.warm_start_factor, physics.solver.velocity_iterations, physics.solver.position_iterations.
  • Timing / instrumentation: stage_timings_accum_ fields and sdl_elapsed_us wrappers used to measure manifold cache & warm‑start times.
  • Constraint & solver math: docs/specs/engine/systems/physics/constraint_solver.md and docs/specs/engine/systems/physics/physics_math.md.

These references are included inline where appropriate in the article for reproducibility.

Next steps 🎯

  • Continue tuning the GPU solver’s atomic strategy and deterministic accumulation.
  • Explore hybrid scheduling (CPU handles low‑contact pairs, GPU handles bulk contacts).
  • Add a cross‑platform validation harness for determinism between CPU/GPU paths.

Acknowledgements

Thanks to the team for the rapid, focused work this week — iterating on both CPU and GPU paths and landing warm‑starting and manifold caching in time for playtests.

Author: Bad Cat Engine Team — Bad Cat: Void Frontier

Tags: #gamedev #physics #cpp #vulkan #parallelism #simulation
