Level 0 → 3 Physics: From Serial Prototypes to Parallel Manifolds and GPU Constraint Solvers

Published: December 25, 2025, 10:32 AM GMT+9

15 min read

Source: Dev.to

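TL;DR: Over the past week we evolved the physics stack for Bad Cat: Void Frontier from a simple single-threaded prototype into a staged, highly parallel pipeline. The stack now includes:

Level 1 – CPU fallback running on the job system
Level 2 – Warm-started iterative solver backed by cached manifolds
Level 3 – Parallel manifold generation + GPU-based constraint solving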

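Why a staged physics roadmap? 💡

Game physics has a broad design space. We adopted an incremental, level-based approach to get practical results quickly while leaving room for future extension:

Level	Description
Level 0 (demo / baseline)	A simple scene (`level_0`) for validating transforms, collisions, and demo assets.
Level 1 (CPU fallback + job system)	Deterministic fixed-timestep simulation with decoupled pipeline stages and a parallel narrow-phase.
Level 2 (iterative constraint solver + warm-start)	Cached manifolds and warm-start impulses for faster convergence and better stability.
Level 3 (parallel manifolds + GPU solver)	Compute-shader-based constraint solving for very high contact loads.

This staged approach gave us fast iteration, robust testing, and a clear performance target at each level.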

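Quick architecture overview 🔧

Core stages

Broadphase – a spatial grid produces candidate pairs.
Parallel Narrowphase – the job system partitions the candidate pairs; each job builds local manifolds and appends them in bulk.
Manifold Cache / Warm-Start (Level 2) – new manifolds are matched against cached ones and warm-start impulses are applied.
Constraint Solver –
- Levels 1 and 2 use an iterative (sequential-impulse) solver.
- Level 3 offloads contact processing to a deterministic compute shader.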
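
To make the broadphase stage concrete, here is a minimal sketch of a uniform spatial grid that bins 2D AABBs into cells and emits candidate pairs. The `Aabb` struct, `broadphase_pairs`, and the cell-key packing are assumptions for illustration; the post does not show the engine's grid. The pairs are sorted before returning so the result does not depend on hash-map iteration order.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical 2D bounding box; the engine's real types are not shown in the post.
struct Aabb { float min_x, min_y, max_x, max_y; };

using Pair = std::pair<std::uint32_t, std::uint32_t>;

// Returns sorted, de-duplicated pairs (first < second) whose boxes share a
// grid cell and actually overlap.
std::vector<Pair> broadphase_pairs(const std::vector<Aabb>& boxes, float cell_size) {
    assert(cell_size > 0.0f);
    auto cell_of = [cell_size](float v) {
        return static_cast<std::int32_t>(std::floor(v / cell_size));
    };
    auto key_of = [](std::int32_t cx, std::int32_t cy) {
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
               static_cast<std::uint32_t>(cy);
    };

    // Register every box in each cell it touches.
    std::unordered_map<std::uint64_t, std::vector<std::uint32_t>> cells;
    for (std::uint32_t i = 0; i < boxes.size(); ++i) {
        const Aabb& b = boxes[i];
        for (std::int32_t cx = cell_of(b.min_x); cx <= cell_of(b.max_x); ++cx)
            for (std::int32_t cy = cell_of(b.min_y); cy <= cell_of(b.max_y); ++cy)
                cells[key_of(cx, cy)].push_back(i);
    }

    // Only boxes that share a cell are tested against each other.
    std::vector<Pair> pairs;
    for (const auto& entry : cells) {
        const std::vector<std::uint32_t>& ids = entry.second;
        for (std::size_t a = 0; a < ids.size(); ++a) {
            for (std::size_t b = a + 1; b < ids.size(); ++b) {
                const Aabb& p = boxes[ids[a]];
                const Aabb& q = boxes[ids[b]];
                const bool overlap = p.min_x <= q.max_x && q.min_x <= p.max_x &&
                                     p.min_y <= q.max_y && q.min_y <= p.max_y;
                if (overlap)
                    pairs.emplace_back(std::min(ids[a], ids[b]), std::max(ids[a], ids[b]));
            }
        }
    }

    // A pair spanning several cells is found more than once; sort and de-duplicate.
    std::sort(pairs.begin(), pairs.end());
    pairs.erase(std::unique(pairs.begin(), pairs.end()), pairs.end());
    return pairs;
}
```

Cell size is the main tuning knob in a grid like this: cells much smaller than a typical body register each body in many cells, while cells much larger degrade toward an all-pairs test.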

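Level 1 – CPU fallback & job system 🔁

Goal: deterministic fixed-timestep physics and a narrow-phase that scales across CPU cores.

What we implemented

Fixed-timestep integration (TimingSystem provides a 1/60 s physics step).
A broadphase spatial grid to bound the number of pairs.
A parallel narrow-phase implemented as jobs (physics_job.cpp): each worker processes a slice of pairs, builds a local std::vector, then appends it to the shared manifolds_ under a mutex.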
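
The fixed 1/60 s step is usually driven by a time accumulator. The sketch below shows that pattern; `FixedStepClock` is a hypothetical stand-in, not the engine's TimingSystem, and the per-frame step cap is an assumption added to keep a long stall from snowballing.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the fixed-step side of a timing system.
struct FixedStepClock {
    double step_s = 1.0 / 60.0;   // physics step from the post
    double accumulator_s = 0.0;   // unsimulated time carried between frames
    int max_steps_per_frame = 8;  // assumption: cap catch-up work per frame

    // Feed one rendered frame's duration; returns how many physics steps to run now.
    int advance(double frame_dt_s) {
        assert(frame_dt_s >= 0.0);
        accumulator_s += frame_dt_s;
        int steps = 0;
        while (accumulator_s >= step_s && steps < max_steps_per_frame) {
            accumulator_s -= step_s;
            ++steps;
        }
        // If the cap was hit, drop the whole steps still owed and keep only
        // the sub-step remainder.
        if (steps == max_steps_per_frame)
            accumulator_s = std::fmod(accumulator_s, step_s);
        return steps;
    }
};
```

Each returned step would run the full pipeline with the same `step_s`, so the simulation sees identical inputs regardless of render frame rate.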

Snippet (conceptual)

// Worker‑local: gather manifolds (reserve to reduce reallocations)
std::vector<CollisionManifold> local_manifolds;
local_manifolds.reserve((chunk_end - chunk_start) / 8 + 4);

for (auto& pair : slice) {
    CollisionManifold m;
    if (check_collision(pair, m))
        local_manifolds.push_back(m);
}

// Bulk append under lock (manifold_mutex_ in PhysicsSystem)
{
    std::lock_guard<std::mutex> lock(manifold_mutex_);
    manifolds_.insert(manifolds_.end(),
                     local_manifolds.begin(),
                     local_manifolds.end());
}

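Why this works

Worker-local accumulation avoids frequent synchronization and bursts of memory allocation (we reserve heuristically).
The bulk merge keeps lock contention low, and each job records manifolds_generated for diagnostics.
The shared vector and mutex are exposed through PhysicsJobContext (see physics_job.cpp).
In our implementation, ctx.manifolds and ctx.manifold_mutex are passed to each job so the bulk merge is safe without atomics on the hot path.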
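
For readers who want to run the pattern end to end, here is a self-contained sketch using std::thread in place of the engine's job system. The `Circle` type, the circle-vs-circle `check_collision`, and `narrow_phase` are hypothetical stand-ins, and the final sort is an assumption added so the output order does not depend on thread timing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <mutex>
#include <thread>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical minimal types; the real CollisionManifold lives in the engine.
struct Circle { float x, y, r; };
struct CollisionManifold { std::size_t a, b; float depth; };
using CandidatePair = std::pair<std::size_t, std::size_t>;

// Circle-vs-circle test standing in for the engine's check_collision().
bool check_collision(const std::vector<Circle>& bodies, const CandidatePair& p,
                     CollisionManifold& out) {
    const Circle& a = bodies[p.first];
    const Circle& b = bodies[p.second];
    const float dx = b.x - a.x, dy = b.y - a.y, reach = a.r + b.r;
    const float dist2 = dx * dx + dy * dy;
    if (dist2 >= reach * reach) return false;
    out = {p.first, p.second, reach - std::sqrt(dist2)};
    return true;
}

std::vector<CollisionManifold> narrow_phase(const std::vector<Circle>& bodies,
                                            const std::vector<CandidatePair>& pairs,
                                            unsigned worker_count) {
    assert(worker_count > 0);
    std::vector<CollisionManifold> manifolds;
    std::mutex manifold_mutex;
    const std::size_t chunk = (pairs.size() + worker_count - 1) / worker_count;

    auto work = [&](std::size_t begin, std::size_t end) {
        std::vector<CollisionManifold> local;
        local.reserve((end - begin) / 8 + 4);  // same heuristic as the post
        for (std::size_t i = begin; i < end; ++i) {
            CollisionManifold m{};
            if (check_collision(bodies, pairs[i], m)) local.push_back(m);
        }
        // One lock per worker, not one per manifold.
        std::lock_guard<std::mutex> lock(manifold_mutex);
        manifolds.insert(manifolds.end(), local.begin(), local.end());
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        const std::size_t begin = std::min(pairs.size(), w * chunk);
        const std::size_t end = std::min(pairs.size(), begin + chunk);
        if (begin < end) workers.emplace_back(work, begin, end);
    }
    for (std::thread& t : workers) t.join();

    // Merge order depends on which worker locks first; sort for a stable result.
    std::sort(manifolds.begin(), manifolds.end(),
              [](const CollisionManifold& l, const CollisionManifold& r) {
                  return std::tie(l.a, l.b) < std::tie(r.a, r.b);
              });
    return manifolds;
}
```

A real job system would reuse pooled workers instead of spawning threads every frame; the locking pattern is the part that carries over.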

Level 2 — Cached manifolds & iterative solvers (warm‑starting) ♻️

Level 2 focuses on contact stability and solver efficiency.

Main features

Feature	Description
CachedManifold	A fixed-size container (`MAX_CONTACTS_PER_MANIFOLD = 4`) stored in a `ManifoldCache` keyed by `EntityPairKey`.
Warm-starting	Reuses the previous frame's impulse history and pre-applies scaled impulses to speed up convergence. Implemented in `warm_start_manifold()` and controlled by `warm_start_factor_` (default 0.8, clamped to 0.0–1.0).
Iterative solver	A velocity-level sequential-impulse loop runs for `solver_iterations_` (default 8, clamped to 1–16) and includes `velocity_iterations_` (default 4) and `position_iterations_` (default 2) phases.
Pruning & stats	Manifolds that have been stale for 3 frames are removed via `prune_stale_manifolds(3)`. Warm-start reuse is tracked with `warm_start_hits_` / `warm_start_misses_`, and timings are recorded in `stage_timings_accum_.manifold_cache_us` and `stage_timings_accum_.warm_start_us`.

These defaults are documented in docs/specs/engine/systems/physics/constraint_solver.md. They balance stability against CPU cost, improving resting-contact behavior and converging faster for stacked objects and complex scenes.
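
The cache, warm-start, and pruning features above fit together roughly as in the following sketch. Only the names `MAX_CONTACTS_PER_MANIFOLD`, `EntityPairKey`, `CachedManifold`, and `ManifoldCache` come from the post; the field layout, the key packing, the match-by-contact-count rule, and the `warm_start` / `store` / `prune_stale` signatures are assumptions for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

constexpr int MAX_CONTACTS_PER_MANIFOLD = 4;
using ImpulseArray = std::array<float, MAX_CONTACTS_PER_MANIFOLD>;
using EntityPairKey = std::uint64_t;  // assumption: two 32-bit entity ids packed together

inline EntityPairKey make_key(std::uint32_t a, std::uint32_t b) {
    if (a > b) std::swap(a, b);  // (a, b) and (b, a) map to the same entry
    return (static_cast<std::uint64_t>(a) << 32) | b;
}

struct CachedManifold {
    ImpulseArray normal_impulse{};  // accumulated impulses from the last solve
    int contact_count = 0;
    std::uint64_t last_seen_frame = 0;
};

class ManifoldCache {
public:
    explicit ManifoldCache(float warm_start_factor)
        : warm_start_factor_(std::clamp(warm_start_factor, 0.0f, 1.0f)) {}

    // Look up a pair for this frame; returns scaled impulses to pre-apply
    // (all zeros on a miss). Matching on contact count alone is a simplification.
    ImpulseArray warm_start(EntityPairKey key, int contact_count, std::uint64_t frame) {
        assert(contact_count >= 0 && contact_count <= MAX_CONTACTS_PER_MANIFOLD);
        ImpulseArray out{};
        auto it = cache_.find(key);
        if (it != cache_.end() && it->second.contact_count == contact_count) {
            ++hits_;
            for (int i = 0; i < contact_count; ++i)
                out[i] = it->second.normal_impulse[i] * warm_start_factor_;
        } else {
            ++misses_;
            it = cache_.insert_or_assign(key, CachedManifold{}).first;
            it->second.contact_count = contact_count;
        }
        it->second.last_seen_frame = frame;
        return out;
    }

    // Called after the solver with the impulses it ended up with.
    void store(EntityPairKey key, const ImpulseArray& impulses) {
        auto it = cache_.find(key);
        if (it != cache_.end()) it->second.normal_impulse = impulses;
    }

    // Drop entries not seen for more than max_age frames.
    void prune_stale(std::uint64_t frame, std::uint64_t max_age) {
        for (auto it = cache_.begin(); it != cache_.end();) {
            if (frame - it->second.last_seen_frame > max_age) it = cache_.erase(it);
            else ++it;
        }
    }

    std::size_t size() const { return cache_.size(); }
    int hits() const { return hits_; }
    int misses() const { return misses_; }
    float factor() const { return warm_start_factor_; }

private:
    float warm_start_factor_;
    int hits_ = 0;
    int misses_ = 0;
    std::unordered_map<EntityPairKey, CachedManifold> cache_;
};
```

Per frame, the intended call order in this sketch is `warm_start()` for every new manifold, the solver loop, `store()` with the resulting impulses, and finally `prune_stale(frame, 3)`.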

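Level 3 – Parallel manifolds & GPU constraint solving ⚡️

In very high-contact scenarios (destructible piles, crowded scenes) the CPU solver becomes the bottleneck. Level 3 addresses this by parallelizing constraint processing and, where needed, moving the solver to the GPU.

Two complementary approaches

Parallel constraint processing on the CPU
- Partition the manifolds and run independent contact resolution in parallel where possible.
- Use spatial/ownership heuristics to reduce conflicting body writes, or atomic updates when conflicts are rare.
GPU compute-shader solver
- Pack contacts into an SSBO, run a deterministic fixed-point compute shader to compute impulses, and apply them to per-body accumulators via atomic updates.
- The M6 research notes include a prototype compute shader and discuss deterministic atomic accumulation and fixed-point methods (docs/research/M6_COMPREHENSIVE_RESEARCH.md).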
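
One common way to get contacts that can be resolved in parallel without conflicting body writes is to group them into batches in which no body appears twice. The greedy batching below illustrates that general technique; it is not necessarily the spatial/ownership heuristic the engine uses, and `ContactRef` and `build_batches` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical reference to the two bodies a contact touches.
struct ContactRef { std::uint32_t body_a, body_b; };

// Greedy batching: each contact goes into the first batch in which neither of
// its bodies is already used. Contacts inside one batch write to disjoint
// bodies, so a batch can be solved in parallel; batches run one after another.
std::vector<std::vector<std::size_t>> build_batches(const std::vector<ContactRef>& contacts) {
    std::vector<std::vector<std::size_t>> batches;        // contact indices per batch
    std::vector<std::unordered_set<std::uint32_t>> used;  // bodies touched per batch
    for (std::size_t i = 0; i < contacts.size(); ++i) {
        const ContactRef& c = contacts[i];
        assert(c.body_a != c.body_b);
        std::size_t b = 0;
        while (b < batches.size() &&
               (used[b].count(c.body_a) != 0 || used[b].count(c.body_b) != 0)) {
            ++b;
        }
        if (b == batches.size()) {
            batches.emplace_back();
            used.emplace_back();
        }
        batches[b].push_back(i);
        used[b].insert(c.body_a);
        used[b].insert(c.body_b);
    }
    return batches;
}
```

Because contacts are visited in index order, the batches are the same on every run for the same input, which fits the determinism goal. Static bodies are never written, so a fuller version could skip them when checking for conflicts.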

Example GLSL snippet (conceptual)

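// Per-contact work item (fixed-point arithmetic for determinism)
Contact c = contacts[gid];
int rel_vel = compute_relative_velocity_fixed(c);
int impulse = compute_impulse_fixed(c, rel_vel);

// Apply the impulse atomically to both bodies. Core GLSL atomicAdd() takes a
// scalar int/uint in buffer or shared memory, so each component is added
// separately. Assumes c.normal and .linear are ivec3; fixed-point rescaling
// of the product is omitted.
ivec3 j = impulse * c.normal;
atomicAdd(body_impulses[c.bodyA].linear.x,  j.x);
atomicAdd(body_impulses[c.bodyA].linear.y,  j.y);
atomicAdd(body_impulses[c.bodyA].linear.z,  j.z);
atomicAdd(body_impulses[c.bodyB].linear.x, -j.x);
atomicAdd(body_impulses[c.bodyB].linear.y, -j.y);
atomicAdd(body_impulses[c.bodyB].linear.z, -j.z);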

The GPU path delivers a 2–4× speedup on workloads with more than 10 k contacts, while the CPU-parallel path provides a smooth fallback on hardware without a capable GPU.

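Lessons & next steps

Lesson	Takeaway
Local batching beats per-item locking	Reserving space and merging in bulk sharply reduces mutex contention.
Warm-starting is essential for stability	Even a moderate warm-start factor (0.8) can cut solver iterations by about 30 % on resting piles.
Determinism vs. performance trade-off	Fixed-point arithmetic and deterministic atomics keep GPU results reproducible across frames and hardware.
Cache locality matters	Storing manifolds in a contiguous cache (a vector of structs) improves narrow-phase throughput on modern CPUs.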
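
The warm-starting lesson can be seen in a toy model: a 1D stack of unit-mass boxes resting on static ground, solved with sequential impulses until the largest correction in a sweep drops below a tolerance. Everything here (`solve_stack`, the 1D setup, the tolerance, running to convergence instead of a fixed iteration count) is an assumption for illustration, and it does not reproduce the roughly 30 % figure above, which comes from the engine's own scenes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct StackSolveResult {
    std::vector<double> impulses;  // accumulated normal impulse per contact
    int iterations = 0;            // sweeps run until the largest correction < tolerance
};

// Body 0 is the static ground; bodies 1..boxes are unit-mass boxes stacked upward.
// Contact c sits between body c and body c + 1.
StackSolveResult solve_stack(std::size_t boxes, const std::vector<double>& warm_impulses,
                             double warm_start_factor, int max_iterations = 10000) {
    assert(warm_impulses.empty() || warm_impulses.size() == boxes);
    const double dt = 1.0 / 60.0, gravity = -9.81, tolerance = 1e-6;
    std::vector<double> inv_mass(boxes + 1, 1.0);
    std::vector<double> velocity(boxes + 1, gravity * dt);  // gravity already integrated
    inv_mass[0] = 0.0;  // the ground never moves
    velocity[0] = 0.0;

    StackSolveResult r;
    r.impulses.assign(boxes, 0.0);
    // Warm start: pre-apply last frame's impulses, scaled by the factor.
    for (std::size_t c = 0; c < warm_impulses.size(); ++c) {
        r.impulses[c] = warm_impulses[c] * warm_start_factor;
        velocity[c]     -= r.impulses[c] * inv_mass[c];
        velocity[c + 1] += r.impulses[c] * inv_mass[c + 1];
    }

    for (r.iterations = 1; r.iterations <= max_iterations; ++r.iterations) {
        double largest = 0.0;
        for (std::size_t c = 0; c < boxes; ++c) {
            const double vn = velocity[c + 1] - velocity[c];  // negative means approaching
            const double delta = -vn / (inv_mass[c] + inv_mass[c + 1]);
            // Clamp the accumulated impulse, not the delta: contacts only push.
            const double updated = std::max(0.0, r.impulses[c] + delta);
            const double applied = updated - r.impulses[c];
            r.impulses[c] = updated;
            velocity[c]     -= applied * inv_mass[c];
            velocity[c + 1] += applied * inv_mass[c + 1];
            largest = std::max(largest, std::fabs(applied));
        }
        if (largest < tolerance) break;
    }
    return r;
}
```

Solving the same stack a second time with the first solve's impulses as `warm_impulses` starts much closer to the answer, so it needs fewer sweeps to reach the same tolerance.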

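Next steps

Refine the conflict-resolution heuristics for the CPU-parallel solver.
Add profiling hooks to switch automatically between the CPU and GPU paths based on contact count.
Extend the GPU solver to handle friction and restitution in a single pass.

All of the reference code lives in the repository's engine/physics/ directory. Questions and contributions are welcome: feel free to open a PR or an issue!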

// Compute impulse in fixed-point arithmetic
int impulse = compute_impulse_fixed(c, rel_vel);

// Deterministic atomic addition into per-body accumulators
apply_impulse_atomic(c.bodyA, impulse);
apply_impulse_atomic(c.bodyB, -impulse);

Note: the research draft details layout packing, atomic accumulation, and the determinism considerations needed for replay and cross-platform validation.

Advantages

Massive parallelism across thousands of contacts.
Deterministic fixed-point arithmetic ensures consistent replays.
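
Why fixed-point helps with determinism: integer addition is associative, so the order in which atomic adds land does not change the sum, whereas float addition can round differently depending on order. The CPU-side sketch below shows this with a Q16.16 format, which is an assumption; the post does not state the engine's fixed-point layout, and `to_fixed`, `fixed_mul`, `sum_fixed`, and `sum_float` are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Q16.16 fixed point (assumed format): 16 integer bits, 16 fractional bits.
constexpr int kFracBits = 16;
using fixed_t = std::int32_t;

inline fixed_t to_fixed(double v) {
    assert(v > -32768.0 && v < 32768.0);  // representable range of Q16.16
    return static_cast<fixed_t>(std::llround(v * (1 << kFracBits)));
}

// Multiply through a 64-bit intermediate so the product cannot overflow.
// (Right-shifting a negative value is arithmetic on mainstream compilers.)
inline fixed_t fixed_mul(fixed_t a, fixed_t b) {
    return static_cast<fixed_t>((static_cast<std::int64_t>(a) * b) >> kFracBits);
}

// Integer accumulation: the result is identical for any ordering of the inputs
// (as long as the running sum stays in range).
inline fixed_t sum_fixed(const std::vector<fixed_t>& impulses) {
    fixed_t sum = 0;
    for (fixed_t j : impulses) sum += j;
    return sum;
}

// Float accumulation: each addition rounds, so the ordering can change the result.
inline float sum_float(const std::vector<float>& impulses) {
    float sum = 0.0f;
    for (float j : impulses) sum += j;
    return sum;
}
```

With the inputs 4096, 0.0001, and -4096, the float sum loses the small term when it is added to 4096 first but keeps it when the two large terms cancel first; the fixed-point sum is the same either way.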

Trade‑offs & safeguards

Atomic updates to body accumulators must be deterministic and bounded to preserve stability.
Warm-starting and per-manifold pre-filtering are still used to reduce redundant contact work sent to the GPU.

Performance — targets & results 📊

Target: a 50 % reduction in solver work for static stacked scenes. Our runs show a typical 30 %–60 % reduction in iterations and wall-time, depending on the scene.

GPU offload: constraint offload to GPU can give > 5× speed‑up in high‑contact scenes, provided atomic‑accumulation semantics and fixed‑point scaling are tuned for deterministic behavior.

How to tune (config keys)

Key	Description	Default	Range
`physics.solver.iterations`	Overall solver iterations	8	1 – 16
`physics.solver.velocity_iterations`	Velocity‑level iterations	4	1 – 16
`physics.solver.position_iterations`	Position‑correction iterations	2	0 – 8
`physics.solver.warm_start_factor`	Warm‑start scale	0.8	0.0 – 1.0

These keys are read by PhysicsSystem::init() (see physics_system.cpp) and clamped to safe ranges during initialization. Use the debug UI to monitor Manifolds:, WarmHits: and WarmMiss: counts while tuning.
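
As a rough picture of the read-and-clamp step, the sketch below loads the four keys from a flat key-to-number map and clamps them to the ranges in the table. `SolverConfig`, `ConfigMap`, and `load_solver_config` are hypothetical; the engine's actual config API and PhysicsSystem::init() are not shown in the post.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>

// Defaults follow the table above; the struct itself is illustrative.
struct SolverConfig {
    int iterations = 8;
    int velocity_iterations = 4;
    int position_iterations = 2;
    float warm_start_factor = 0.8f;
};

// Hypothetical config source standing in for the engine's config system.
using ConfigMap = std::unordered_map<std::string, double>;

SolverConfig load_solver_config(const ConfigMap& cfg) {
    const SolverConfig defaults;
    // Missing keys fall back to the default; present keys are clamped below.
    auto read = [&cfg](const char* key, double fallback) {
        const auto it = cfg.find(key);
        return it != cfg.end() ? it->second : fallback;
    };
    // Clamp in double first so absurd inputs cannot overflow the int cast.
    auto read_int = [&read](const char* key, int fallback, double lo, double hi) {
        return static_cast<int>(std::clamp(read(key, fallback), lo, hi));
    };

    SolverConfig out;
    out.iterations = read_int("physics.solver.iterations", defaults.iterations, 1.0, 16.0);
    out.velocity_iterations =
        read_int("physics.solver.velocity_iterations", defaults.velocity_iterations, 1.0, 16.0);
    out.position_iterations =
        read_int("physics.solver.position_iterations", defaults.position_iterations, 0.0, 8.0);
    out.warm_start_factor = static_cast<float>(std::clamp(
        read("physics.solver.warm_start_factor", defaults.warm_start_factor), 0.0, 1.0));

    assert(out.iterations >= 1 && out.position_iterations >= 0);
    return out;
}
```

Clamping at load time means the solver loop never has to defend against a zero-iteration or negative warm-start configuration.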

Lessons learned & best practices ✅

Stage your physics design: build correctness in Level 1 first, then add warm-starting and caching, and finally parallel/GPU paths.
Keep narrow-phase parallelism worker-local and minimize synchronization with bulk merges.
Use fixed-point math for GPU solvers to make behavior reproducible across platforms.
Warm-starting pays off strongly in stacked/stable scenarios.
Instrument manifolds and solver stats aggressively: we surface manifold counts in the debug UI and log warm-start hits/misses. Physics timing uses SDL_GetPerformanceCounter() and helpers (e.g., sdl_elapsed_us) and accumulates stage timings in stage_timings_accum_.manifold_cache_us and stage_timings_accum_.warm_start_us for profiling.
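
A scoped timer is one simple way to accumulate per-stage microseconds like this. The sketch uses std::chrono::steady_clock as a portable stand-in for SDL_GetPerformanceCounter(); `StageTimings` mirrors only the two field names mentioned above, and `ScopedStageTimer` is a hypothetical helper, not the engine's sdl_elapsed_us.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Mirrors the two accumulator fields named in the post; other fields are omitted.
struct StageTimings {
    std::uint64_t manifold_cache_us = 0;
    std::uint64_t warm_start_us = 0;
};

// Adds the lifetime of this object, in microseconds, to the given accumulator.
class ScopedStageTimer {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, so elapsed time is never negative

    explicit ScopedStageTimer(std::uint64_t& sink) : sink_(sink), start_(Clock::now()) {}

    ~ScopedStageTimer() {
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - start_);
        assert(elapsed.count() >= 0);
        sink_ += static_cast<std::uint64_t>(elapsed.count());
    }

    ScopedStageTimer(const ScopedStageTimer&) = delete;
    ScopedStageTimer& operator=(const ScopedStageTimer&) = delete;

private:
    std::uint64_t& sink_;
    Clock::time_point start_;
};
```

Wrapping a stage in a block with a `ScopedStageTimer timer(timings.warm_start_us);` at the top charges that block's wall time to the stage, including early returns.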

Verified code pointers 🔎

The statements in this post were checked against the following code locations and documents:

Parallel narrow-phase / job logic: engine/systems/physics/physics_job.cpp (process_pair_and_append, local_manifolds, bulk merge under manifold_mutex_).
Manifold cache & warm-start: engine/systems/physics/physics_system.cpp (update_manifold_cache(), warm_start_manifolds(), prune_stale_manifolds()).
Solver loop and iteration clamping: engine/systems/physics/physics_system.cpp (the solver iteration loop, solver_iterations_, velocity_iterations_, position_iterations_, and the clamping logic).
Config keys read in PhysicsSystem::init(): physics.solver.iterations, physics.solver.warm_start_factor, physics.solver.velocity_iterations, physics.solver.position_iterations.
Timing / instrumentation: the stage_timings_accum_ fields and the sdl_elapsed_us wrapper measure manifold-cache and warm-start time.
Constraint & solver math: docs/specs/engine/systems/physics/constraint_solver.md and docs/specs/engine/systems/physics/physics_math.md.

These references are inlined in the post where appropriate, for reproducibility.

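Next steps 🎯

Keep tuning the GPU solver's atomic strategy and deterministic accumulation.
Explore hybrid scheduling (the CPU handles low-contact pairs, the GPU handles bulk contacts).
Add a cross-platform validation harness to verify determinism between the CPU and GPU paths.

Acknowledgements

Thanks to the team for fast, focused work this week: iterating on both the CPU and GPU paths, and landing warm-starting and manifold caching in time for playtests.

Author: Bad Cat Engine Team, Bad Cat: Void Frontier

Tags: #gamedev #physics #cpp #vulkan #parallelism #simulation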
