[Paper] COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation
Source: arXiv - 2601.09698v1
Overview
The paper introduces COMPOSE, a new framework for reconstructing 3D human poses from a handful of camera views. By treating the cross‑view matching problem as a hypergraph partitioning task instead of relying on fragile pairwise links, COMPOSE dramatically improves the robustness and accuracy of multi‑view 3D pose estimation—crucial for applications such as sports analytics, AR/VR, and human‑robot interaction.
Key Contributions
- Hypergraph formulation: Recasts multi‑view keypoint association as a hypergraph partitioning problem, capturing global consistency across any number of views in a single optimization.
- Geometric pruning: Introduces a fast, geometry‑driven pruning step that cuts down the exponential search space of the integer linear program, making the method practical for real‑time pipelines.
- State‑of‑the‑art performance: Demonstrates up to 23 % higher average precision (AP) than prior optimization‑based approaches and up to 11 % over recent self‑supervised deep models on standard multi‑view benchmarks.
- Modular pipeline: Works as a drop‑in replacement for the association stage in existing 2‑D‑to‑3‑D pipelines, requiring only 2‑D keypoint detections and calibrated cameras as input.
Methodology
- 2‑D detection: Off‑the‑shelf 2‑D pose detectors (e.g., HRNet, OpenPose) provide keypoint locations in each camera view.
- Hypergraph construction:
  - Each node represents a 2‑D detection.
  - A hyperedge connects detections from all views that could belong to the same 3‑D joint, encoding a multi‑way correspondence rather than a simple pairwise link.
- Geometric feasibility check: Before forming a hyperedge, the algorithm checks whether the detections are geometrically consistent (e.g., epipolar constraints, triangulation error below a threshold). This step prunes impossible combinations early.
- Integer Linear Programming (ILP): The hypergraph is partitioned by solving an ILP that selects a set of hyperedges covering every detection exactly once while minimizing a cost derived from detection confidence and reprojection error.
- Triangulation: The selected hyperedges directly yield consistent multi‑view correspondences, which are triangulated to obtain the final 3‑D joint coordinates.
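The paper's exact ILP formulation is not reproduced here, but the selection step it describes is a classic set‑partitioning problem: choose a min‑cost subset of hyperedges so that every detection is covered exactly once. A minimal sketch using SciPy's generic MILP solver (the function name `partition_hypergraph` and the toy costs are ours, not from the paper):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def partition_hypergraph(n_detections, hyperedges, costs):
    """Select a minimum-cost set of hyperedges covering each detection exactly once."""
    n_edges = len(hyperedges)
    # Incidence matrix: A[d, e] = 1 if detection d belongs to hyperedge e.
    A = np.zeros((n_detections, n_edges))
    for e, dets in enumerate(hyperedges):
        for d in dets:
            A[d, e] = 1.0
    # Exact-cover constraint: each detection appears in exactly one chosen edge.
    cover = LinearConstraint(A, lb=1, ub=1)
    res = milp(c=np.asarray(costs, dtype=float),
               constraints=[cover],
               integrality=np.ones(n_edges),  # all decision variables binary
               bounds=Bounds(0, 1))
    chosen = [hyperedges[e] for e in range(n_edges) if res.x[e] > 0.5]
    return chosen, res.fun

# Toy instance: 2 views x 2 people = 4 detections (0,1 in view A; 2,3 in view B).
edges = [(0, 2), (0, 3), (1, 2), (1, 3)]   # candidate cross-view hyperedges
costs = [0.1, 0.9, 0.8, 0.2]               # e.g., reprojection-error-based costs
selected, total = partition_hypergraph(4, edges, costs)
```

Here the solver picks `(0, 2)` and `(1, 3)`, the low-cost partition, illustrating how the exact-cover constraint rules out assigning one detection to two people.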
The key insight is that by optimizing globally over hyperedges, the method enforces cycle consistency automatically, eliminating the cascade of errors typical in pairwise matching.
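Both the feasibility check and the final step rest on standard linear (DLT) triangulation plus a reprojection-error test. A self-contained sketch under that assumption (the helper names `triangulate` and `reprojection_error` are ours, and the two toy cameras are illustrative, not from the paper):

```python
import numpy as np

def triangulate(projections, points_2d):
    """DLT triangulation of one 3-D point from two or more calibrated views.

    projections: list of 3x4 camera projection matrices P_i
    points_2d:   list of (x, y) pixel observations, one per view
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    # The homogeneous solution is the smallest right singular vector.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

def reprojection_error(projections, points_2d, X):
    """Mean pixel distance between observations and the reprojected 3-D point."""
    Xh = np.append(X, 1.0)
    errs = []
    for P, (x, y) in zip(projections, points_2d):
        p = P @ Xh
        errs.append(np.hypot(p[0] / p[2] - x, p[1] / p[2] - y))
    return float(np.mean(errs))

# Two toy cameras observing the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])              # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # shifted along x
obs = [(0.0, 0.0), (-0.2, 0.0)]                            # its projections
X = triangulate([P1, P2], obs)
err = reprojection_error([P1, P2], obs, X)  # near zero for consistent detections
```

A hyperedge whose detections triangulate with `err` above a threshold would be pruned before ever reaching the ILP, which is what keeps the search space tractable.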
Results & Findings
| Dataset | Baseline (pairwise) | COMPOSE | Δ AP (↑) |
|---|---|---|---|
| Campus (4 views) | 71.2 % | 84.5 % | +13.3 % |
| Shelf (5 views) | 68.9 % | 92.1 % | +23.2 % |
| CMU Panoptic (8 views) | 78.4 % | 89.7 % | +11.3 % |
- Robustness to outliers: When synthetic noise is added to 2‑D detections, COMPOSE degrades gracefully, maintaining >80 % AP even with 30 % false detections.
- Runtime: After pruning, the ILP solves in ~120 ms for 5 views and ~250 ms for 8 views on a modern CPU, fitting comfortably into many offline or near‑real‑time pipelines.
- Ablation: Removing hypergraph constraints (i.e., reverting to pairwise) drops performance by 9–15 % AP, confirming the importance of global consistency.
Practical Implications
- Plug‑and‑play for existing systems: Developers can keep their favorite 2‑D detectors and simply swap in COMPOSE for the matching stage, gaining a sizable boost without retraining a full end‑to‑end model.
- Edge‑device feasibility: The geometric pruning step is lightweight and can be executed on embedded GPUs or even CPUs, making multi‑camera setups on robots or AR headsets more reliable.
- Reduced annotation burden: Since COMPOSE works with sparse views and does not require large amounts of 3‑D ground truth for training, it’s attractive for studios or labs that can only afford a few calibrated cameras.
- Improved downstream tasks: More accurate 3‑D poses translate directly into better action recognition, motion capture for animation, and safer human‑robot collaboration where precise joint locations are safety‑critical.
Limitations & Future Work
- Scalability to very large camera networks: Although pruning mitigates the exponential ILP growth, the method still becomes slower as the number of views exceeds ~10–12.
- Dependence on calibration: Accurate intrinsic/extrinsic parameters are assumed; errors in calibration can hurt the geometric feasibility checks.
- Static scene assumption: The current formulation does not handle dynamic camera rigs (e.g., moving drones) without re‑computing the hypergraph on the fly.
Future research directions suggested by the authors:
- Learning a data‑driven pruning model to further accelerate hyperedge generation.
- Extending the hypergraph framework to jointly estimate camera poses and human pose (self‑calibration).
- Integrating temporal consistency across frames to handle fast motions and occlusions more robustly.
Authors
- Tony Danjun Wang
- Tolga Birdal
- Nassir Navab
- Lennart Bastian
Paper Information
- arXiv ID: 2601.09698v1
- Categories: cs.CV
- Published: January 14, 2026