[Paper] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss
Source: arXiv - 2511.21575v1
Overview
This paper tackles a real‑world snag in computer‑assisted orthopaedic surgery: automatically finding anatomical landmarks in pelvic fluoroscopy when the X‑ray view isn’t perfectly aligned. By weaving a 2D/3D registration loss into a classic U‑Net detector, the authors show that landmark accuracy stays high even when the patient or C‑arm is rotated—something current models struggle with.
Key Contributions
- Hybrid training loss: Introduces a Pose‑Estimation Loss that penalizes inconsistencies between predicted 2D landmarks and their 3D counterparts projected into the image plane (one plausible formulation is sketched after this list).
- Robust U‑Net pipeline: Extends the standard U‑Net landmark predictor with the new loss, yielding a model that adapts to arbitrary pelvis orientations.
- Comprehensive evaluation: Benchmarks three setups—baseline U‑Net, U‑Net + Pose‑Loss (trained from scratch), and U‑Net fine‑tuned with Pose‑Loss—under simulated intra‑operative pose variations.
- Open‑source potential: Provides enough implementation detail (loss formulation, data augmentation, registration pipeline) for reproducibility and integration into existing surgical navigation stacks.
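As a rough sketch of how such a hybrid objective is typically written (the notation and the weighting factor λ below are illustrative assumptions, not the paper's exact formulation):

```latex
% Illustrative combined heatmap + registration objective (assumed notation):
% H_i: predicted heatmap for landmark i, \hat{H}_i: ground-truth heatmap,
% \hat{T}: estimated pose, T: ground-truth pose, X_i: 3D landmark coordinates.
\mathcal{L}
  = \underbrace{\sum_{i} \lVert H_i - \hat{H}_i \rVert_2^2}_{\text{heatmap regression}}
  + \lambda \, \underbrace{\sum_{i} \lVert \hat{T}(X_i) - T(X_i) \rVert_2^2}_{\text{registration}}
```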
Methodology
Data preparation
- 3D pelvic CT scans are paired with synthetic 2D fluoroscopic projections covering a wide range of rotations (±30° in pitch, yaw, and roll).
- Ground‑truth 3D landmark coordinates are known; their 2D projections become the training targets.
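A minimal sketch of how such synthetic 2D targets can be generated from known 3D landmarks, assuming a simple pinhole-style projection; the function names, focal length, and source-to-object distance are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def euler_to_matrix(pitch, yaw, roll):
    """Compose a rotation matrix from Euler angles in radians (Z-Y-X order)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_landmarks(landmarks_3d, rotation, translation, focal=1000.0):
    """Perspective-project 3D landmarks (N, 3) onto the 2D image plane."""
    pts = landmarks_3d @ rotation.T + translation   # apply the rigid pose
    return focal * pts[:, :2] / pts[:, 2:3]         # pinhole projection

# Sample a random pose within +/-30 degrees per axis (illustrative).
angles = np.deg2rad(np.random.uniform(-30, 30, size=3))
R = euler_to_matrix(*angles)
t = np.array([0.0, 0.0, 800.0])                     # assumed source-to-object distance
landmarks_3d = np.random.rand(14, 3) * 100          # stand-in for CT landmark coordinates
targets_2d = project_landmarks(landmarks_3d, R, t)  # 2D training targets
```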
Base model
- A vanilla U‑Net takes a single fluoroscopic frame and outputs one heatmap per landmark.
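For context, heatmap-based landmark detectors of this kind typically regress one Gaussian blob per landmark; a minimal sketch of target generation, where the image size, σ, and coordinates are assumptions rather than the paper's settings:

```python
import numpy as np

def gaussian_heatmap(center_xy, height, width, sigma=4.0):
    """Render a 2D Gaussian centered on one landmark; used as the regression target."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - center_xy[0]) ** 2 + (ys - center_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example pixel coordinates for a few landmarks (illustrative values only).
landmark_pixels = [(96.0, 128.0), (160.0, 120.0), (128.0, 200.0)]
# One (num_landmarks, H, W) target tensor per fluoroscopic frame.
targets = np.stack([gaussian_heatmap(p, 256, 256) for p in landmark_pixels])
```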
Pose‑Estimation Loss
- After the U‑Net predicts 2D heatmaps, the peak locations are extracted.
- These 2D points are back‑projected into 3D space using the known imaging geometry, yielding an estimated 3D pose.
- The loss combines:
  - a heatmap regression loss (L2 between predicted and ground‑truth heatmaps), and
  - a registration loss (L2 between the estimated 3D landmarks and the true 3D landmarks after applying the current pose).
- The registration term forces the network to learn pose‑invariant features, because any mismatch in 3D space is directly penalized.
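A minimal PyTorch-style sketch of this combined objective. A differentiable soft-argmax is used here as a stand-in for peak extraction, and the 2D-to-3D step (which uses the known imaging geometry) is assumed to have already produced estimated 3D landmarks; the function names and the weight `lam` are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def soft_argmax_2d(heatmaps):
    """Differentiable peak extraction: (B, K, H, W) heatmaps -> (B, K, 2) xy coordinates."""
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    xs = torch.linspace(0, w - 1, w, device=heatmaps.device)
    ys = torch.linspace(0, h - 1, h, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # expected x (sum over rows first)
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # expected y (sum over columns first)
    return torch.stack([x, y], dim=-1)

def combined_loss(pred_heatmaps, gt_heatmaps, est_landmarks_3d, gt_landmarks_3d, lam=1.0):
    """Heatmap regression term plus a 3D registration term (lam is an assumed weight).

    est_landmarks_3d is assumed to come from the 2D->3D step driven by the
    extracted peak locations and the known imaging geometry.
    """
    heatmap_term = F.mse_loss(pred_heatmaps, gt_heatmaps)
    registration_term = F.mse_loss(est_landmarks_3d, gt_landmarks_3d)
    return heatmap_term + lam * registration_term
```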
Training regimes
- Baseline: U‑Net trained with heatmap loss only.
- From‑scratch Pose: same architecture but with the combined loss from epoch 0.
- Fine‑tuned Pose: baseline model further trained with the combined loss for a few epochs.
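The three regimes can be summarized as configurations that differ only in initialization and in which loss terms are active; a minimal sketch (keys and labels are illustrative, not the paper's code):

```python
# Illustrative summary of the three training regimes described above.
regimes = {
    "baseline":       {"init_weights": "random",        "loss_terms": ["heatmap"]},
    "pose_scratch":   {"init_weights": "random",        "loss_terms": ["heatmap", "registration"]},
    "pose_finetuned": {"init_weights": "baseline_unet", "loss_terms": ["heatmap", "registration"]},
}
```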
Evaluation
- Mean Euclidean distance (MED) between predicted and true 2D landmarks across a held‑out test set with random poses.
- Success rate under a clinically relevant error threshold (≤ 2 mm).
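A small sketch of the two metrics as described, assuming predicted and ground-truth landmark coordinates are expressed in millimetres (the 2 mm threshold follows the criterion above; the example values are illustrative):

```python
import numpy as np

def mean_euclidean_distance(pred_2d, gt_2d):
    """Mean Euclidean distance (MED) over all landmarks, in the units of the inputs."""
    return np.linalg.norm(pred_2d - gt_2d, axis=-1).mean()

def success_rate(pred_2d, gt_2d, threshold_mm=2.0):
    """Fraction of landmarks whose error falls within the clinical threshold."""
    errors = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return (errors <= threshold_mm).mean()

# Example: 14 landmarks with predictions perturbed around ground truth (in mm).
gt = np.random.rand(14, 2) * 100
pred = gt + np.random.randn(14, 2) * 1.5
print(mean_euclidean_distance(pred, gt), success_rate(pred, gt))
```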
Results & Findings
| Model | MED (mm) | % ≤ 2 mm |
|---|---|---|
| Baseline U‑Net | 3.9 | 68% |
| U‑Net + Pose‑Loss (scratch) | 2.7 | 81% |
| U‑Net fine‑tuned with Pose‑Loss | 2.5 | 84% |
- Adding the registration loss cuts the mean landmark error by roughly 35% relative to the baseline (3.9 mm → 2.5 mm).
- Fine‑tuning yields the best trade‑off: the network retains its learned visual features while gaining pose robustness.
- Qualitative visualizations show the model correctly tracks landmarks even when the pelvis is tilted 30°—a scenario where the baseline often drifts or collapses.
Practical Implications
- Surgical navigation: Surgeons can rely on automated landmarking without pausing to re‑align the C‑arm, reducing operative time and radiation exposure.
- Software integration: The loss formulation is framework‑agnostic and can be implemented in PyTorch or TensorFlow, making it easy to drop into existing U‑Net‑based pipelines used in OR navigation suites.
- Generalization: The same 2D/3D registration loss can be repurposed for other anatomical regions (spine, knee) where intra‑operative view variability is common.
- Edge devices: Because the underlying model stays a lightweight U‑Net, inference can run on GPU‑accelerated workstations or even on‑device inference cards, enabling real‑time feedback.
Limitations & Future Work
- Synthetic pose distribution: The study relies on simulated fluoroscopic angles; real‑world data may exhibit more complex distortions (e.g., patient motion, metal artifacts).
- Single‑view assumption: Only one fluoroscopic image is processed at a time; extending to multi‑view fusion could further boost accuracy.
- Calibration dependency: Accurate 2D/3D registration needs precise knowledge of the imaging geometry, which may drift in the OR. Future work could incorporate self‑calibration or learnable projection models.
- Clinical validation: The authors plan prospective trials on live surgeries to confirm that the error reductions translate into measurable workflow improvements.
Authors
- Chou Mo
- Yehyun Suh
- J. Ryan Martin
- Daniel Moyer
Paper Information
- arXiv ID: 2511.21575v1
- Categories: cs.CV
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21575v1