[Paper] 6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks
Source: arXiv - 2605.08059v1
Overview
A new modular framework tackles 6‑DoF (six degrees of freedom) object pose estimation by regressing keypoint heatmaps from RGB images and fusing depth when it is available. By pairing a YOLOv10m detector with a lightweight ResNet‑18‑based heatmap regressor, the authors target both accuracy and speed, and report state‑of‑the‑art results on the LINEMOD benchmark.
Key Contributions
- Hybrid detection‑plus‑heatmap pipeline – YOLOv10m for fast object bounding‑box detection, followed by a ResNet‑18 network that outputs 2‑D heatmaps for a set of predefined object keypoints.
- Systematic keypoint selection study – Experiments comparing uniformly sampled surface points, 3‑D bounding‑box corners, and learned semantic points, quantifying their effect on pose error.
- RGB‑D cross‑fusion architecture – A multi‑stage feature‑exchange module that lets RGB and depth streams interact early and late, boosting pose accuracy without a massive parameter increase.
- Training tricks for robustness – Ablation of activation functions (ReLU vs. SiLU), cosine‑annealed learning‑rate schedules, and mixed‑precision training that collectively improve convergence.
- Open‑source implementation – Full code, pretrained weights, and scripts for reproducing results are released on GitHub.
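To make the keypoint selection comparison concrete, here is a minimal, dependency-free sketch of two candidate keypoint sets: the 8 corners of the 3‑D bounding box and a set of well‑spread surface points. Farthest point sampling is used here as an assumed stand-in for "well‑distributed" sampling; the summary does not say which sampler the authors use, and the function names are illustrative only.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def bbox_corners(points: Sequence[Point]) -> List[Point]:
    """Return the 8 corners of the axis-aligned 3-D bounding box of a model."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # product over (min, max) per axis enumerates all 2 * 2 * 2 corners.
    return list(product(*zip(mins, maxs)))


def farthest_point_sampling(points: Sequence[Point], k: int) -> List[Point]:
    """Greedily pick k points, each as far as possible from those already chosen.

    Assumption: this is one common way to get well-spread surface keypoints,
    not necessarily the sampler used in the paper.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [points[0]]  # deterministic seed: start from the first vertex
    # Distance from every point to its nearest already-chosen keypoint.
    nearest = [math.dist(p, chosen[0]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen
```

In practice `points` would be the vertices of the object's CAD model; corners are cheap to compute but lie off the surface, which is one plausible reason surface points can localize better.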
Methodology
- Object Detection – A pretrained YOLOv10m model scans the scene and returns tight 2‑D bounding boxes for each target object.
- Heatmap Regression – Inside each box, a ResNet‑18 backbone processes the cropped RGB patch and predicts a set of heatmaps, one per keypoint. Peaks in the heatmaps give the 2‑D image coordinates of the keypoints.
- Pose Solving – The 2‑D keypoints are paired with their known 3‑D coordinates (from the object CAD model). A Perspective‑n‑Point (PnP) solver with RANSAC filters out outliers and returns the 6‑DoF pose (rotation + translation).
- RGB‑D Fusion (optional) – When depth is available, the RGB and depth streams are merged through a cross‑fusion block that concatenates and linearly mixes features at three stages of the network, so the model can use geometric cues from the early layers onward.
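The heatmap-to-keypoint step above can be sketched in a few lines. This is a plain-Python illustration, not the authors' code: it takes the arg-max of each heatmap, maps the peak cell back to full-image pixels using the detector's box, and keeps only confident keypoints for the pose solver. The `(x_min, y_min, x_max, y_max)` box layout, the function names, and the `min_score` threshold are all assumptions.

```python
from typing import List, Sequence, Tuple

Heatmap = Sequence[Sequence[float]]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), image pixels


def decode_keypoints(heatmaps: Sequence[Heatmap], box: Box) -> List[Tuple[float, float, float]]:
    """Return one (u, v, score) triple per heatmap, in full-image pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    keypoints = []
    for heatmap in heatmaps:
        rows, cols = len(heatmap), len(heatmap[0])
        # Arg-max over the grid: the peak cell is the predicted keypoint location.
        score, row, col = max(
            (value, r, c)
            for r, line in enumerate(heatmap)
            for c, value in enumerate(line)
        )
        # Map the centre of the peak cell from heatmap space back to the image.
        u = x_min + (col + 0.5) * (x_max - x_min) / cols
        v = y_min + (row + 0.5) * (y_max - y_min) / rows
        keypoints.append((u, v, score))
    return keypoints


def build_correspondences(keypoints, model_points, min_score=0.3):
    """Pair confident 2-D keypoints with their 3-D model points (hypothetical helper)."""
    return [
        ((u, v), xyz)
        for (u, v, score), xyz in zip(keypoints, model_points)
        if score >= min_score
    ]
```

The resulting 2‑D/3‑D pairs are what a PnP solver with RANSAC consumes; OpenCV's `cv2.solvePnPRansac` is one widely used implementation, though the summary does not say which solver the authors rely on.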
The detector is kept frozen; all other components are trained end‑to‑end using a combination of mean‑squared error (MSE) on the heatmaps and a pose‑aware loss that penalizes large reprojection errors.
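As a rough illustration of the two loss terms, the sketch below builds a Gaussian target heatmap, computes the heatmap error, and measures reprojection error under a pinhole camera. The Gaussian target, the `sigma` value, the weighting factor, and every function name are assumptions; the summary does not give the exact loss formulation, and real training would use differentiable tensors rather than plain Python lists.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Target heatmap: a 2-D Gaussian bump centred on the keypoint (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def heatmap_mse(pred, target):
    """Mean-squared error over all heatmap cells."""
    diffs = [(p - t) ** 2 for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


def project(point, rotation, translation, intrinsics):
    """Pinhole projection of a 3-D model point under pose (R, t)."""
    fx, fy, cx, cy = intrinsics
    # Camera-frame coordinates: R @ X + t, written out row by row.
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def reprojection_error(points_3d, keypoints_2d, rotation, translation, intrinsics):
    """Mean pixel distance between projected model points and 2-D keypoints."""
    errors = [
        math.dist(project(p, rotation, translation, intrinsics), uv)
        for p, uv in zip(points_3d, keypoints_2d)
    ]
    return sum(errors) / len(errors)


def combined_loss(pred_hm, target_hm, reproj_px, weight=0.01):
    """Hypothetical weighting of the two terms; the actual balance is not stated."""
    return heatmap_mse(pred_hm, target_hm) + weight * reproj_px
```

The reprojection term ties the 2‑D predictions to the 3‑D geometry: a keypoint that is a few pixels off in the heatmap can still produce a large pose error, which a pure per-pixel loss does not see.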
Results & Findings
| Model | Input | Mean ADD‑A (LINEMOD) | Inference time (ms) |
|---|---|---|---|
| RGB‑only (baseline) | RGB | 84.5 % | ~28 |
| RGB‑only (best activations) | RGB | 86.2 % | ~30 |
| RGB‑D Fusion | RGB + Depth | 92.4 % | ~35 |
- Keypoint selection matters – Using surface points that are well‑distributed across the object yields ~3 % higher ADD‑A than using only the 8 bounding‑box corners.
- Depth adds a big boost – The cross‑fusion design improves pose accuracy by ~8 percentage points over the RGB‑only baseline (84.5 % to 92.4 %) while adding only ~7 ms of inference time.
- Training tricks pay off – Switching to SiLU activations and a cosine‑annealed learning rate contributed a ~1.5 % gain in the RGB‑only model.
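For readers unfamiliar with the two training tricks, both reduce to short formulas. The sketch below writes them out in plain Python; the learning-rate bounds and step counts are placeholder values, not the paper's settings.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU (a.k.a. swish): x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def cosine_annealed_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

Unlike ReLU, SiLU is smooth and lets small negative values through, which is often credited with easier optimization. In PyTorch the equivalents are `torch.nn.SiLU` and `torch.optim.lr_scheduler.CosineAnnealingLR`.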
Overall, the approach matches or exceeds recent heavyweight methods (e.g., DenseFusion, PVN3D) while keeping the model size under 15 M parameters.
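The table reports ADD‑A without defining it. As a point of reference only, the sketch below implements the standard ADD criterion commonly used with LINEMOD: the average distance between model points under the ground-truth and predicted poses, with a pose counted as correct when that distance is below 10 % of the model diameter. Whether ADD‑A follows exactly this definition is an assumption, and the function names are illustrative.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid pose (3x3 rotation rows, 3-vector translation) to 3-D points."""
    return [
        tuple(sum(r * p for r, p in zip(row, pt)) + t for row, t in zip(rotation, translation))
        for pt in points
    ]


def add_distance(points, pose_gt, pose_pred):
    """Standard ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, *pose_gt)
    pred = transform(points, *pose_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)


def model_diameter(points):
    """Largest pairwise distance between model points (O(n^2), fine for a sketch)."""
    return max(math.dist(a, b) for a in points for b in points)


def is_pose_correct(points, pose_gt, pose_pred, fraction=0.1):
    """A pose counts as correct if ADD is below `fraction` of the model diameter."""
    return add_distance(points, pose_gt, pose_pred) < fraction * model_diameter(points)
```

Under this criterion, a benchmark score such as 92.4 % would be the share of test frames for which `is_pose_correct` holds, averaged over objects.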
Practical Implications
- Robotics & Automation – The fast detection + lightweight heatmap regressor can run on edge devices (e.g., Jetson Nano) for real‑time bin‑picking or assembly line verification.
- AR/VR & Mixed Reality – Accurate 6‑DoF pose from a single RGB frame enables stable object anchoring for AR overlays without needing expensive depth sensors.
- Warehouse & Logistics – The optional RGB‑D fusion works well with commodity depth cameras (Intel RealSense, Azure Kinect), giving a plug‑and‑play upgrade path for existing vision pipelines.
- Developer-friendly – Because the pipeline builds on widely used tools (the YOLOv10 detector family, PyTorch) and provides ready‑to‑run scripts, teams can integrate it into existing perception stacks with minimal code changes.
Limitations & Future Work
- Dataset scope – Evaluation is limited to the LINEMOD benchmark; performance on other object sets and in more heavily cluttered scenes remains untested.
- Depth quality dependency – The fusion gains diminish with noisy or low‑resolution depth maps, suggesting a need for more robust depth preprocessing.
- Keypoint annotation overhead – Defining optimal keypoint sets per object still requires manual CAD processing; automating this step could broaden applicability.
The authors propose extending the framework to handle multiple objects simultaneously, exploring self‑supervised keypoint discovery, and benchmarking on larger, more diverse datasets (e.g., YCB‑Video, BOP).
Authors
- Ismail Aljosevic
- Amir Masoud Almasi
- Ana Parovic
- Ashkan Shafiei
Paper Information
- arXiv ID: 2605.08059v1
- Categories: cs.CV, cs.RO
- Published: May 8, 2026