[Paper] 6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

Published: May 8, 2026 at 01:47 PM EDT

Source: arXiv - 2605.08059v1

Overview

A new modular framework tackles 6‑DoF (six degrees of freedom) object pose estimation by regressing keypoint heatmaps from RGB images and fusing depth when it is available. It pairs a YOLOv10m detector with a ResNet‑18‑based heatmap regressor, and the authors report state‑of‑the‑art accuracy on the LINEMOD benchmark: 92.4 % mean ADD‑A with RGB‑D input at roughly 35 ms inference time.

Key Contributions

Hybrid detection‑plus‑heatmap pipeline – YOLOv10m for fast object bounding‑box detection, followed by a ResNet‑18 network that outputs 2‑D heatmaps for a set of predefined object keypoints.
Systematic keypoint selection study – Experiments comparing uniformly sampled surface points, 3‑D bounding‑box corners, and learned semantic points, quantifying their effect on pose error.
RGB‑D cross‑fusion architecture – A multi‑stage feature‑exchange module that lets RGB and depth streams interact early and late, boosting pose accuracy without a massive parameter increase.
Training tricks for robustness – Ablation of activation functions (ReLU vs. SiLU), cosine‑annealed learning‑rate schedules, and mixed‑precision training that collectively improve convergence.
Open‑source implementation – Full code, pretrained weights, and scripts for reproducing results are released on GitHub.
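
The cross‑fusion idea in the list above can be pictured with a minimal sketch. This is not the authors' module: the function name `cross_fuse`, the channel counts, and the residual add are assumptions chosen for illustration, and plain NumPy stands in for a deep‑learning framework.

```python
import numpy as np

def cross_fuse(rgb_feat, depth_feat, w_rgb, w_depth):
    """Exchange information between two feature maps of shape (C, H, W).

    Hypothetical sketch: concatenate along channels, then mix with a
    per-pixel linear map (equivalent to a 1x1 convolution) per stream.
    """
    stacked = np.concatenate([rgb_feat, depth_feat], axis=0)  # (2C, H, W)
    # w_* has shape (C, 2C): every output channel mixes all input channels.
    rgb_out = rgb_feat + np.einsum("oc,chw->ohw", w_rgb, stacked)
    depth_out = depth_feat + np.einsum("oc,chw->ohw", w_depth, stacked)
    return rgb_out, depth_out

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8  # assumed sizes, for illustration only
rgb = rng.normal(size=(C, H, W))
depth = rng.normal(size=(C, H, W))
w_r = rng.normal(scale=0.1, size=(C, 2 * C))
w_d = rng.normal(scale=0.1, size=(C, 2 * C))

rgb_fused, depth_fused = cross_fuse(rgb, depth, w_r, w_d)
print(rgb_fused.shape, depth_fused.shape)
```

In this sketch each fusion stage adds only 2 × C × 2C mixing weights, which is one way a feature exchange can stay cheap relative to the backbone.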

Methodology

Object Detection – A pretrained YOLOv10m model scans the scene and returns tight 2‑D bounding boxes for each target object.
Heatmap Regression – Inside each box, a ResNet‑18 backbone processes the cropped RGB patch and predicts a set of heatmaps, one per keypoint. Peaks in the heatmaps give the 2‑D image coordinates of the keypoints.
Pose Solving – The 2‑D keypoints are paired with their known 3‑D coordinates (from the object CAD model). A Perspective‑n‑Point (PnP) solver with RANSAC filters out outliers and returns the 6‑DoF pose (rotation + translation).
RGB‑D Fusion (optional) – When depth is available, the RGB and depth streams are merged through a cross‑fusion block that concatenates and linearly mixes features at three stages of the network, letting the model use geometric cues from the early layers onward.
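
The heatmap‑to‑keypoint step described above can be sketched as follows. The function name `decode_keypoints`, the box format, and the assumption that the heatmap grid covers the detector box exactly are illustrative choices, not details taken from the paper.

```python
import numpy as np

def decode_keypoints(heatmaps, box):
    """Return (K, 2) image-space keypoints and (K,) peak scores.

    heatmaps: array of shape (K, H, W), one map per keypoint.
    box: (x_min, y_min, x_max, y_max) of the detector crop in image pixels.
    Assumes the heatmap grid covers the box exactly (hypothetical convention).
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    peak_idx = flat.argmax(axis=1)                 # strongest cell per keypoint
    rows, cols = np.unravel_index(peak_idx, (h, w))
    scores = flat[np.arange(k), peak_idx]
    x_min, y_min, x_max, y_max = box
    # Map heatmap cell centres back to full-image pixel coordinates.
    xs = x_min + (cols + 0.5) * (x_max - x_min) / w
    ys = y_min + (rows + 0.5) * (y_max - y_min) / h
    return np.stack([xs, ys], axis=1), scores

heatmaps = np.zeros((2, 4, 4))
heatmaps[0, 1, 2] = 1.0   # keypoint 0 peaks at row 1, column 2
heatmaps[1, 3, 0] = 0.5   # keypoint 1 peaks at row 3, column 0
keypoints, scores = decode_keypoints(heatmaps, box=(100, 50, 140, 90))
print(keypoints)
print(scores)
```

The resulting image‑space points, together with the matching 3‑D model coordinates and the camera intrinsics, are the inputs a PnP solver with RANSAC (for example OpenCV's `cv2.solvePnPRansac`) would consume; the peak scores could be used to discard weak keypoints first.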

The detector is kept frozen; all other components are trained end‑to‑end using a combination of mean‑squared error on the heatmaps and a pose‑aware loss that penalizes large reprojection errors.
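
A rough sketch of how such a combined objective might look is given here. The summary does not specify the exact form of the pose‑aware term, so the plain mean reprojection distance, the weight of 0.01, and the names `project` and `combined_loss` are all assumptions. NumPy is used for clarity; in a real training loop the 2‑D keypoints would have to come from a differentiable operation.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project (N, 3) model points to pixels with pose (R, t) and intrinsics K."""
    cam = points_3d @ R.T + t            # points in the camera frame
    uv = cam @ K.T                       # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]

def combined_loss(pred_hm, target_hm, pred_kp_2d, points_3d, R, t, K, weight=0.01):
    """Heatmap MSE plus a weighted mean reprojection distance (assumed form)."""
    mse = np.mean((pred_hm - target_hm) ** 2)
    reproj = np.linalg.norm(pred_kp_2d - project(points_3d, R, t, K), axis=1)
    return mse + weight * np.mean(reproj), mse, reproj

# Toy camera and object, chosen only to make the numbers easy to follow.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])
points_3d = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])

reference_kp = project(points_3d, R, t, K)
pred_kp = reference_kp + np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]])
hm = np.zeros((3, 4, 4))                 # identical heatmaps, so the MSE is zero

total, mse, reproj = combined_loss(hm, hm, pred_kp, points_3d, R, t, K)
print(total, mse, reproj)
```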

Results & Findings

Model	Input	Mean ADD‑A (LINEMOD)	Inference time (ms)
RGB‑only (baseline)	RGB	84.5 %	~28
RGB‑only (best activations)	RGB	86.2 %	~30
RGB‑D Fusion	RGB + Depth	92.4 %	~35
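
For readers unfamiliar with ADD‑style scores, the sketch below shows how such an accuracy is commonly computed on LINEMOD. The summary does not define "ADD‑A", so this assumes the usual ADD convention: a pose counts as correct when the mean distance between model points under the predicted and ground‑truth poses is below 10 % of the object diameter. The function names and toy values are hypothetical.

```python
import numpy as np

def add_error(model_points, R_pred, t_pred, R_gt, t_gt):
    """Mean distance between model points under predicted and true poses."""
    pred = model_points @ R_pred.T + t_pred
    gt = model_points @ R_gt.T + t_gt
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def add_accuracy(errors, diameter, fraction=0.1):
    """Share of poses whose ADD error is below fraction * diameter."""
    return float(np.mean(np.asarray(errors) < fraction * diameter))

# Toy object: four points, roughly 0.1 m across (assumed values).
model_points = np.array([
    [0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00],
    [0.00, 0.05, 0.00],
    [0.00, 0.00, 0.05],
])
R = np.eye(3)
err = add_error(model_points, R, np.array([0.003, 0.004, 1.0]),
                R, np.array([0.0, 0.0, 1.0]))
acc = add_accuracy([err, 0.02, 0.008, 0.015], diameter=0.1)
print(err, acc)
```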

Keypoint selection matters – Using surface points that are well‑distributed across the object yields ~3 % higher ADD‑A than using only the 8 bounding‑box corners.
Depth adds a big boost – The cross‑fusion design raises ADD‑A by ~8 percentage points over the RGB‑only baseline (84.5 % to 92.4 %) while adding only ~7 ms of inference time (~28 ms to ~35 ms).
Training tricks pay off – Switching to SiLU activations and a cosine‑annealed learning rate contributed a gain of ~1.5 percentage points in the RGB‑only model (the table above shows 84.5 % to 86.2 % for the best activation setup).
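
The two training ingredients named above are simple to write down. The sketch below uses the standard definitions of SiLU, ReLU, and cosine annealing; the learning‑rate bounds and step count are placeholder values, not the paper's settings.

```python
import math

def relu(x):
    """Rectified linear unit: zero for negative inputs."""
    return max(0.0, x)

def silu(x):
    """SiLU (swish): x * sigmoid(x), smooth and slightly negative below zero."""
    return x / (1.0 + math.exp(-x))

def cosine_annealed_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    """Decay the learning rate from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Placeholder schedule of 100 steps (assumed, for illustration).
schedule = [cosine_annealed_lr(s, 100) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
print(relu(-1.0), silu(-1.0))
```

Unlike ReLU, SiLU passes a small negative signal for negative inputs, and the cosine schedule lowers the learning rate smoothly rather than in abrupt steps.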

Overall, the approach matches or exceeds recent heavyweight methods (e.g., DenseFusion, PVN3D) while keeping the model size under 15 M parameters.

Practical Implications

Robotics & Automation – The fast detection + lightweight heatmap regressor can run on edge devices (e.g., Jetson Nano) for real‑time bin‑picking or assembly line verification.
AR/VR & Mixed Reality – Accurate 6‑DoF pose from a single RGB frame enables stable object anchoring for AR overlays without needing expensive depth sensors.
Warehouse & Logistics – The optional RGB‑D fusion works well with commodity depth cameras (Intel RealSense, Azure Kinect), giving a plug‑and‑play upgrade path for existing vision pipelines.
Developer-friendly – Because the pipeline builds on widely used tools (the YOLOv10 detector family and PyTorch) and provides ready‑to‑run scripts, teams can integrate it into existing perception stacks with minimal code changes.

Limitations & Future Work

Dataset scope – Evaluation is limited to the LINEMOD benchmark; performance on highly cluttered or texture‑less objects remains untested.
Depth quality dependency – The fusion gains diminish with noisy or low‑resolution depth maps, suggesting a need for more robust depth preprocessing.
Keypoint annotation overhead – Defining optimal keypoint sets per object still requires manual CAD processing; automating this step could broaden applicability.

The authors propose extending the framework to handle multiple objects simultaneously, exploring self‑supervised keypoint discovery, and benchmarking on larger, more diverse datasets (e.g., YCB‑Video, BOP).

Authors

Ismail Aljosevic
Amir Masoud Almasi
Ana Parovic
Ashkan Shafiei

Paper Information

arXiv ID: 2605.08059v1
Categories: cs.CV, cs.RO
Published: May 8, 2026