[Paper] ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation

Published: March 9, 2026 at 01:49 PM EDT

Source: arXiv - 2603.08681v1

Overview

ER‑Pose challenges the dominant box‑driven approach to single‑stage, real‑time multi‑person pose estimation. By shifting the training focus from bounding‑box supervision to a keypoint‑driven paradigm, the authors report higher accuracy while keeping the model lightweight and fast enough for on‑device or edge deployment.

Key Contributions

Keypoint‑first formulation – removes the bounding‑box head and treats human pose as the primary prediction target.
Dynamic, keypoint‑driven sample assignment – aligns training sample selection with the OKS (Object Keypoint Similarity) metric used at test time, eliminating the need for NMS.
Redesigned prediction head – tailored for high‑dimensional, structured keypoint representations, improving feature utilization.
Smooth OKS‑based loss – stabilizes regression of keypoint coordinates, reducing gradient spikes common in pose‑specific losses.
ER‑Pose framework – a single‑stage architecture that outperforms the YOLO‑Pose baseline on COCO and CrowdPose with fewer parameters and higher FPS.
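
For reference, the OKS metric named in the contributions above is defined in the COCO keypoint evaluation roughly as follows. The symbols are COCO's, not notation taken from the paper:

```latex
\mathrm{OKS} \;=\;
\frac{\sum_i \exp\!\left(-\dfrac{d_i^{2}}{2\,s^{2}\,k_i^{2}}\right)\,\delta(v_i > 0)}
     {\sum_i \delta(v_i > 0)}
```

Here d_i is the Euclidean distance between predicted and ground‑truth keypoint i, s is the object scale (s² is the person's area), k_i is a per‑keypoint constant that controls falloff, and v_i is the ground‑truth visibility flag. OKS plays the role for poses that IoU plays for boxes: it lies in [0, 1] and equals 1 for a perfect match.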

Methodology

Traditional single‑stage pose detectors inherit the object‑detection pipeline: a backbone extracts features, a detection head predicts bounding boxes, and a separate branch regresses keypoints inside those boxes. This coupling forces the model to satisfy two competing objectives (box accuracy vs. keypoint accuracy), leading to sub‑optimal feature learning.

ER‑Pose discards the box branch entirely:

Backbone + Shared Feature Map – a YOLO‑style multi‑scale feature pyramid feeds a single prediction head.
Keypoint‑Centric Head – outputs a dense heatmap for each joint plus offset vectors that locate each keypoint relative to the pixel location, enabling direct regression without a bounding‑box scaffold.
Dynamic Sample Assignment – during training, each ground‑truth person is matched to the most appropriate feature cell based on OKS, not IoU. This yields dense, well‑aligned supervision.
Smooth OKS Loss – a differentiable approximation of the OKS metric that penalizes both localization error and confidence mismatch, providing a smoother gradient landscape for regression.
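
The assignment and loss steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`oks`, `assign_by_oks`, `oks_loss`), the `top_k` parameter, and the toy numbers are hypothetical, and a plain `1 - OKS` stands in for the paper's smooth OKS loss, whose exact form is not given in this summary.

```python
import math

def oks(pred, gt, vis, area, kappas):
    """COCO-style OKS between one predicted and one ground-truth pose.

    pred, gt: lists of (x, y); vis: visibility flags;
    area: person area (s^2); kappas: per-keypoint falloff constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, vis, kappas):
        if v > 0:  # only labelled keypoints contribute
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2.0 * area * k ** 2))
            count += 1
    return total / count if count else 0.0

def assign_by_oks(candidates, gt, vis, area, kappas, top_k=1):
    """Rank candidate poses by OKS to one ground truth; keep the top_k indices.

    Assumption: an IoU-based assigner would rank by box overlap instead.
    """
    scores = [oks(c, gt, vis, area, kappas) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:top_k], scores

def oks_loss(pred, gt, vis, area, kappas):
    """Stand-in regression loss: 1 - OKS (0 for a perfect prediction)."""
    return 1.0 - oks(pred, gt, vis, area, kappas)

# Toy example with two keypoints and two candidate predictions.
gt = [(10.0, 10.0), (20.0, 10.0)]
vis = [1, 1]
kappas = [0.1, 0.1]
area = 100.0
candidates = [
    [(30.0, 30.0), (40.0, 30.0)],  # far from the ground truth
    [(10.0, 11.0), (20.0, 10.0)],  # one pixel off on the first joint
]
matched, scores = assign_by_oks(candidates, gt, vis, area, kappas)
print(matched, [round(s, 3) for s in scores])
```

Ranking candidates by the same quantity that is used for evaluation is the point of the design: a prediction that scores well under the assigner is, by construction, one that scores well under the metric.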

Because there is no box branch, the model can skip the costly Non‑Maximum Suppression (NMS) step; the keypoint‑driven assignment already resolves duplicate detections.
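
To make the NMS‑free claim concrete, here is a hedged sketch of what a threshold‑only decoder could look like. The grid layout, the offset convention (offsets in grid‑cell units from the cell's top‑left corner), the `threshold` value, and the name `decode_poses` are all assumptions for illustration; the paper's actual head outputs may differ.

```python
def decode_poses(conf, offsets, stride, threshold=0.5):
    """Decode poses from a per-cell prediction grid without NMS.

    conf[r][c]: person confidence at grid cell (r, c).
    offsets[r][c]: list of (dx, dy) per keypoint, in grid-cell units,
        measured from the cell's top-left corner (assumed convention).
    Returns a list of (confidence, [(x, y), ...]) in input-image pixels.
    """
    poses = []
    for r, row in enumerate(conf):
        for c, score in enumerate(row):
            if score >= threshold:
                kpts = [((c + dx) * stride, (r + dy) * stride)
                        for dx, dy in offsets[r][c]]
                poses.append((score, kpts))
    # Sort by confidence only; no pairwise suppression is applied.
    poses.sort(key=lambda p: p[0], reverse=True)
    return poses

# Toy 2x2 grid with stride 8 and two keypoints per cell.
zero = [(0.0, 0.0), (0.0, 0.0)]
conf = [[0.1, 0.9],
        [0.2, 0.3]]
offsets = [[zero, [(0.5, 0.5), (1.0, -0.25)]],
           [zero, zero]]
poses = decode_poses(conf, offsets, stride=8)
print(poses)
```

A decoder like this only yields one pose per person if training has already pushed each person toward a single confident cell, which is what the keypoint‑driven assignment is said to achieve.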

Results & Findings

Dataset	Comparison	ΔAP ↑	Params ↓	FPS ↑
COCO (no pre‑train)	ER‑Pose‑n vs. YOLO‑Pose	+3.2	–	–
COCO (with pre‑train)	ER‑Pose‑n vs. YOLO‑Pose	+7.4	–	–
CrowdPose (no pre‑train)	ER‑Pose‑n vs. YOLO‑Pose	+6.7	–	–
CrowdPose (with pre‑train)	ER‑Pose‑n vs. YOLO‑Pose	+4.9	–	–

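“↑” marks columns where higher is better and “↓” columns where lower is better; AP values are differences relative to YOLO‑Pose, and “–” means no figure is given here.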

The gains come without extra backbone depth or width, which suggests that the keypoint‑driven redesign, rather than added capacity, is the main driver of performance. Inference is also reported to be faster because the model drops the box head and NMS, making it attractive for real‑time applications.

Practical Implications

Edge & Mobile Deployments – The reported lower parameter count and higher FPS make ER‑Pose a candidate for smartphones, AR glasses, or embedded GPUs, while delivering more reliable joint localization than the YOLO‑Pose baseline.
Simplified Pipelines – By removing the box branch and NMS, developers can integrate pose estimation as a single, self‑contained module, lowering engineering overhead.
Better Downstream Tasks – More accurate keypoints improve downstream analytics such as action recognition, gesture control, and sports performance tracking, especially in crowded scenes where box‑driven methods struggle.
Transferability – The keypoint‑driven loss and assignment strategy do not appear tied to a specific backbone; in principle they could be grafted onto other real‑time detectors (e.g., SSD, EfficientDet) to improve pose performance without redesigning the whole network, though no such experiment is reported here.

Limitations & Future Work

Bounding‑Box Utility Ignored – While eliminating boxes speeds up inference, some applications still need precise person localization (e.g., multi‑modal sensor fusion).
Dataset Bias – The method is evaluated on COCO and CrowdPose; performance on highly occluded or low‑resolution datasets remains to be verified.
Scalability to Ultra‑High‑Resolution – The current design assumes a fixed feature stride; adapting to very high‑resolution inputs may require additional multi‑scale tricks.

Future research directions suggested by the authors include extending the keypoint‑driven paradigm to 3‑D pose estimation, exploring hybrid box‑keypoint supervision for tasks that need both, and further optimizing the smooth OKS loss for even faster convergence.

Authors

Nanjun Li
Pinqi Cheng
Zean Liu
Minghe Tian
Xuanyin Wang

Paper Information

arXiv ID: 2603.08681v1
Categories: cs.CV
Published: March 9, 2026
