[Paper] Uncertainty-Aware Pedestrian Attribute Recognition via Evidential Deep Learning
Source: arXiv - 2604.26873v1
Overview
The paper introduces UAPAR, a novel framework that brings uncertainty awareness to pedestrian attribute recognition (PAR). By integrating Evidential Deep Learning (EDL) with a CLIP‑style vision‑language backbone, the system can flag predictions it isn’t confident about—an ability that traditional deterministic models lack, especially when dealing with low‑quality or noisy data.
Key Contributions
- First EDL‑based PAR system that quantifies epistemic uncertainty for each attribute.
- Region‑Aware Evidence Reasoning (RAER) module: uses cross‑attention and spatial priors to harvest fine‑grained local cues before feeding them to an evidential head.
- Uncertainty‑guided dual‑stage curriculum learning: dynamically adjusts the training curriculum to mitigate the impact of noisy labels.
- Extensive validation on four large‑scale datasets (PA100K, PETA, RAPv1, RAPv2) showing competitive or state‑of‑the‑art accuracy while also delivering reliable uncertainty estimates.
- Qualitative analysis demonstrating that high uncertainty scores correlate with challenging or mis‑predicted samples.
Methodology
- Backbone: the model builds on a CLIP‑style architecture (image encoder + text encoder) to obtain a rich joint representation of pedestrian images and attribute semantics.
- Region‑Aware Evidence Reasoning (RAER):
  - A cross‑attention block aligns image patches with attribute tokens, allowing the network to focus on the most informative regions (e.g., a backpack, shoes).
  - Spatial prior masks (derived from human pose or segmentation cues) guide attention toward plausible body parts, improving local feature extraction.
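The query-over-patches attention described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function name, the additive log-mask trick for the spatial prior, and all shapes are assumptions.

```python
import numpy as np

def cross_attention(attr_tokens, patch_feats, prior_mask=None):
    """Attribute tokens (queries) attend over image-patch features
    (keys/values); an optional spatial prior mask biases attention
    toward plausible body regions."""
    d = attr_tokens.shape[-1]
    scores = attr_tokens @ patch_feats.T / np.sqrt(d)   # (A, P) similarities
    if prior_mask is not None:
        scores = scores + np.log(prior_mask + 1e-9)     # soft bias, not hard gating
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over patches
    return attn @ patch_feats                           # (A, d) attended features

rng = np.random.default_rng(0)
attrs = rng.normal(size=(5, 16))      # 5 attribute tokens
patches = rng.normal(size=(49, 16))   # 7x7 grid of patch features
local_feats = cross_attention(attrs, patches)
```

Each attribute token ends up with its own pooled feature vector, which is what a per-attribute evidential head would then consume.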
- Evidential Head:
  - Instead of outputting a single softmax probability, the head predicts the evidence parameters of a Dirichlet distribution for each attribute.
  - From the Dirichlet, both the expected class probability and the epistemic uncertainty (variance) are derived.
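The standard subjective-logic formulation of EDL makes this concrete: non-negative evidence is mapped to Dirichlet concentration parameters, from which expected probabilities and an uncertainty mass fall out in closed form. The sketch below follows that common formulation; the function name and example evidence values are illustrative, not taken from the paper.

```python
import numpy as np

def dirichlet_belief(evidence):
    """Convert non-negative per-class evidence into expected class
    probabilities and an epistemic-uncertainty mass."""
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.shape[-1]                 # classes (2 for a binary attribute)
    alpha = evidence + 1.0                 # Dirichlet concentration parameters
    S = alpha.sum(axis=-1, keepdims=True)  # total evidence strength
    prob = alpha / S                       # expected class probabilities
    uncertainty = K / S.squeeze(-1)        # mass assigned to "I don't know"
    return prob, uncertainty

# Strong evidence for "backpack present" -> low uncertainty
p_hi, u_hi = dirichlet_belief([9.0, 1.0])
# Almost no evidence either way -> high uncertainty
p_lo, u_lo = dirichlet_belief([0.1, 0.1])
```

Note how total evidence, not just the class ratio, drives the uncertainty: a confident-looking probability backed by little evidence still yields a high uncertainty score.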
- Uncertainty‑Guided Curriculum Learning:
  - Stage 1: train on “easy” samples (low uncertainty) to establish a solid base.
  - Stage 2: gradually introduce harder, noisier samples, weighting their loss by the model’s current uncertainty estimate. This prevents noisy labels from overwhelming the learning signal.
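One simple way to realize the two stages is as per-sample loss weights derived from the model's uncertainty estimates. The sketch below is an assumption about the mechanism, not the paper's exact schedule: the hard threshold in stage 1 and the linear down-weighting in stage 2 are placeholder choices.

```python
import numpy as np

def curriculum_weights(uncertainty, stage, threshold=0.5):
    """Per-sample loss weights for a two-stage curriculum.
    Stage 1: keep only low-uncertainty ("easy") samples.
    Stage 2: include every sample, down-weighted by its uncertainty."""
    u = np.asarray(uncertainty, dtype=float)
    if stage == 1:
        return (u < threshold).astype(float)   # hard gate on easy samples
    return 1.0 - u                             # soft down-weighting of noisy ones

u = np.array([0.1, 0.4, 0.9])
w1 = curriculum_weights(u, stage=1)   # the most uncertain sample is dropped
w2 = curriculum_weights(u, stage=2)   # all samples kept, weighted down
```

Multiplying these weights into the per-sample loss keeps noisy, high-uncertainty labels from dominating the gradient while never permanently discarding them.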
The overall pipeline remains end‑to‑end trainable, requiring only standard image‑attribute annotations.
Results & Findings
| Dataset | mA (Mean Accuracy) | Uncertainty‑aware mA ↑ | Comments |
|---|---|---|---|
| PA100K | 85.2% | 86.1% | Better handling of occluded or low‑resolution pedestrians |
| PETA | 84.7% | 85.5% | Uncertainty scores correctly flag mislabeled attributes |
| RAPv1 | 88.3% | 89.0% | Gains most pronounced on attributes with high intra‑class variance (e.g., “carrying backpack”) |
| RAPv2 | 87.9% | 88.6% | Qualitative visualizations show high uncertainty on blurred or heavily occluded images |
Key takeaways
- Accuracy boost: modest but consistent improvements over strong baselines.
- Reliability: the epistemic uncertainty correlates strongly (Pearson ≈ 0.73) with prediction errors, enabling downstream systems to discard or re‑process doubtful outputs.
- Robustness to label noise: the curriculum learning scheme reduces performance degradation when up to 30% of training labels are corrupted.
Practical Implications
- Surveillance & Smart Cities: Operators can prioritize human review for high‑uncertainty detections (e.g., a person wearing a mask that obscures facial features), reducing false alarms.
- Autonomous Vehicles: Pedestrian attribute cues (e.g., “carrying a stroller”) influence motion planning; knowing when the attribute estimate is unreliable can trigger fallback strategies.
- Retail & Indoor Analytics: Attribute‑based customer profiling (age, gender, accessories) can be made privacy‑aware by refusing to act on uncertain predictions.
- Model Debugging: Developers get a built‑in diagnostic tool—high uncertainty highlights data collection gaps (poor lighting, unusual poses) that can be addressed in future dataset curation.
- Active Learning: Uncertainty scores can drive sample selection for human annotation, making data‑labeling pipelines more efficient.
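The active-learning use case above reduces to a one-line selection rule: label the samples the model is least sure about. The helper name and budget value below are illustrative.

```python
import numpy as np

def select_for_annotation(uncertainties, budget):
    """Return indices of the `budget` most uncertain samples,
    to be sent for human labeling."""
    u = np.asarray(uncertainties, dtype=float)
    return np.argsort(-u)[:budget]   # sort descending by uncertainty

picked = select_for_annotation([0.2, 0.9, 0.1, 0.7], budget=2)
```

In practice this acquisition step would run after each training round, with newly labeled samples folded back into the training set.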
Limitations & Future Work
- Computational overhead: The cross‑attention and evidential head add ~15% inference latency compared with a vanilla CLIP classifier, which may be a bottleneck for real‑time edge deployments.
- Scope of attributes: Experiments focus on binary attributes; extending to multi‑class or continuous traits (e.g., “height”) remains unexplored.
- Uncertainty calibration: While epistemic uncertainty is informative, the paper notes occasional over‑confidence on severely corrupted images; better calibration techniques (e.g., temperature scaling) could improve trustworthiness.
- Broader modalities: Incorporating temporal cues from video streams or depth sensors could further reduce uncertainty in challenging scenarios.
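For context on the temperature-scaling fix mentioned above: it rescales logits by a scalar T before the softmax, softening over-confident outputs. The sketch shows the transform only; fitting T (normally by minimizing NLL on a held-out validation set) is omitted, and the logit values are made up for illustration.

```python
import numpy as np

def apply_temperature(logits, T):
    """Soften (T > 1) or sharpen (T < 1) predicted probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

p_raw = apply_temperature([4.0, 0.0], T=1.0)   # over-confident
p_cal = apply_temperature([4.0, 0.0], T=2.0)   # softened
```

Because T is a single parameter fit post hoc, it leaves the model's rankings untouched and only adjusts confidence, which is why it is a common first choice for calibration.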
Overall, UAPAR opens a promising path toward trustworthy pedestrian attribute systems, giving developers the tools to not only predict “what” but also to gauge “how sure” they are about each prediction.
Authors
- Zhuofan Lou
- Shihang Zhang
- Fangle Zhu
- Shengjie Ye
- Pingyu Wang
Paper Information
- arXiv ID: 2604.26873v1
- Categories: cs.CV
- Published: April 29, 2026