[Paper] FR-GESTURE: An RGBD Dataset For Gesture-based Human-Robot Interaction In First Responder Operations
Source: arXiv - 2602.17573v1
Overview
First‑responder teams increasingly rely on robots to reach hazardous zones, but controlling those machines in chaotic, hands‑busy environments remains a challenge. The FR‑GESTURE paper introduces the first publicly available RGB‑D dataset tailored for gesture‑based control of unmanned ground vehicles (UGVs) in emergency scenarios, providing a solid foundation for AI models that can interpret natural hand signals under realistic field conditions.
Key Contributions
- Domain‑specific gesture set – 12 command gestures derived from actual first‑responder hand signals and refined through expert feedback.
- Comprehensive RGB‑D collection – 3,312 synchronized color‑depth image pairs captured from two camera viewpoints and seven distances, mimicking the varied perspectives a robot may see.
- Standardized evaluation protocols – Clear train/validation/test splits and performance metrics to benchmark future models on the same footing.
- Baseline benchmarks – Implementation of several state‑of‑the‑art CNN and point‑cloud networks (e.g., ResNet‑50, PointNet++) with reported accuracies, establishing a performance floor for the community.
- Open‑source release – Dataset, code, and evaluation scripts are publicly hosted (Zenodo DOI), encouraging reproducible research and rapid iteration.
Methodology
- Gesture Design – Researchers consulted tactical hand‑signal manuals and interviewed active first responders. After an iterative review, 12 gestures (e.g., “stop”, “move forward”, “turn left”) were selected.
- Data Capture – Volunteers performed each gesture while a calibrated RGB‑D sensor (Microsoft Azure Kinect) recorded from two fixed camera positions (front and side) at seven distances (0.5 m – 3 m). Each recording yields a synchronized RGB image and depth map, forming an RGB‑D pair.
- Annotation & Pre‑processing – Frames were automatically labeled with the corresponding command; noisy frames were manually pruned. Depth maps were aligned to RGB, normalized, and stored in a compact format.
- Baseline Models – The authors trained:
  - 2‑D CNNs on RGB images (ResNet‑18/50).
  - 3‑D CNNs on stacked RGB‑D tensors.
  - Point‑cloud networks (PointNet++) on depth‑derived point clouds.
- Evaluation – Follows the predefined splits, reporting overall accuracy, per‑class recall, and confusion matrices.
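The depth‑derived point clouds fed to the point‑cloud networks can be obtained by back‑projecting each valid depth pixel through the camera intrinsics. A minimal sketch of that conversion, assuming a simple pinhole model (the intrinsic values `fx`, `fy`, `cx`, `cy` below are placeholders, not the paper's calibration):

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) into a list of 3-D points.

    depth: 2-D nested list of depth values; zeros mark invalid pixels.
    fx, fy, cx, cy: pinhole-camera intrinsics (focal lengths and
    principal point, in pixels). Placeholder values, not from the paper.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip missing / invalid depth readings
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Toy 2x2 depth map at ~1 m; real Azure Kinect frames are far larger.
cloud = depth_to_point_cloud([[1.0, 1.0], [0.0, 1.0]],
                             fx=500.0, fy=500.0, cx=0.5, cy=0.5)
print(len(cloud))  # 3 valid points (one pixel had no depth)
```

The resulting (x, y, z) list can then be sampled to a fixed size before being passed to a PointNet++‑style network.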
Results & Findings
- Best overall accuracy: 87.3 % using a 3‑D CNN that fuses RGB and depth channels early in the network.
- Depth‑only models lag behind RGB‑only (≈ 73 % vs. ≈ 81 %) but excel on gestures performed at larger distances, where depth provides robust shape cues.
- Confusion patterns: “Turn left” vs. “Turn right” are the most frequently mixed classes, suggesting that subtle finger orientation is still hard for current models.
- Viewpoint robustness: Models trained on both viewpoints achieve ≈ 5 % higher accuracy than those trained on a single view, highlighting the importance of multi‑angle data.
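The early RGB–depth fusion behind the best result can be approximated by appending the aligned, normalized depth map as a fourth input channel. The paper does not spell out the exact fusion point, so the sketch below assumes input‑level concatenation feeding a 4‑channel convolutional stem:

```python
def fuse_rgbd(rgb, depth):
    """Concatenate an aligned depth map onto an RGB image as a 4th channel.

    rgb:   H x W x 3 nested lists with values in [0, 1]
    depth: H x W nested lists, already aligned to the RGB frame and
           normalized to [0, 1] (e.g. divided by the max sensing range)
    Returns an H x W x 4 array suitable for a 4-channel conv stem.
    """
    return [[list(px) + [d] for px, d in zip(rgb_row, d_row)]
            for rgb_row, d_row in zip(rgb, depth)]

rgb = [[(0.1, 0.2, 0.3)]]   # 1x1 toy image for illustration
depth = [[0.5]]
fused = fuse_rgbd(rgb, depth)
print(fused[0][0])  # [0.1, 0.2, 0.3, 0.5]
```

Input‑level fusion like this lets a standard CNN backbone learn joint color–shape features from the first layer onward, which matches the paper's observation that early fusion outperformed single‑modality baselines.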
Practical Implications
- Plug‑and‑play robot control: Developers can integrate a pre‑trained gesture recognizer into UGV software stacks, enabling hands‑free command issuance in smoke‑filled, noisy, or radio‑dead zones.
- Edge deployment: Since the baseline CNNs run at > 30 fps on a modest NVIDIA Jetson Nano, real‑time inference is feasible on board the robot, reducing latency and dependence on external compute.
- Training data for transfer learning: The RGB‑D pairs can be fine‑tuned for related tasks—e.g., gesture‑based drone piloting or wearable AR assistants for firefighters.
- Standard benchmark: The evaluation protocols give product teams a common yardstick to compare custom models, accelerating the maturity of gesture‑based HRI (human‑robot interaction) solutions for emergency services.
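The reported metrics (overall accuracy, per‑class recall, confusion matrix) are easy to reproduce when benchmarking a custom model against the protocol. A minimal sketch using hypothetical predictions over three of the twelve gesture classes:

```python
from collections import defaultdict

def evaluate(y_true, y_pred, classes):
    """Compute overall accuracy, per-class recall, and a confusion matrix."""
    confusion = {c: defaultdict(int) for c in classes}
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Recall per class: correct hits over all ground-truth instances.
    recall = {c: confusion[c][c] / max(1, sum(confusion[c].values()))
              for c in classes}
    return accuracy, recall, confusion

# Hypothetical labels; mirrors the paper's left/right confusion pattern.
truth = ["stop", "stop", "turn_left", "turn_right"]
preds = ["stop", "stop", "turn_right", "turn_right"]
acc, rec, cm = evaluate(truth, preds, ["stop", "turn_left", "turn_right"])
print(acc)               # 0.75
print(rec["turn_left"])  # 0.0
```

Reporting per‑class recall alongside accuracy matters here because confusable pairs such as "turn left" / "turn right" can hide behind a healthy overall score.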
Limitations & Future Work
- Controlled environment: Recordings were performed in a lab‑like setting; real disaster scenes introduce occlusions, extreme lighting, and protective gear that may degrade performance.
- Limited participant diversity: Only a handful of volunteers contributed gestures, so the dataset may not capture the full variability of hand sizes, skin tones, or motion styles across responder populations.
- Static gestures only: Dynamic sequences (e.g., waving to attract attention) are absent, limiting applicability to continuous interaction scenarios.
- Future directions suggested by the authors include expanding the dataset with outdoor, low‑light captures, adding more participants, incorporating temporal gesture streams, and exploring multimodal fusion with audio or inertial sensors.
The FR‑GESTURE dataset opens the door for developers to prototype robust, hands‑free control interfaces for rescue robots—an essential step toward safer, more efficient emergency response.
Authors
- Konstantinos Foteinos
- Georgios Angelidis
- Aggelos Psiris
- Vasileios Argyriou
- Panagiotis Sarigiannidis
- Georgios Th. Papadopoulos
Paper Information
- arXiv ID: 2602.17573v1
- Categories: cs.RO, cs.CV
- Published: February 19, 2026