[Paper] Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Published: April 23, 2026
Source: arXiv (2604.21926v1)

Overview

The paper “Seeing Without Eyes: 4D Human‑Scene Understanding from Wearable IMUs” demonstrates that a handful of everyday inertial sensors (e.g., earbuds, smartwatches, phones) can be used to reconstruct a person’s full 3‑D motion and a coarse 3‑D layout of the surrounding environment—without a single camera frame. By repurposing large language models (LLMs) as spatio‑temporal reasoning engines, the authors show that “vision‑free” perception can reach a level of coherence and stability previously only achievable with visual pipelines.

Key Contributions

  • IMU‑to‑4D framework: a novel pipeline that converts raw inertial‑measurement‑unit (IMU) streams into a unified 4‑D (3‑D space + time) representation of human pose and scene geometry.
  • LLM‑based spatio‑temporal reasoning: adapts pretrained large language models to interpret non‑visual sensor sequences, treating them as a “language” of motion.
  • End‑to‑end training on multi‑modal datasets: leverages existing motion‑capture and synthetic scene datasets to teach the model the relationship between body dynamics and surrounding structures.
  • Benchmark‑level performance: outperforms state‑of‑the‑art cascaded IMU‑only pipelines on several public human‑scene benchmarks, delivering smoother trajectories and more plausible scene layouts.
  • Hardware‑agnostic design: works with as few as three low‑cost IMUs placed on typical consumer devices, making large‑scale deployment feasible.

Methodology

  1. Sensor Collection – The system ingests synchronized tri‑axial accelerometer and gyroscope streams from a small set of wearables (e.g., left earbud, right wrist, pocket phone).
  2. Pre‑processing – Raw signals are windowed, normalized, and embedded into a token sequence similar to words in a sentence.
  3. LLM Encoder‑Decoder – A pretrained transformer‑based language model (e.g., LLaMA) is fine‑tuned to map the tokenized IMU stream to a latent representation that captures both body kinematics and environmental constraints.
  4. 4‑D Decoder – Two parallel heads decode the latent code:
    • Human pose head predicts SMPL‑X parameters for each frame, yielding a continuous 3‑D skeleton and mesh.
    • Scene head predicts a voxel‑grid or low‑resolution mesh of static obstacles (walls, furniture) that best explains the observed motion dynamics.
  5. Temporal Consistency Losses – Smoothness regularizers and physics‑inspired constraints (e.g., foot‑ground contact) are applied during training to enforce realistic motion over time.
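Step 2 (pre‑processing) can be sketched as a simple windowing‑and‑normalization routine. The window length, stride, and flattened‑token layout below are illustrative assumptions, not the paper's exact tokenizer:

```python
import numpy as np

def tokenize_imu(stream, window=60, stride=30):
    """Slice a synchronized IMU stream into overlapping windows and
    normalize each one, yielding a sequence of fixed-size "tokens".

    stream : (T, C) array, C = 6 channels per sensor
             (tri-axial accelerometer + gyroscope), stacked over sensors.
    Returns an (N, window * C) array of flattened token vectors.
    """
    tokens = []
    for start in range(0, len(stream) - window + 1, stride):
        chunk = stream[start:start + window]
        # Per-window z-normalization keeps tokens scale-invariant.
        chunk = (chunk - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-8)
        tokens.append(chunk.reshape(-1))
    return np.stack(tokens)

# Example: 3 IMUs * 6 channels = 18 channels, 300 samples (~3 s at 100 Hz).
stream = np.random.randn(300, 18)
tokens = tokenize_imu(stream)
print(tokens.shape)  # (9, 1080)
```

In a real system each flattened window would then be projected into the LLM's embedding space, analogous to word embeddings.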

The whole pipeline runs in a single forward pass, eliminating the need for separate detection, tracking, and reconstruction stages typical of visual pipelines.
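The temporal‑consistency losses in step 5 might look like the following minimal sketch, assuming predicted joint positions are available as arrays (the function names and shapes are illustrative, not the paper's API):

```python
import numpy as np

def smoothness_loss(joints):
    """Penalize frame-to-frame acceleration of predicted 3-D joints.
    joints : (T, J, 3) array of joint positions over time."""
    # Second-order finite difference approximates acceleration.
    accel = joints[2:] - 2 * joints[1:-1] + joints[:-2]
    return float((accel ** 2).mean())

def foot_contact_loss(foot_heights, contact_mask):
    """During detected ground contact, the foot should stay on the floor.
    foot_heights : (T,) vertical positions of a foot joint.
    contact_mask : (T,) boolean, True during stance phase."""
    return float((foot_heights[contact_mask] ** 2).mean())

# Toy check: a constant-velocity trajectory incurs (almost) no penalty.
T, J = 10, 24
t = np.arange(T, dtype=float).reshape(T, 1, 1)
joints = np.tile(t, (1, J, 3)) * 0.01   # linear motion
print(smoothness_loss(joints))  # ≈ 0, up to float rounding
```

During training such terms would be weighted and added to the pose and scene reconstruction losses.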

Results & Findings

| Dataset | Metric (Pose) | Metric (Scene) | Qualitative Note |
|---|---|---|---|
| Human3.6M‑Scene (synthetic) | MPJPE ↓ 12.4 mm (−18 % vs. baseline) | IoU ↑ 0.31 (↑ 22 %) | Recovered room layout despite occluded limbs |
| TotalCapture‑IMU | Acceleration error ↓ 9 % | Scene silhouette alignment ↑ 0.27 | Temporal drift virtually eliminated |
| Real‑world wearables (5 participants) | Consistent gait cycles, < 5 mm jitter | Detected walls/furniture within 0.2 m | Works with off‑the‑shelf earbuds & phone |

Key Takeaways

  • Temporal stability is markedly higher than cascaded methods that first estimate pose then scene.
  • The model can infer scene geometry purely from motion constraints (e.g., a sudden stop implies a wall).
  • Even with sparse sensor placement, the system recovers full‑body meshes that are visually comparable to camera‑based reconstructions.
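The second takeaway can be illustrated with a toy heuristic: a sharp deceleration along the walking direction hints at an obstacle just ahead of the stopping point. The threshold and the 0.3 m margin below are assumptions for illustration; the paper's model learns such constraints rather than hard‑coding them:

```python
import numpy as np

def infer_stop_obstacles(positions, velocities, decel_thresh=8.0, dt=0.01):
    """Guess obstacle locations from sudden stops in a planar trajectory.
    positions  : (T, 2) planar root positions.
    velocities : (T, 2) planar velocities.
    Returns a list of estimated obstacle points just beyond each stop."""
    speed = np.linalg.norm(velocities, axis=1)
    decel = -(np.diff(speed) / dt)          # positive when slowing down
    obstacles = []
    for t in np.where(decel > decel_thresh)[0]:
        direction = velocities[t] / (speed[t] + 1e-8)
        obstacles.append(positions[t] + 0.3 * direction)  # assumed 0.3 m margin
    return obstacles

# Example: walking at 1.5 m/s along +x, then stopping abruptly.
vel = np.zeros((60, 2))
vel[:50, 0] = 1.5
vel[50:55, 0] = np.linspace(1.2, 0.0, 5)
pos = np.cumsum(vel * 0.01, axis=0)
obs = infer_stop_obstacles(pos, vel)
print(len(obs) > 0)  # True: a wall is inferred near the stopping point
```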

Practical Implications

  • Privacy‑first AR/VR: Developers can build immersive experiences that track users’ full-body motion without cameras, sidestepping GDPR‑type concerns.
  • Workplace safety & ergonomics: Wearable IMUs can continuously monitor workers’ posture and detect hazardous obstacles in real time, feeding alerts to safety dashboards.
  • Robotics & human‑robot collaboration: Robots equipped only with inertial data from nearby humans can anticipate motions and adjust paths, reducing reliance on vision in low‑light or cluttered environments.
  • Energy‑efficient edge devices: IMU sampling consumes orders of magnitude less power than video capture; the proposed model can run on modern smartphones or dedicated micro‑controllers with on‑device inference accelerators.
  • Scalable data collection: Large‑scale studies (e.g., population‑level activity monitoring) become feasible because participants only need to wear everyday devices, not specialized camera rigs.
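The power claim can be made concrete with a back‑of‑envelope comparison. The figures below are typical order‑of‑magnitude values assumed for illustration, not measurements from the paper:

```python
# Back-of-envelope energy comparison (illustrative figures only).
IMU_POWER_MW = 1.0        # typical MEMS accelerometer + gyroscope at ~100 Hz
CAMERA_POWER_MW = 300.0   # typical mobile camera + ISP while streaming

hours = 8  # a full workday of continuous sensing
imu_mwh = IMU_POWER_MW * hours
cam_mwh = CAMERA_POWER_MW * hours
print(f"IMU: {imu_mwh} mWh, camera: {cam_mwh} mWh, "
      f"ratio: {cam_mwh / imu_mwh:.0f}x")
```

Even with conservative assumptions, the gap is two to three orders of magnitude, which is what makes always‑on, on‑device inference plausible.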

Limitations & Future Work

  • Coarse scene granularity – The reconstructed environment is limited to large static structures; fine details (e.g., small objects on a desk) remain out of reach.
  • Sensor placement sensitivity – Accuracy drops when the IMU set deviates significantly from the training configuration (e.g., missing wrist sensor).
  • Generalization to highly dynamic scenes – Rapid interactions with moving objects (e.g., catching a ball) challenge the current static‑scene assumption.
  • Future directions – the authors propose integrating additional low‑cost modalities (magnetometers, barometers), refining the scene decoder to output higher‑resolution meshes, and exploring self‑supervised pre‑training on massive unlabeled IMU streams to improve robustness across diverse wearables.

Authors

  • Hao‑Yu Hsu
  • Tianhang Cheng
  • Jing Wen
  • Alexander G. Schwing
  • Shenlong Wang

Paper Information

  • arXiv ID: 2604.21926v1
  • Categories: cs.CV
  • Published: April 23, 2026