[Paper] Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Source: arXiv - 2603.17980v1
Overview
The paper introduces Motion‑MLLM, a multimodal large language model that fuses video frames with egomotion data from an inertial measurement unit (IMU). By grounding visual content in real‑world motion cues, the system can reason about absolute scale and spatial relationships in 3‑D scenes while cutting the heavy compute cost of traditional point‑cloud or bird's‑eye‑view pipelines.
Key Contributions
- Egomotion‑aware representation: Integrates raw IMU signals (acceleration, gyroscope) with video to give the model a physical sense of movement.
- Cascaded motion‑visual keyframe filter: Uses both motion and visual similarity to pick a sparse set of representative frames, dramatically reducing the amount of data the model must process.
- Asymmetric cross‑modal fusion: Treats motion tokens as “intermediaries” that inject egomotion context into visual embeddings, preserving temporal continuity without blowing up token count.
- Cost‑effective performance: Achieves comparable or better accuracy than state‑of‑the‑art video‑only and explicit‑3D methods while being 1.4×–1.6× more cost‑effective (fewer FLOPs, lower latency).
- Broad evaluation suite: Demonstrates gains across several 3‑D scene‑understanding benchmarks (e.g., depth estimation, object scale inference, spatial question answering).
Methodology
- Data Capture – A standard RGB camera records video while an IMU attached to the same device streams 6‑DoF motion data (linear acceleration + angular velocity).
- Keyframe Selection –
  - Motion cue: Compute a short‑term trajectory descriptor from IMU readings; large changes indicate a potential keyframe.
  - Visual cue: Extract lightweight CNN features from each frame; high visual novelty also flags a keyframe.
  - The two cues are combined in a cascade: motion first prunes obvious redundancies, then visual similarity refines the set, yielding a compact frame subset (≈10–15 % of the original frames).
- Tokenization – Each selected frame is turned into visual tokens (ViT patches). Simultaneously, the IMU stream is discretized into motion tokens that encode velocity, orientation, and derived egomotion vectors.
- Asymmetric Cross‑Modal Fusion –
  - Motion tokens are fed into a shallow transformer that produces a motion context vector.
  - This vector is concatenated with visual tokens before the main LLM encoder, acting as a “bridge” that injects absolute scale and trajectory information without requiring a full‑blown 3‑D point‑cloud encoder.
- LLM Reasoning – The fused token sequence is processed by a pretrained multimodal LLM (e.g., LLaVA, MiniGPT‑4) which can now answer spatial queries, generate scene descriptions, or predict depth/scale.
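The cascaded motion‑visual keyframe filter described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `select_keyframes`, both thresholds, and the "change in consecutive IMU readings" trajectory descriptor are simplified placeholders for whatever the paper actually uses.

```python
import numpy as np

def select_keyframes(imu, frames, motion_thresh=0.5, visual_thresh=0.85):
    """Two-stage cascade: a cheap motion cue proposes candidates,
    then visual similarity prunes redundant ones.

    imu    : (T, 6) per-frame IMU readings (accel xyz, gyro xyz)
    frames : (T, D) lightweight per-frame visual features
    Returns the indices of the retained keyframes.
    """
    # Stage 1 (motion): a toy short-term trajectory descriptor -- the
    # magnitude of change between consecutive IMU readings. Large changes
    # mark candidate keyframes; the first frame is always kept.
    motion_delta = np.linalg.norm(np.diff(imu, axis=0), axis=1)
    candidates = [0] + [t + 1 for t, d in enumerate(motion_delta)
                        if d > motion_thresh]

    # Stage 2 (visual): drop candidates whose features are too similar
    # (cosine similarity) to the last kept frame, so only visually
    # novel frames survive.
    kept = [candidates[0]]
    for t in candidates[1:]:
        a, b = frames[kept[-1]], frames[t]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos < visual_thresh:
            kept.append(t)
    return kept
```

On real footage the motion stage alone typically removes most frames, which is what makes the cascade cheap: the visual comparison only runs on the surviving candidates.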
The whole pipeline runs end‑to‑end on a single GPU, and because only a handful of keyframes are processed, memory and compute footprints stay modest.
Results & Findings
| Task | Baseline (Video‑only) | Baseline (3‑D point cloud) | Motion‑MLLM |
|---|---|---|---|
| Absolute scale error (m) | ±0.48 | ±0.31 | ±0.27 |
| Depth prediction (RMSE) | 0.62 | 0.55 | 0.53 |
| Spatial QA accuracy | 71.2 % | 73.8 % | 75.6 % |
| FLOPs (relative) | 1.0× | 1.3× | 0.71× |
- Accuracy: Motion‑MLLM matches or exceeds the best 3‑D‑aware models on all tested metrics.
- Efficiency: By processing ~12 % of frames, the system reduces FLOPs by ~30 % and cuts inference latency from ~250 ms to ~170 ms per query on an RTX 3080.
- Robustness: Ablation studies show that removing the motion‑visual filter drops performance by ~8 % and that motion tokens alone (without visual context) are insufficient for fine‑grained reasoning, confirming the synergy of the two modalities.
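As a quick arithmetic cross‑check, the efficiency figures quoted above are self‑consistent: the table's 0.71× relative FLOPs and the 250 ms → 170 ms latency both land near the stated ~30 % reduction.

```python
# Cross-checking the reported efficiency figures against each other.
flops_reduction = 1.0 - 0.71           # from the table's relative-FLOPs column
latency_cut = (250 - 170) / 250        # ms per query on an RTX 3080
print(f"FLOPs reduced by {flops_reduction:.0%}, latency by {latency_cut:.0%}")
# prints: FLOPs reduced by 29%, latency by 32%
```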
Practical Implications
- AR/VR & Robotics: Devices equipped with cheap IMUs (smartphones, drones, wearables) can now obtain reliable 3‑D understanding without expensive LiDAR or depth sensors, enabling more accurate placement of virtual objects or safer navigation.
- Edge Deployment: The keyframe‑filtering strategy makes it feasible to run egomotion‑aware scene reasoning on edge GPUs or even on‑device NPUs, opening doors for real‑time assistance apps (e.g., “measure this object” or “find the exit”).
- Content Creation: Video editors and game developers can automatically generate scene‑scale metadata (camera path, object dimensions) from raw footage, streamlining VFX pipelines.
- Multimodal LLM Integration: The asymmetric fusion design can be retro‑fitted into existing multimodal LLMs, giving them a physical grounding layer without retraining the entire vision encoder.
Limitations & Future Work
- Sensor Quality Dependency: Noisy IMU data (common in low‑cost devices) can degrade motion token reliability; the authors suggest sensor‑fusion or denoising pre‑processors as a remedy.
- Static Scenes: The current framework assumes the camera motion dominates the scene dynamics; heavily moving objects (e.g., crowds) may still confuse scale inference.
- Generalization to Outdoor Environments: Benchmarks focus on indoor or controlled settings; extending to large‑scale outdoor scenes (e.g., autonomous driving) will require handling GPS drift and longer trajectories.
- Future Directions: The authors plan to explore learned motion token embeddings (instead of handcrafted discretization), integrate audio cues for richer context, and test on on‑device hardware accelerators.
Authors
- Shuyao Shi
- Kang G. Shin
Paper Information
- arXiv ID: 2603.17980v1
- Categories: cs.CV
- Published: March 18, 2026