[Paper] SARAH: Spatially Aware Real-time Agentic Humans
Source: arXiv - 2602.18432v1
Overview
The paper introduces SARAH, a real‑time, fully causal system that gives virtual agents spatial awareness during conversations. By jointly processing the user’s 3‑D position and the dyadic audio stream, SARAH generates full‑body motion that not only syncs gestures with speech but also orients the avatar toward the interlocutor and modulates gaze intensity on the fly. The authors present it as the first approach fast enough for streaming VR, running at hundreds of frames per second and opening the door to truly interactive digital humans.
Key Contributions
- First causal, streaming architecture for spatially‑aware conversational motion, enabling low‑latency, on‑device inference (e.g., on VR headsets).
- Hybrid VAE‑Transformer + flow‑matching model that interleaves latent tokens for continuous streaming and conditions motion on both user trajectory and audio.
- Gaze scoring mechanism with classifier‑free guidance, separating learned natural eye‑contact behavior from user‑controlled gaze intensity at inference time.
- State‑of‑the‑art motion quality on the Embody 3D dataset, achieving >300 FPS (≈3× faster than prior non‑causal baselines).
- Live VR demo validating end‑to‑end deployment in a telepresence scenario.
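The classifier‑free gaze guidance in the contributions list presumably follows the standard formulation: blend the model’s conditional and unconditional predictions with a user‑chosen scale. A minimal sketch of that blend, assuming the usual guidance equation (the function name and scale semantics are illustrative, not taken from the paper):

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, gaze_scale):
    """Classifier-free guidance blend of two model outputs.

    gaze_scale = 0 -> ignore the gaze conditioning entirely,
    gaze_scale = 1 -> the model's learned natural eye-contact behavior,
    gaze_scale > 1 -> exaggerate eye contact beyond the learned behavior.
    """
    return v_uncond + gaze_scale * (v_cond - v_uncond)
```

Because the blend is linear, a developer can sweep `gaze_scale` continuously at inference time without retraining, which matches the paper’s claim of separating learned behavior from user control.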
Methodology
- Input Stream – The system receives two real‑time streams: (a) the user’s 3‑D position (head and hand trackers) and (b) a dyadic audio waveform.
- Causal VAE‑Transformer – A variational auto‑encoder encodes past motion frames into a latent space. A causal transformer processes these latents token‑by‑token, ensuring that each output only depends on already‑observed data (no future look‑ahead).
- Interleaved Latent Tokens – To support continuous streaming, latent tokens are interleaved with “control tokens” that carry the latest user pose and audio features, allowing the model to update its prediction at every frame.
- Flow‑Matching Decoder – Instead of a traditional autoregressive decoder, a flow‑matching network directly maps latent trajectories to full‑body joint positions, conditioned on the user’s trajectory and audio. This yields fast, high‑fidelity motion synthesis.
- Gaze Scoring & Classifier‑Free Guidance – A lightweight classifier predicts a “gaze score” (how much eye contact is natural) from the latent representation. During inference, developers can steer this score up or down, effectively controlling how strongly the avatar looks at the user without retraining the model.
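The streaming pipeline above can be sketched very loosely as a per‑frame loop: pack the latest user pose and audio features into a control token, interleave it with the motion latent in a causal context, and decode the summary with a few Euler steps of a flow‑matching field. Everything here is a toy stand‑in (the dimensions, the mean‑pooled “transformer,” and the linear‑path velocity field are illustrative assumptions, not the paper’s architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, N_JOINTS = 16, 24   # illustrative sizes, not from the paper

def control_token(user_pos, audio_feat):
    """Pack the latest 3-D user position and audio features into one token."""
    tok = np.zeros(LATENT_DIM)
    tok[:3] = user_pos
    tok[3:3 + len(audio_feat)] = audio_feat
    return tok

def causal_step(context, new_token):
    """Stand-in for the causal transformer: attends only to past tokens."""
    context.append(new_token)            # no future look-ahead
    return np.mean(context, axis=0)      # toy summary of the token history

def flow_decode(latent, n_steps=8):
    """Toy flow-matching decode: Euler-integrate a velocity field that
    pushes Gaussian noise toward a latent-conditioned target pose."""
    target = np.tile(latent[:1], N_JOINTS * 3)   # placeholder conditioning
    x = rng.standard_normal(N_JOINTS * 3)
    for k in range(n_steps):
        t = k / n_steps
        v = (target - x) / (1.0 - t)     # exact field for a straight path
        x = x + v / n_steps
    return x.reshape(N_JOINTS, 3)

context = []
for frame in range(4):                            # simulate 4 streamed frames
    latent = rng.standard_normal(LATENT_DIM)      # stand-in motion latent
    ctrl = control_token(rng.standard_normal(3), rng.standard_normal(4))
    causal_step(context, latent)                  # motion latent token
    summary = causal_step(context, ctrl)          # interleaved control token
    pose = flow_decode(summary)                   # (N_JOINTS, 3) joint positions
```

The key structural point survives the simplification: each frame’s output depends only on tokens already in `context`, so the loop can run indefinitely on a live stream.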
Results & Findings
- Motion Quality – On the Embody 3D benchmark, SARAH outperforms prior non‑causal methods in both objective metrics (e.g., lower mean per‑joint error) and human perceptual studies (participants rated SARAH’s avatars as more natural).
- Speed – The pipeline runs at >300 FPS on a consumer‑grade GPU, which is roughly 3× faster than the best non‑causal baseline, meeting the sub‑10 ms latency requirement for immersive VR.
- Spatial Dynamics – The model captures subtle conversational cues: turning the torso toward a moving user, adjusting shoulder orientation, and modulating gaze based on the learned scoring function.
- Live Demo – In a VR telepresence test, users reported smoother interaction and a stronger sense of presence compared to a static‑avatar control.
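The reported throughput and latency figures are mutually consistent, which a line of arithmetic confirms: at 300 FPS, each frame takes about 3.33 ms to synthesize, comfortably inside the sub‑10 ms budget the paper cites for immersive VR.

```python
fps = 300                    # reported throughput on a consumer-grade GPU
frame_time_ms = 1000 / fps   # time to synthesize one motion frame
vr_budget_ms = 10            # sub-10 ms latency target cited for immersive VR
print(f"{frame_time_ms:.2f} ms per frame")  # prints "3.33 ms per frame"
```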
Practical Implications
- VR/AR Telepresence – Developers can embed SARAH into social VR platforms, enabling avatars that automatically face and look at participants, making remote meetings feel more natural.
- Digital Assistants & Training Simulations – Real‑time spatial awareness allows virtual coaches, customer‑service bots, or medical trainers to respond to a trainee’s position, improving engagement and learning outcomes.
- Game Development – NPCs can now maintain believable eye contact and body orientation during cut‑scenes or interactive dialogue without pre‑recorded animation blends.
- Low‑Latency Deployment – Because the method is fully causal and runs at hundreds of FPS, it fits on edge devices (standalone VR headsets, AR glasses) without needing a cloud backend, preserving privacy and reducing bandwidth.
Limitations & Future Work
- Dataset Bias – SARAH is trained on the Embody 3D dataset, which primarily contains scripted, dyadic conversations; performance in crowded or highly dynamic multi‑user scenes remains untested.
- Audio‑Only Conditioning – The system relies on clean dyadic audio; noisy environments or overlapping speech could degrade gesture‑speech alignment.
- Fine‑Grained Control – While gaze intensity is controllable, other expressive parameters (e.g., facial micro‑expressions, hand‑gesture style) are not explicitly exposed to developers.
- Future Directions – Extending the model to multi‑person settings, integrating robust speech‑separation front‑ends, and adding user‑editable style tokens for personalized motion are highlighted as next steps.
Authors
- Evonne Ng
- Siwei Zhang
- Zhang Chen
- Michael Zollhoefer
- Alexander Richard
Paper Information
- arXiv ID: 2602.18432v1
- Categories: cs.CV
- Published: February 20, 2026