[Paper] SARAH: Spatially Aware Real-time Agentic Humans
Source: arXiv - 2602.18432v1
Overview
The paper introduces SARAH, a real‑time, fully causal system that gives virtual agents spatial awareness during conversations. By jointly processing the user’s 3‑D position and the dyadic audio stream, SARAH generates full‑body motion that not only syncs gestures with speech but also orients the avatar toward the interlocutor and modulates gaze intensity on the fly. The authors present it as the first approach fast enough for streaming VR, running at hundreds of frames per second and opening the door to truly interactive digital humans.
Key Contributions
- First causal, streaming architecture for spatially‑aware conversational motion, enabling low‑latency, on‑device inference (e.g., on VR headsets).
- Hybrid VAE‑Transformer + flow‑matching model that interleaves latent tokens for continuous streaming and conditions motion on both user trajectory and audio.
- Gaze scoring mechanism with classifier‑free guidance, separating learned natural eye‑contact behavior from user‑controlled gaze intensity at inference time.
- State‑of‑the‑art motion quality on the Embody 3D dataset, achieving >300 FPS (≈3× faster than prior non‑causal baselines).
- Live VR demo validating end‑to‑end deployment in a telepresence scenario.
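The classifier‑free gaze guidance in the contributions list presumably follows the standard formulation: blend the model’s conditional and unconditional predictions with a user‑chosen scale. A minimal sketch of that blend, assuming the usual guidance equation (the function name and scale semantics are illustrative, not taken from the paper):

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, gaze_scale):
    """Classifier-free guidance blend of two model outputs.

    gaze_scale = 0 -> ignore the gaze conditioning entirely,
    gaze_scale = 1 -> the model's learned natural eye-contact behavior,
    gaze_scale > 1 -> exaggerate eye contact beyond the learned behavior.
    """
    return v_uncond + gaze_scale * (v_cond - v_uncond)
```

Because the blend is linear, a developer can sweep `gaze_scale` continuously at inference time without retraining, which matches the paper’s claim of separating learned behavior from user control.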
Methodology
- Input Stream – The system receives two real‑time streams: (a) the user’s 3‑D position (head and hand trackers) and (b) a dyadic audio waveform.
- Causal VAE‑Transformer – A variational auto‑encoder encodes past motion frames into a latent space. A causal transformer processes these latents token‑by‑token, ensuring that each output only depends on already‑observed data (no future look‑ahead).
- Interleaved Latent Tokens – To support continuous streaming, latent tokens are interleaved with “control tokens” that carry the latest user pose and audio features, allowing the model to update its prediction at every frame.
- Flow‑Matching Decoder – Instead of a traditional autoregressive decoder, a flow‑matching network directly maps latent trajectories to full‑body joint positions, conditioned on the user’s trajectory and audio. This yields fast, high‑fidelity motion synthesis.
- Gaze Scoring & Classifier‑Free Guidance – A lightweight classifier predicts a “gaze score” (how much eye contact is natural) from the latent representation. During inference, developers can steer this score up or down, effectively controlling how strongly the avatar looks at the user without retraining the model.
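The streaming pipeline above can be sketched very loosely as a per‑frame loop: pack the latest user pose and audio features into a control token, interleave it with the motion latent in a causal context, and decode the summary with a few Euler steps of a flow‑matching field. Everything here is a toy stand‑in (the dimensions, the mean‑pooled “transformer,” and the linear‑path velocity field are illustrative assumptions, not the paper’s architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, N_JOINTS = 16, 24   # illustrative sizes, not from the paper

def control_token(user_pos, audio_feat):
    """Pack the latest 3-D user position and audio features into one token."""
    tok = np.zeros(LATENT_DIM)
    tok[:3] = user_pos
    tok[3:3 + len(audio_feat)] = audio_feat
    return tok

def causal_step(context, new_token):
    """Stand-in for the causal transformer: attends only to past tokens."""
    context.append(new_token)            # no future look-ahead
    return np.mean(context, axis=0)      # toy summary of the token history

def flow_decode(latent, n_steps=8):
    """Toy flow-matching decode: Euler-integrate a velocity field that
    pushes Gaussian noise toward a latent-conditioned target pose."""
    target = np.tile(latent[:1], N_JOINTS * 3)   # placeholder conditioning
    x = rng.standard_normal(N_JOINTS * 3)
    for k in range(n_steps):
        t = k / n_steps
        v = (target - x) / (1.0 - t)     # exact field for a straight path
        x = x + v / n_steps
    return x.reshape(N_JOINTS, 3)

context = []
for frame in range(4):                            # simulate 4 streamed frames
    latent = rng.standard_normal(LATENT_DIM)      # stand-in motion latent
    ctrl = control_token(rng.standard_normal(3), rng.standard_normal(4))
    causal_step(context, latent)                  # motion latent token
    summary = causal_step(context, ctrl)          # interleaved control token
    pose = flow_decode(summary)                   # (N_JOINTS, 3) joint positions
```

The key structural point survives the simplification: each frame’s output depends only on tokens already in `context`, so the loop can run indefinitely on a live stream.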
Results & Findings
- Motion Quality – On the Embody 3D benchmark, SARAH outperforms prior non‑causal methods in both objective metrics (e.g., lower mean per‑joint error) and human perceptual studies (participants rated SARAH’s avatars as more natural).
- Speed – The pipeline runs at >300 FPS on a consumer‑grade GPU, which is roughly 3× faster than the best non‑causal baseline, meeting the sub‑10 ms latency requirement for immersive VR.
- Spatial Dynamics – The model captures subtle conversational cues: turning the torso toward a moving user, adjusting shoulder orientation, and modulating gaze based on the learned scoring function.
- Live Demo – In a VR telepresence test, users reported smoother interaction and a stronger sense of presence compared to a static‑avatar control.
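The reported throughput and latency figures are mutually consistent, which a line of arithmetic confirms: at 300 FPS, each frame takes about 3.33 ms to synthesize, comfortably inside the sub‑10 ms budget the paper cites for immersive VR.

```python
fps = 300                    # reported throughput on a consumer-grade GPU
frame_time_ms = 1000 / fps   # time to synthesize one motion frame
vr_budget_ms = 10            # sub-10 ms latency target cited for immersive VR
print(f"{frame_time_ms:.2f} ms per frame")  # prints "3.33 ms per frame"
```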
Practical Implications
- VR/AR Telepresence – Developers can embed SARAH into social VR platforms, enabling avatars that automatically face and look at participants, making remote meetings feel more natural.
- Digital Assistants & Training Simulations – Real‑time spatial awareness allows virtual coaches, customer‑service bots, or medical trainers to respond to a trainee’s position, improving engagement and learning outcomes.
- Game Development – NPCs can now maintain believable eye contact and body orientation during cut‑scenes or interactive dialogue without pre‑recorded animation blends.
- Low‑Latency Deployment – Because the method is fully causal and runs at hundreds of FPS, it fits on edge devices (standalone VR headsets, AR glasses) without needing a cloud backend, preserving privacy and reducing bandwidth.
Limitations & Future Work
- Dataset Bias – SARAH is trained on the Embody 3D dataset, which primarily contains scripted, dyadic conversations; performance in crowded or highly dynamic multi‑user scenes remains untested.
- Audio‑Only Conditioning – The system relies on clean dyadic audio; noisy environments or overlapping speech could degrade gesture‑speech alignment.
- Fine‑Grained Control – While gaze intensity is controllable, other expressive parameters (e.g., facial micro‑expressions, hand‑gesture style) are not explicitly exposed to developers.
- Future Directions – Extending the model to multi‑person settings, integrating robust speech‑separation front‑ends, and adding user‑editable style tokens for personalized motion are highlighted as next steps.
Authors
- Evonne Ng
- Siwei Zhang
- Zhang Chen
- Michael Zollhoefer
- Alexander Richard
Paper Information
- arXiv ID: 2602.18432v1
- Categories: cs.CV
- Published: February 20, 2026