[Paper] LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image
Source: arXiv - 2604.20800v1
Overview
Reconstructing 3‑D human‑object interactions (HOI) from a single RGB image is a core capability for robots, AR/VR, and any system that needs to “understand” how people manipulate objects. The new LEXIS framework tackles a long‑standing gap: most prior methods only predict binary contact (touch / no‑touch), ignoring the rich, continuous proximity that actually governs realistic interactions. By learning a discrete “interaction signature” space and coupling it with a diffusion‑based mesh generator, LEXIS produces dense proximity fields and physically plausible human‑and‑object meshes directly from a single picture.
Key Contributions
- InterFields representation – dense, continuous fields that encode the exact distance between every point on the human body and the object surface, capturing subtle near‑contact cues.
- LEXIS manifold – a learned discrete latent space of interaction signatures using a Vector‑Quantized VAE (VQ‑VAE), which compactly encodes typical HOI patterns conditioned on action and object geometry.
- LEXIS‑Flow diffusion model – a conditional diffusion pipeline that takes an image and a sampled LEXIS code to jointly predict human and object meshes together with their InterFields, eliminating the need for separate post‑hoc optimization.
- Guided refinement via InterFields – the predicted proximity fields act as a physical regularizer, automatically pulling mesh vertices into plausible contact zones during generation.
- State‑of‑the‑art results – on Open3DHOI and BEHAVE benchmarks, LEXIS‑Flow outperforms previous methods in mesh accuracy, contact precision, and perceived realism, while also showing better generalization to unseen actions/objects.
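The InterFields idea above can be sketched as a nearest-surface distance computation. The function name and the use of raw vertex clouds (rather than true mesh-surface distances) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def interfield(human_verts: np.ndarray, object_verts: np.ndarray) -> np.ndarray:
    """For each human vertex, the distance to the nearest object vertex.

    This gives a continuous proximity field: 0 at contact, growing with
    separation. (Sketch only; the paper defines InterFields on surfaces,
    not vertex clouds.)
    """
    # Pairwise distances of shape (H, O) via broadcasting, then the
    # minimum over object points for each human vertex.
    diff = human_verts[:, None, :] - object_verts[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)

# Toy example: the first "human" vertex touches the object (distance 0),
# the second hovers nearby (distance 1.5).
human = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
obj = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
print(interfield(human, obj))
```

Unlike a binary contact mask, the returned field preserves near-contact values, which is exactly the cue the paper argues binary labels discard.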
Methodology
- Data preparation – From annotated 3‑D HOI datasets, the authors compute dense distance fields (InterFields) giving, for each human vertex, its distance to the object surface, turning sparse binary contact labels into a continuous proximity field.
- Learning interaction signatures – A VQ‑VAE compresses each InterField into a short discrete code (the LEXIS token). The codebook learns a manifold of “typical” interaction patterns, much like a vocabulary of poses‑and‑object shapes.
- Diffusion‑based generation –
- Input: a single RGB image.
- The image encoder extracts visual features (pose, object shape, context).
- A diffusion model iteratively denoises a random latent, conditioned on both the image features and a sampled LEXIS token.
- The decoder outputs three things simultaneously: (i) a human mesh, (ii) an object mesh, and (iii) the InterField.
- Proximity‑aware refinement – The predicted InterField is used as a gradient field that pulls mesh vertices toward each other where the distance should be small, ensuring physically plausible contact without a separate optimization step.
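The vector-quantization step behind the LEXIS tokens can be sketched as a nearest-neighbour codebook lookup; the codebook size and latent dimension below are arbitrary illustrations, not values from the paper:

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Map each continuous latent to its nearest codebook entry.

    Returns the discrete token indices and the quantized vectors,
    mirroring the assignment step of a VQ-VAE. Sizes are illustrative.
    """
    # (N, K) squared distances between latents and codebook rows.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)       # discrete LEXIS-style codes
    return tokens, codebook[tokens]  # quantized latents for the decoder

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                 # 8 codes, 4-D latents (assumed)
z = codebook[3] + 0.01 * rng.normal(size=(1, 4))   # a latent near code 3
tokens, zq = quantize(z, codebook)
print(tokens)  # → [3]
```

At inference, sampling such a token (rather than a raw continuous latent) restricts generation to the learned manifold of plausible interaction patterns.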
The whole pipeline runs end‑to‑end, requiring only a single RGB image at inference time.
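The proximity-aware refinement can be sketched as a gradient step that nudges human vertices toward the object wherever the predicted InterField says the distance should be smaller than it currently is. The step size, function names, and vertex-cloud geometry are assumptions for illustration:

```python
import numpy as np

def refine_step(human_verts, object_verts, target_dist, lr=0.5):
    """One gradient step of proximity-guided refinement.

    Pulls each human vertex toward its nearest object vertex when its
    current distance exceeds the predicted target distance (and pushes
    it away when too close). Illustrative sketch, not the paper's exact
    refinement.
    """
    diff = human_verts[:, None, :] - object_verts[None, :, :]
    d = np.linalg.norm(diff, axis=-1)             # (H, O) distances
    j = d.argmin(axis=1)                          # index of nearest object vertex
    nearest = object_verts[j]
    cur = d[np.arange(len(human_verts)), j]       # current nearest distance
    # Unit direction from the object point toward the human vertex.
    direction = (human_verts - nearest) / np.maximum(cur, 1e-8)[:, None]
    err = cur - target_dist                       # > 0 means "too far, pull closer"
    return human_verts - lr * err[:, None] * direction

# A vertex predicted to be in contact (target distance 0) moves halfway
# toward the object with lr = 0.5: x goes from 1.0 to 0.5.
human = np.array([[1.0, 0.0, 0.0]])
obj = np.array([[0.0, 0.0, 0.0]])
refined = refine_step(human, obj, target_dist=np.array([0.0]))
```

Running such steps during denoising (rather than as post-hoc optimization) is what lets the predicted field act as a physical regularizer inside generation itself.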
Results & Findings
| Metric | Prior SOTA | LEXIS‑Flow |
|---|---|---|
| Mesh Chamfer Distance, human (lower = better) | 0.012 m | 0.008 m |
| Mesh Chamfer Distance, object (lower = better) | 0.015 m | 0.010 m |
| Contact Precision (higher = better) | 71 % | 84 % |
| Proximity F1‑score (higher = better) | 0.62 | 0.78 |
| Human perception rating, MTurk (higher = better) | 3.4 / 5 | 4.1 / 5 |
- Accuracy: Both human and object meshes are noticeably closer to ground‑truth geometry.
- Contact quality: The dense InterFields dramatically improve the detection of true contact zones, reducing false positives/negatives.
- Generalization: When tested on unseen object categories (e.g., kitchen utensils not present in training), LEXIS‑Flow retains >80 % of its performance, thanks to the abstract interaction signatures.
- Speed: The diffusion process converges in ~30 steps, yielding inference times of ~0.6 s on a single RTX 3090, comparable to existing mesh‑prediction networks.
Practical Implications
- Robotics & manipulation – Robots can infer not just where a human is holding an object but also how close the hand is to the object surface, enabling safer hand‑over or collaborative tasks.
- AR/VR avatars – Near‑real‑time generation of full‑body and object meshes from a webcam feed enables more immersive avatars that correctly grasp virtual props.
- Content creation – Game studios or VFX pipelines can auto‑generate interaction‑aware 3‑D scenes from concept art or reference photos, cutting manual rigging time.
- Safety monitoring – In industrial settings, detecting near‑misses (close proximity without contact) becomes feasible, supporting proactive hazard alerts.
- Data efficiency – Because LEXIS learns a compact signature space, the model can be fine‑tuned on a small set of new objects or actions, reducing annotation costs.
Limitations & Future Work
- Single‑view ambiguity – Extremely occluded interactions (e.g., hand fully hidden) still produce uncertain InterFields; multi‑view or depth cues could improve robustness.
- Discrete signature bottleneck – While VQ‑VAE discretization aids generalization, it may limit expressiveness for highly nuanced or novel interactions not represented in the codebook.
- Scalability to multiple objects – Current experiments focus on one object per scene; extending to cluttered environments with several interacting items remains an open challenge.
- Real‑time deployment – Although inference is sub‑second on high‑end GPUs, further optimization (e.g., distilled diffusion or lightweight encoders) is needed for mobile or edge devices.
The authors plan to explore multi‑object extensions, integrate temporal consistency for video streams, and release a lightweight version of LEXIS‑Flow for on‑device applications.
Authors
- Dimitrije Antić
- Alvaro Budria
- George Paschalidis
- Sai Kumar Dwivedi
- Dimitrios Tzionas
Paper Information
- arXiv ID: 2604.20800v1
- Categories: cs.CV, cs.LG
- Published: April 22, 2026