[Paper] Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations
Source: arXiv - 2601.11460v1
Overview
This paper tackles a core problem for robot manipulation: how to turn raw human demonstration videos into a compact, reusable representation of a task that captures what is being done (the semantics) and how objects move and relate to each other (the geometry). By introducing a semantic‑geometric task graph and a learning pipeline that separates scene understanding from action planning, the authors show that robots can predict and execute long‑horizon, bimanual tasks more reliably than with plain sequence models.
Key Contributions
- Semantic‑Geometric Task Graph (SGTG): A unified graph structure that encodes object identities, pairwise spatial relations, and their temporal evolution across a demonstration.
- Hybrid Encoder‑Decoder Architecture:
  - Encoder: A Message Passing Neural Network (MPNN) that ingests only the temporal scene graphs, learning a structured latent embedding of the task.
  - Decoder: A Transformer that conditions on the current action context to forecast future actions, involved objects, and their motions.
- Decoupling of Perception and Reasoning: By learning scene representations independently of the action‑conditioned decoder, the model can be reused across different downstream planners or control loops.
- Empirical Validation on Human Demonstrations: Demonstrates superior performance on datasets with high variability in action order and object interactions, where traditional sequence‑based baselines falter.
- Real‑World Transfer: Shows that the learned task graphs can be deployed on a physical bimanual robot for online action selection, proving the approach is more than a simulation curiosity.
Methodology
- Data Preparation – Temporal Scene Graphs
  - Each frame of a demonstration is parsed into a graph: nodes = objects (with class labels), edges = geometric relations (e.g., distance, relative pose).
  - Over time, these graphs form a temporal sequence that captures how relations evolve (e.g., a cup moving toward a hand).
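As a concrete illustration of this step, a per‑frame scene graph could be assembled roughly as follows. This is a minimal sketch using `torch` and `torch_geometric`; the feature layout, `build_frame_graph`, and `NUM_CLASSES` are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: one scene graph per frame as a torch_geometric Data object.
# Node features = object class one-hot + 3D position; edge features = pairwise
# relation (relative offset + distance). Names and sizes are assumptions.
import torch
from torch_geometric.data import Data

NUM_CLASSES = 10  # assumed size of the object-label vocabulary

def build_frame_graph(obj_classes, obj_positions):
    """obj_classes: (N,) int64 tensor; obj_positions: (N, 3) float tensor."""
    one_hot = torch.nn.functional.one_hot(obj_classes, NUM_CLASSES).float()
    x = torch.cat([one_hot, obj_positions], dim=-1)            # node features

    # Fully connected pairwise relations, excluding self-loops.
    n = obj_classes.size(0)
    src, dst = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    mask = src != dst
    edge_index = torch.stack([src[mask], dst[mask]], dim=0)    # (2, E)

    rel = obj_positions[edge_index[1]] - obj_positions[edge_index[0]]
    dist = rel.norm(dim=-1, keepdim=True)
    edge_attr = torch.cat([rel, dist], dim=-1)                 # (E, 4) geometric relation

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

# A demonstration then becomes a temporal sequence of such graphs:
# temporal_graphs = [build_frame_graph(cls_t, pos_t) for cls_t, pos_t in frames]
```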
- Encoder – Message Passing Neural Network
  - The MPNN aggregates information across nodes and edges at each timestep, producing a compact embedding that respects the graph’s structure.
  - Temporal dynamics are captured by feeding the per‑timestep embeddings into a recurrent module (or a simple temporal pooling).
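A minimal sketch of this encoder pattern, assuming an edge‑conditioned message‑passing layer (`NNConv` from `torch_geometric`) followed by per‑frame mean pooling and a GRU over the frame embeddings; the paper's actual layer types and sizes may differ.

```python
# Sketch only: MPNN per frame + recurrent temporal aggregation (assumed design).
import torch
import torch.nn as nn
from torch_geometric.nn import NNConv, global_mean_pool

class SceneGraphEncoder(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden=128):
        super().__init__()
        # Maps edge features to a per-edge weight matrix, as NNConv expects.
        edge_mlp = nn.Sequential(nn.Linear(edge_dim, 64), nn.ReLU(),
                                 nn.Linear(64, node_dim * hidden))
        self.mpnn = NNConv(node_dim, hidden, edge_mlp, aggr="mean")
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, graphs):
        """graphs: list of torch_geometric Data objects, one per frame."""
        frame_embs = []
        for g in graphs:
            h = torch.relu(self.mpnn(g.x, g.edge_index, g.edge_attr))
            batch = torch.zeros(h.size(0), dtype=torch.long)   # single graph per frame
            frame_embs.append(global_mean_pool(h, batch))       # (1, hidden)
        seq = torch.stack(frame_embs, dim=1)                    # (1, T, hidden)
        _, last = self.temporal(seq)
        return last.squeeze(0)                                  # task embedding, (1, hidden)
```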
- Decoder – Action‑Conditioned Transformer
  - Takes the task embedding and a prompt (the current action or a partial plan) as input.
  - Autoregressively predicts the next action token, the set of objects involved, and a parametric description of the expected object motion (e.g., a 6‑DoF pose delta).
  - The Transformer’s self‑attention lets the model reason about long‑range dependencies (e.g., “pick the cup only after the spoon is placed”).
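A sketch of what this decoding interface could look like. The three prediction heads mirror the outputs listed above, but the vocabulary sizes and architecture details are placeholders, not the authors' exact design.

```python
# Sketch only: causal Transformer decoder conditioned on the task embedding.
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    def __init__(self, n_actions=20, n_objects=15, d_model=128, n_layers=4):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)   # next-action logits
        self.object_head = nn.Linear(d_model, n_objects)   # involved-object logits (multi-label)
        self.motion_head = nn.Linear(d_model, 6)            # 6-DoF pose delta

    def forward(self, task_emb, action_tokens):
        """task_emb: (B, 1, d_model) from the encoder; action_tokens: (B, T) previous action ids."""
        tgt = self.action_emb(action_tokens)                           # (B, T, d_model)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory=task_emb, tgt_mask=causal)        # attends over the task embedding
        last = h[:, -1]                                                # representation for the next step
        return self.action_head(last), self.object_head(last), self.motion_head(last)
```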
- Training Objective
  - Multi‑task loss: cross‑entropy for action and object classification + regression loss for geometric motion predictions.
  - Teacher forcing during training ensures the decoder sees the ground‑truth previous actions, while at test time it operates in a fully autoregressive mode.
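One plausible way to write this multi‑task objective. The loss weights, the multi‑label treatment of object selection, and the L1 regression term are assumptions; the summary above only specifies cross‑entropy plus a regression loss.

```python
# Sketch only: combined multi-task loss with assumed weighting.
import torch.nn.functional as F

def sgtg_loss(action_logits, object_logits, motion_pred,
              action_gt, object_gt, motion_gt,
              w_action=1.0, w_object=1.0, w_motion=1.0):
    l_action = F.cross_entropy(action_logits, action_gt)              # next-action class
    l_object = F.binary_cross_entropy_with_logits(                    # multi-label: several objects
        object_logits, object_gt.float())                             # may be involved at once (assumed)
    l_motion = F.l1_loss(motion_pred, motion_gt)                      # 6-DoF regression (L1 assumed)
    return w_action * l_action + w_object * l_object + w_motion * l_motion
```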
- Deployment on a Bimanual Robot
  - The learned graph encoder runs on perception data (RGB‑D + object detection) to produce a task embedding in real time.
  - The decoder supplies the next action command, which is fed to a low‑level controller that executes the motion on the robot’s two arms.
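A rough sketch of this online loop, reusing the encoder/decoder sketches above. `camera.detect_objects`, `controller.execute`, and the special token ids are hypothetical placeholders, not a real robot or perception API.

```python
# Sketch only: perceive -> encode -> decode next action -> execute -> repeat.
import torch

ACTION_START, ACTION_DONE = 0, 1   # assumed special token ids, not from the paper

def run_bimanual_episode(encoder, decoder, camera, controller, max_steps=50):
    history = [ACTION_START]
    for _ in range(max_steps):
        # Perception: detect objects in the current RGB-D frame and build a scene graph.
        classes, positions = camera.detect_objects()            # hypothetical perception interface
        graphs = [build_frame_graph(classes, positions)]        # could also keep a sliding window of frames
        with torch.no_grad():
            task_emb = encoder(graphs).unsqueeze(1)             # (1, 1, d_model)
            tokens = torch.tensor([history])                    # (1, T) action ids so far
            a_logits, o_logits, motion = decoder(task_emb, tokens)

        action = a_logits.argmax(-1).item()
        if action == ACTION_DONE:
            break
        # Control: hand the predicted action, objects, and motion to the low-level controller.
        involved = (o_logits.sigmoid() > 0.5).squeeze(0)        # assumed multi-label threshold
        controller.execute(action, objects=involved, pose_delta=motion.squeeze(0))
        history.append(action)
```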
Results & Findings
| Metric | Sequence‑Only Baseline | Graph‑Based Model (SGTG) |
|---|---|---|
| Top‑1 Action Accuracy (high‑variability tasks) | 62 % | 78 % |
| Object‑Selection F1 | 55 % | 71 % |
| Motion Prediction MAE (cm) | 3.4 | 1.9 |
| Planning Horizon (steps correctly predicted) | 4 | 7 |
- Robustness to Variability: When the same task is demonstrated with different object orders or hand‑over‑hand swaps, the graph model maintains high accuracy, whereas sequence models drop sharply.
- Generalization to Unseen Objects: Because the encoder learns relational patterns rather than raw pixel sequences, it can extrapolate to new objects that share similar geometric roles (e.g., a different mug).
- Real‑World Trial: On a dual‑arm platform, the robot successfully assembled a “plate‑and‑cutlery” setup from human demos, achieving an 85 % success rate over 30 trials, compared to 60 % for a flat‑sequence LSTM baseline.
Practical Implications
- Reusable Task Abstractions: Developers can store an SGTG embedding once a task is demonstrated and reuse it across multiple robots or simulation environments without retraining the whole pipeline.
- Plug‑and‑Play Planning: Since the decoder is action‑conditioned, it can be swapped with existing task‑level planners (e.g., behavior trees) that provide the “prompt” context.
- Better Generalization for Home‑Robotics: Household robots often encounter novel object arrangements; a graph‑centric view lets them infer appropriate actions even when the exact sequence was never seen.
- Scalable Data Collection: Human tele‑operation or video capture can be turned into scene graphs automatically (using off‑the‑shelf object detectors), reducing the need for hand‑crafted annotations.
- Potential for Multi‑Agent Coordination: The same representation could be extended to coordinate multiple robots (or humans) by adding agent nodes and inter‑agent edges, opening doors for collaborative manufacturing or assistive care.
Limitations & Future Work
- Reliance on Accurate Perception: The pipeline assumes reliable object detection and pose estimation; noisy sensors can corrupt the scene graph and degrade performance.
- Fixed Graph Topology: Current graphs only model pairwise relations; higher‑order interactions (e.g., three‑object constraints) are not explicitly captured.
- Scalability to Very Long Horizons: While the Transformer handles longer sequences better than RNNs, inference time grows with horizon length, which may be problematic for real‑time control in complex tasks.
- Future Directions Suggested by the Authors:
- Integrate uncertainty‑aware perception modules to make the graph robust to detection errors.
- Explore hierarchical graph constructions that abstract groups of objects into “compound nodes.”
- Combine the SGTG with reinforcement learning to fine‑tune the decoder’s action proposals based on actual execution feedback.
Bottom line: By marrying semantic task graphs with modern neural encoders/decoders, this work offers a practical pathway for developers to give robots a deeper, more flexible understanding of human demonstrations—moving us a step closer to truly adaptable, task‑agnostic manipulation systems.
Authors
- Franziska Herbert
- Vignesh Prasad
- Han Liu
- Dorothea Koert
- Georgia Chalvatzaki
Paper Information
- arXiv ID: 2601.11460v1
- Categories: cs.RO, cs.LG
- Published: January 16, 2026