[Paper] Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations
Source: arXiv - 2601.11460v1
Overview
This paper tackles a core problem for robot manipulation: how to turn raw human demonstration videos into a compact, reusable representation of a task that captures what is being done (the semantics) and how objects move and relate to each other (the geometry). By introducing a semantic‑geometric task graph and a learning pipeline that separates scene understanding from action planning, the authors show that robots can predict and execute long‑horizon, bimanual tasks more reliably than with plain sequence models.
Key Contributions
- Semantic‑Geometric Task Graph (SGTG): A unified graph structure that encodes object identities, pairwise spatial relations, and their temporal evolution across a demonstration.
- Hybrid Encoder‑Decoder Architecture:
  - Encoder: A Message Passing Neural Network (MPNN) that ingests only the temporal scene graphs, learning a structured latent embedding of the task.
  - Decoder: A Transformer that conditions on the current action context to forecast future actions, involved objects, and their motions.
- Decoupling of Perception and Reasoning: By learning scene representations independently of the action‑conditioned decoder, the model can be reused across different downstream planners or control loops.
- Empirical Validation on Human Demonstrations: Demonstrates superior performance on datasets with high variability in action order and object interactions, where traditional sequence‑based baselines falter.
- Real‑World Transfer: Shows that the learned task graphs can be deployed on a physical bimanual robot for online action selection, proving the approach is more than a simulation curiosity.
Methodology
- Data Preparation – Temporal Scene Graphs
  - Each frame of a demonstration is parsed into a graph: nodes = objects (with class labels), edges = geometric relations (e.g., distance, relative pose).
  - Over time, these graphs form a temporal sequence that captures how relations evolve (e.g., a cup moving toward a hand).
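As a concrete illustration of this step, a per‑frame scene graph could be assembled roughly as follows. This is a minimal sketch using `torch` and `torch_geometric`; the feature layout, `build_frame_graph`, and `NUM_CLASSES` are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: one scene graph per frame as a torch_geometric Data object.
# Node features = object class one-hot + 3D position; edge features = pairwise
# relation (relative offset + distance). Names and sizes are assumptions.
import torch
from torch_geometric.data import Data

NUM_CLASSES = 10  # assumed size of the object-label vocabulary

def build_frame_graph(obj_classes, obj_positions):
    """obj_classes: (N,) int64 tensor; obj_positions: (N, 3) float tensor."""
    one_hot = torch.nn.functional.one_hot(obj_classes, NUM_CLASSES).float()
    x = torch.cat([one_hot, obj_positions], dim=-1)            # node features

    # Fully connected pairwise relations, excluding self-loops.
    n = obj_classes.size(0)
    src, dst = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    mask = src != dst
    edge_index = torch.stack([src[mask], dst[mask]], dim=0)    # (2, E)

    rel = obj_positions[edge_index[1]] - obj_positions[edge_index[0]]
    dist = rel.norm(dim=-1, keepdim=True)
    edge_attr = torch.cat([rel, dist], dim=-1)                 # (E, 4) geometric relation

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

# A demonstration then becomes a temporal sequence of such graphs:
# temporal_graphs = [build_frame_graph(cls_t, pos_t) for cls_t, pos_t in frames]
```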
- Encoder – Message Passing Neural Network
  - The MPNN aggregates information across nodes and edges at each timestep, producing a compact embedding that respects the graph’s structure.
  - Temporal dynamics are captured by feeding the per‑timestep embeddings into a recurrent module (or a simple temporal pooling).
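A minimal sketch of this encoder pattern, assuming an edge‑conditioned message‑passing layer (`NNConv` from `torch_geometric`) followed by per‑frame mean pooling and a GRU over the frame embeddings; the paper's actual layer types and sizes may differ.

```python
# Sketch only: MPNN per frame + recurrent temporal aggregation (assumed design).
import torch
import torch.nn as nn
from torch_geometric.nn import NNConv, global_mean_pool

class SceneGraphEncoder(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden=128):
        super().__init__()
        # Maps edge features to a per-edge weight matrix, as NNConv expects.
        edge_mlp = nn.Sequential(nn.Linear(edge_dim, 64), nn.ReLU(),
                                 nn.Linear(64, node_dim * hidden))
        self.mpnn = NNConv(node_dim, hidden, edge_mlp, aggr="mean")
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, graphs):
        """graphs: list of torch_geometric Data objects, one per frame."""
        frame_embs = []
        for g in graphs:
            h = torch.relu(self.mpnn(g.x, g.edge_index, g.edge_attr))
            batch = torch.zeros(h.size(0), dtype=torch.long)   # single graph per frame
            frame_embs.append(global_mean_pool(h, batch))       # (1, hidden)
        seq = torch.stack(frame_embs, dim=1)                    # (1, T, hidden)
        _, last = self.temporal(seq)
        return last.squeeze(0)                                  # task embedding, (1, hidden)
```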
- Decoder – Action‑Conditioned Transformer
  - Takes the task embedding and a prompt (the current action or a partial plan) as input.
  - Autoregressively predicts the next action token, the set of objects involved, and a parametric description of the expected object motion (e.g., a 6‑DoF pose delta).
  - The Transformer’s self‑attention lets the model reason about long‑range dependencies (e.g., “pick the cup only after the spoon is placed”).
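A sketch of what this decoding interface could look like. The three prediction heads mirror the outputs listed above, but the vocabulary sizes and architecture details are placeholders, not the authors' exact design.

```python
# Sketch only: causal Transformer decoder conditioned on the task embedding.
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    def __init__(self, n_actions=20, n_objects=15, d_model=128, n_layers=4):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)   # next-action logits
        self.object_head = nn.Linear(d_model, n_objects)   # involved-object logits (multi-label)
        self.motion_head = nn.Linear(d_model, 6)            # 6-DoF pose delta

    def forward(self, task_emb, action_tokens):
        """task_emb: (B, 1, d_model) from the encoder; action_tokens: (B, T) previous action ids."""
        tgt = self.action_emb(action_tokens)                           # (B, T, d_model)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory=task_emb, tgt_mask=causal)        # attends over the task embedding
        last = h[:, -1]                                                # representation for the next step
        return self.action_head(last), self.object_head(last), self.motion_head(last)
```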
- Training Objective
  - Multi‑task loss: cross‑entropy for action and object classification + regression loss for geometric motion predictions.
  - Teacher forcing during training ensures the decoder sees the ground‑truth previous actions, while at test time it operates in a fully autoregressive mode.
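One plausible way to write this multi‑task objective. The loss weights, the multi‑label treatment of object selection, and the L1 regression term are assumptions; the summary above only specifies cross‑entropy plus a regression loss.

```python
# Sketch only: combined multi-task loss with assumed weighting.
import torch.nn.functional as F

def sgtg_loss(action_logits, object_logits, motion_pred,
              action_gt, object_gt, motion_gt,
              w_action=1.0, w_object=1.0, w_motion=1.0):
    l_action = F.cross_entropy(action_logits, action_gt)              # next-action class
    l_object = F.binary_cross_entropy_with_logits(                    # multi-label: several objects
        object_logits, object_gt.float())                             # may be involved at once (assumed)
    l_motion = F.l1_loss(motion_pred, motion_gt)                      # 6-DoF regression (L1 assumed)
    return w_action * l_action + w_object * l_object + w_motion * l_motion
```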
- Deployment on a Bimanual Robot
  - The learned graph encoder runs on perception data (RGB‑D + object detection) to produce a task embedding in real time.
  - The decoder supplies the next action command, which is fed to a low‑level controller that executes the motion on the robot’s two arms.
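A rough sketch of this online loop, reusing the encoder/decoder sketches above. `camera.detect_objects`, `controller.execute`, and the special token ids are hypothetical placeholders, not a real robot or perception API.

```python
# Sketch only: perceive -> encode -> decode next action -> execute -> repeat.
import torch

ACTION_START, ACTION_DONE = 0, 1   # assumed special token ids, not from the paper

def run_bimanual_episode(encoder, decoder, camera, controller, max_steps=50):
    history = [ACTION_START]
    for _ in range(max_steps):
        # Perception: detect objects in the current RGB-D frame and build a scene graph.
        classes, positions = camera.detect_objects()            # hypothetical perception interface
        graphs = [build_frame_graph(classes, positions)]        # could also keep a sliding window of frames
        with torch.no_grad():
            task_emb = encoder(graphs).unsqueeze(1)             # (1, 1, d_model)
            tokens = torch.tensor([history])                    # (1, T) action ids so far
            a_logits, o_logits, motion = decoder(task_emb, tokens)

        action = a_logits.argmax(-1).item()
        if action == ACTION_DONE:
            break
        # Control: hand the predicted action, objects, and motion to the low-level controller.
        involved = (o_logits.sigmoid() > 0.5).squeeze(0)        # assumed multi-label threshold
        controller.execute(action, objects=involved, pose_delta=motion.squeeze(0))
        history.append(action)
```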
Results & Findings
| Metric | Sequence‑Only Baseline | Graph‑Based Model (SGTG) |
|---|---|---|
| Top‑1 Action Accuracy (high‑variability tasks) | 62 % | 78 % |
| Object‑Selection F1 | 55 % | 71 % |
| Motion Prediction MAE (cm) | 3.4 | 1.9 |
| Planning Horizon (steps correctly predicted) | 4 | 7 |
- Robustness to Variability: When the same task is demonstrated with different object orders or hand‑over‑hand swaps, the graph model maintains high accuracy, whereas sequence models drop sharply.
- Generalization to Unseen Objects: Because the encoder learns relational patterns rather than raw pixel sequences, it can extrapolate to new objects that share similar geometric roles (e.g., a different mug).
- Real‑World Trial: On a dual‑arm platform, the robot successfully assembled a “plate‑and‑cutlery” setup from human demos, achieving an 85 % success rate over 30 trials, compared to 60 % for a flat‑sequence LSTM baseline.
Practical Implications
- Reusable Task Abstractions: Developers can store an SGTG embedding once a task is demonstrated and reuse it across multiple robots or simulation environments without retraining the whole pipeline.
- Plug‑and‑Play Planning: Since the decoder is action‑conditioned, it can be swapped with existing task‑level planners (e.g., behavior trees) that provide the “prompt” context.
- Better Generalization for Home‑Robotics: Household robots often encounter novel object arrangements; a graph‑centric view lets them infer appropriate actions even when the exact sequence was never seen.
- Scalable Data Collection: Human tele‑operation or video capture can be turned into scene graphs automatically (using off‑the‑shelf object detectors), reducing the need for hand‑crafted annotations.
- Potential for Multi‑Agent Coordination: The same representation could be extended to coordinate multiple robots (or humans) by adding agent nodes and inter‑agent edges, opening doors for collaborative manufacturing or assistive care.
Limitations & Future Work
- Reliance on Accurate Perception: The pipeline assumes reliable object detection and pose estimation; noisy sensors can corrupt the scene graph and degrade performance.
- Fixed Graph Topology: Current graphs only model pairwise relations; higher‑order interactions (e.g., three‑object constraints) are not explicitly captured.
- Scalability to Very Long Horizons: While the Transformer handles longer sequences better than RNNs, inference time grows with horizon length, which may be problematic for real‑time control in complex tasks.
- Future Directions Suggested by the Authors:
- Integrate uncertainty‑aware perception modules to make the graph robust to detection errors.
- Explore hierarchical graph constructions that abstract groups of objects into “compound nodes.”
- Combine the SGTG with reinforcement learning to fine‑tune the decoder’s action proposals based on actual execution feedback.
Bottom line: By marrying semantic task graphs with modern neural encoders/decoders, this work offers a practical pathway for developers to give robots a deeper, more flexible understanding of human demonstrations—moving us a step closer to truly adaptable, task‑agnostic manipulation systems.
Authors
- Franziska Herbert
- Vignesh Prasad
- Han Liu
- Dorothea Koert
- Georgia Chalvatzaki
Paper Information
- arXiv ID: 2601.11460v1
- Categories: cs.RO, cs.LG
- Published: January 16, 2026