[Paper] Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations

Published: January 16, 2026 at 12:35 PM EST
5 min read
Source: arXiv - 2601.11460v1

Overview

This paper tackles a core problem for robot manipulation: how to turn raw human demonstration videos into a compact, reusable representation of a task that captures what is being done (the semantics) and how objects move and relate to each other (the geometry). By introducing a semantic‑geometric task graph and a learning pipeline that separates scene understanding from action planning, the authors show that robots can predict and execute long‑horizon, bimanual tasks more reliably than with plain sequence models.

Key Contributions

  • Semantic‑Geometric Task Graph (SGTG): A unified graph structure that encodes object identities, pairwise spatial relations, and their temporal evolution across a demonstration (a minimal data‑structure sketch follows this list).
  • Hybrid Encoder‑Decoder Architecture:
    • Encoder: A Message Passing Neural Network (MPNN) that ingests only the temporal scene graphs, learning a structured latent embedding of the task.
    • Decoder: A Transformer that conditions on the current action context to forecast future actions, involved objects, and their motions.
  • Decoupling of Perception and Reasoning: By learning scene representations independently of the action‑conditioned decoder, the model can be reused across different downstream planners or control loops.
  • Empirical Validation on Human Demonstrations: Demonstrates superior performance on datasets with high variability in action order and object interactions, where traditional sequence‑based baselines falter.
  • Real‑World Transfer: Shows that the learned task graphs can be deployed on a physical bimanual robot for online action selection, proving the approach is more than a simulation curiosity.
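
To make the graph structure concrete, here is a minimal Python sketch of how a temporal scene graph might be represented. The class and field names (ObjectNode, RelationEdge, distance_m, rel_pose) are illustrative assumptions, not the paper's actual data schema.

```python
# Minimal sketch of a semantic-geometric task graph (SGTG).
# Names and fields are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    node_id: int
    class_label: str                # semantic identity, e.g. "cup"
    pose: Tuple[float, ...]         # 6-DoF pose estimate (x, y, z, roll, pitch, yaw)


@dataclass
class RelationEdge:
    src: int                        # node_id of the first object
    dst: int                        # node_id of the second object
    distance_m: float               # pairwise Euclidean distance
    rel_pose: Tuple[float, ...]     # relative 6-DoF transform between the two objects


@dataclass
class SceneGraph:
    """One frame of a demonstration: objects plus pairwise geometric relations."""
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[RelationEdge] = field(default_factory=list)


# A demonstration becomes a temporal sequence of scene graphs,
# capturing how relations evolve (e.g. a cup moving toward a hand).
TemporalSceneGraph = List[SceneGraph]
```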

Methodology

  1. Data Preparation – Temporal Scene Graphs

    • Each frame of a demonstration is parsed into a graph: nodes = objects (with class labels), edges = geometric relations (e.g., distance, relative pose).
    • Over time, these graphs form a temporal sequence that captures how relations evolve (e.g., a cup moving toward a hand).
  2. Encoder – Message Passing Neural Network

    • The MPNN aggregates information across nodes and edges at each timestep, producing a compact embedding that respects the graph’s structure.
    • Temporal dynamics are captured by feeding the per‑timestep embeddings into a recurrent module (or a simple temporal pooling); a combined encoder‑decoder training sketch appears after this list.
  3. Decoder – Action‑Conditioned Transformer

    • Takes the task embedding and a prompt (the current action or a partial plan) as input.
    • Autoregressively predicts the next action token, the set of objects involved, and a parametric description of the expected object motion (e.g., a 6‑DoF pose delta).
    • The Transformer’s self‑attention lets the model reason about long‑range dependencies (e.g., “pick the cup only after the spoon is placed”).
  4. Training Objective

    • Multi‑task loss: cross‑entropy for action and object classification + regression loss for geometric motion predictions.
    • Teacher‑forcing during training ensures the decoder sees the ground‑truth previous actions, while at test time it operates in a fully autoregressive mode.
  5. Deployment on a Bimanual Robot

    • The learned graph encoder runs on perception data (RGB‑D + object detection) to produce a task embedding in real time.
    • The decoder supplies the next action command, which is fed to a low‑level controller that executes the motion on the robot’s two arms (see the online control‑loop sketch after this list).
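
To make steps 2–4 concrete, here is a compact PyTorch sketch of how an MPNN encoder, a Transformer decoder, and the multi‑task loss could fit together. The layer sizes, pooling choices, and output heads are assumptions for illustration; the paper's exact architecture and hyperparameters may differ.

```python
# Illustrative sketch of the encoder-decoder pipeline (steps 2-4).
# Dimensions, pooling, and head layout are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPNNEncoder(nn.Module):
    """One round of message passing per frame, then a GRU over time."""

    def __init__(self, node_dim, edge_dim, hidden=128):
        super().__init__()
        self.msg = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.update = nn.GRUCell(hidden, node_dim)          # per-node state update
        self.temporal = nn.GRU(node_dim, hidden, batch_first=True)

    def forward(self, node_feats, edge_index, edge_feats):
        # node_feats: (T, N, node_dim); edge_index: (2, E); edge_feats: (T, E, edge_dim)
        frame_embs = []
        for t in range(node_feats.shape[0]):
            h = node_feats[t]
            src, dst = edge_index
            m = self.msg(torch.cat([h[src], h[dst], edge_feats[t]], dim=-1))
            agg = torch.zeros(h.shape[0], m.shape[-1]).index_add_(0, dst, m)
            h = self.update(agg, h)                          # message-passing update
            frame_embs.append(h.mean(dim=0))                 # pool nodes -> frame embedding
        seq = torch.stack(frame_embs).unsqueeze(0)           # (1, T, node_dim)
        _, task_emb = self.temporal(seq)                     # (1, 1, hidden)
        return task_emb.transpose(0, 1)                      # memory for the decoder


class ActionDecoder(nn.Module):
    """Transformer decoder conditioned on the task embedding (used as memory)."""

    def __init__(self, n_actions, n_objects, hidden=128, motion_dim=6):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_head = nn.Linear(hidden, n_actions)
        self.object_head = nn.Linear(hidden, n_objects)
        self.motion_head = nn.Linear(hidden, motion_dim)     # e.g. 6-DoF pose delta

    def forward(self, prev_actions, task_emb):
        # prev_actions: (1, L) action-token ids (ground truth when teacher forcing)
        x = self.action_emb(prev_actions)
        L = x.shape[1]
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(x, task_emb, tgt_mask=causal)
        return self.action_head(h), self.object_head(h), self.motion_head(h)


def multitask_loss(action_logits, object_logits, motion_pred,
                   action_tgt, object_tgt, motion_tgt):
    """Cross-entropy for action/object classification plus regression for motion."""
    return (F.cross_entropy(action_logits.flatten(0, 1), action_tgt.flatten())
            + F.cross_entropy(object_logits.flatten(0, 1), object_tgt.flatten())
            + F.mse_loss(motion_pred, motion_tgt))
```

At inference time the same decoder runs autoregressively: the predicted action token is appended to the context and fed back in, matching the teacher‑forcing versus autoregressive split described in step 4.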
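For step 5, a hypothetical online control loop might look like the following. Every interface here (camera.read, detector.detect, build_scene_graph, controller.execute, and the encoder/decoder wrappers) is a placeholder for illustration, not an API from the paper or any specific library.

```python
# Hypothetical online control loop for deployment (step 5).
# All interfaces below are placeholders, not APIs from the paper or a real library.
def run_bimanual_episode(camera, detector, controller, encoder, decoder, max_steps=20):
    history = []                                   # action tokens executed so far
    scene_graphs = []                              # temporal scene graph built online
    for _ in range(max_steps):
        rgbd = camera.read()                       # RGB-D frame
        objects = detector.detect(rgbd)            # object classes + 6-DoF poses
        scene_graphs.append(build_scene_graph(objects))
        task_emb = encoder(scene_graphs)           # real-time task embedding
        action, involved, motion = decoder(history, task_emb)  # next action proposal
        if action == "DONE":                       # hypothetical termination token
            break
        controller.execute(action, involved, motion)  # low-level bimanual controller
        history.append(action)
```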

Results & Findings

| Metric | Sequence‑Only Baseline | Graph‑Based Model (Ours) |
| --- | --- | --- |
| Top‑1 Action Accuracy (high‑variability tasks) | 62 % | 78 % |
| Object‑Selection F1 | 55 % | 71 % |
| Motion Prediction MAE (cm) | 3.4 | 1.9 |
| Planning Horizon (steps correctly predicted) | 4 | 7 |

  • Robustness to Variability: When the same task is demonstrated with different object orders or hand‑over‑hand swaps, the graph model maintains high accuracy, whereas sequence models drop sharply.
  • Generalization to Unseen Objects: Because the encoder learns relational patterns rather than raw pixel sequences, it can extrapolate to new objects that share similar geometric roles (e.g., a different mug).
  • Real‑World Trial: On a dual‑arm platform, the robot successfully assembled a “plate‑and‑cutlery” setup from human demos, achieving an 85 % success rate over 30 trials, compared to 60 % for a flat‑sequence LSTM baseline.

Practical Implications

  • Reusable Task Abstractions: Developers can store an SGTG embedding once a task is demonstrated and reuse it across multiple robots or simulation environments without retraining the whole pipeline.
  • Plug‑and‑Play Planning: Since the decoder is action‑conditioned, it can be swapped with existing task‑level planners (e.g., behavior trees) that provide the “prompt” context.
  • Better Generalization for Home‑Robotics: Household robots often encounter novel object arrangements; a graph‑centric view lets them infer appropriate actions even when the exact sequence was never seen.
  • Scalable Data Collection: Human tele‑operation or video capture can be turned into scene graphs automatically (using off‑the‑shelf object detectors), reducing the need for hand‑crafted annotations.
  • Potential for Multi‑Agent Coordination: The same representation could be extended to coordinate multiple robots (or humans) by adding agent nodes and inter‑agent edges, opening doors for collaborative manufacturing or assistive care.

Limitations & Future Work

  • Reliance on Accurate Perception: The pipeline assumes reliable object detection and pose estimation; noisy sensors can corrupt the scene graph and degrade performance.
  • Fixed Graph Topology: Current graphs only model pairwise relations; higher‑order interactions (e.g., three‑object constraints) are not explicitly captured.
  • Scalability to Very Long Horizons: While the Transformer handles longer sequences better than RNNs, inference time grows with horizon length, which may be problematic for real‑time control in complex tasks.
  • Future Directions Suggested by the Authors:
    • Integrate uncertainty‑aware perception modules to make the graph robust to detection errors.
    • Explore hierarchical graph constructions that abstract groups of objects into “compound nodes.”
    • Combine the SGTG with reinforcement learning to fine‑tune the decoder’s action proposals based on actual execution feedback.

Bottom line: By marrying semantic task graphs with modern neural encoders/decoders, this work offers a practical pathway for developers to give robots a deeper, more flexible understanding of human demonstrations—moving us a step closer to truly adaptable, task‑agnostic manipulation systems.

Authors

  • Franziska Herbert
  • Vignesh Prasad
  • Han Liu
  • Dorothea Koert
  • Georgia Chalvatzaki

Paper Information

  • arXiv ID: 2601.11460v1
  • Categories: cs.RO, cs.LG
  • Published: January 16, 2026