[Paper] Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Published: January 5, 2026 at 01:55 PM EST
4 min read
Source: arXiv - 2601.02356v1

Overview

Talk2Move is a new reinforcement‑learning‑driven diffusion system that lets you move, rotate, or resize objects in an image simply by describing the desired change in natural language. By sidestepping the need for large collections of paired “before‑and‑after” images, the approach opens the door to more flexible, text‑driven scene editing tools that work at the level of individual objects rather than just overall style or color.

Key Contributions

  • GRPO (Group Relative Policy Optimization): A group‑based RL scheme that explores geometric actions (translation, rotation, scaling) through diverse rollouts generated from a single input image and lightweight textual prompts.
  • Spatial Reward Function: An object‑centric reward that directly measures displacement, rotation, and scaling consistency with the language instruction, providing interpretable feedback to the model.
  • Off‑policy Step Evaluation & Active Step Sampling: Techniques that focus learning on the most informative transformation stages, dramatically improving sample efficiency.
  • Diffusion‑based Generation without Paired Supervision: The system learns to edit geometry purely from unpaired data, eliminating the costly collection of annotated “before/after” pairs.
  • Benchmark Suite for Text‑Guided Geometric Editing: Curated datasets and evaluation metrics that quantify spatial accuracy, semantic fidelity, and overall scene coherence.

Methodology

Talk2Move builds on a diffusion model that generates images conditioned on both an input picture and a textual command (e.g., “move the chair 30 cm to the left”). The core loop works as follows:

  1. Action Space Definition: The model can apply three primitive geometric actions to any detected object: translate (Δx, Δy), rotate (θ), and scale (s).
  2. Policy Learning via GRPO: Instead of a single deterministic policy, GRPO samples a group of candidate actions, evaluates them with the spatial reward, and updates the policy based on the relative advantage of each action compared to the group mean. This reduces variance and encourages exploration of diverse transformations (see the group‑advantage sketch after this list).
  3. Spatial Reward Computation: After each action, a lightweight object detector extracts the updated bounding box and pose. The reward combines three terms:
    • Displacement error (distance between predicted and language‑specified translation)
    • Rotation error (angular deviation)
    • Scale error (relative size change)
      The reward is normalized to be interpretable (higher = better alignment); a minimal sketch of this computation appears after this list.
  4. Off‑policy Evaluation & Active Sampling: The system re‑uses past rollouts (off‑policy) to estimate the value of actions that were not taken, and it actively samples the steps expected to provide the highest learning signal (e.g., early‑stage large moves); a prioritized‑sampling sketch follows this list.
  5. Diffusion Decoding: The final transformed latent is passed through the diffusion decoder, producing a photorealistic image where the target object has been geometrically altered while the rest of the scene remains coherent.
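
To make the reward terms concrete, below is a minimal Python sketch. The function name `spatial_reward`, the soft tolerances, and the exponential squashing are illustrative assumptions rather than the paper's actual formulation, and the detector output is reduced to a dict holding the object's centre, orientation, and relative scale.

```python
import numpy as np

def spatial_reward(pred, target, w_t=1.0, w_r=1.0, w_s=1.0):
    """Toy object-centric reward combining displacement, rotation, and scale errors.

    pred / target are dicts with keys:
      'center': (x, y) object centre in pixels
      'angle' : orientation in degrees
      'scale' : relative size factor (1.0 = unchanged)
    Each term is squashed to (0, 1] so that higher means better alignment;
    the tolerances below are assumptions, not values from the paper.
    """
    # Displacement error: distance between predicted and instructed centres.
    disp_err = np.linalg.norm(np.asarray(pred["center"]) - np.asarray(target["center"]))
    # Rotation error: smallest angular deviation, wrapped to [0, 180] degrees.
    rot_err = abs((pred["angle"] - target["angle"] + 180.0) % 360.0 - 180.0)
    # Scale error: relative size mismatch.
    scale_err = abs(pred["scale"] - target["scale"]) / max(target["scale"], 1e-6)

    r_t = np.exp(-disp_err / 50.0)   # ~50 px soft tolerance (assumed)
    r_r = np.exp(-rot_err / 30.0)    # ~30 degree soft tolerance (assumed)
    r_s = np.exp(-scale_err / 0.25)  # ~25 % soft tolerance (assumed)
    return (w_t * r_t + w_r * r_r + w_s * r_s) / (w_t + w_r + w_s)
```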
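
The group‑relative update can be sketched just as compactly. The snippet shows only the advantage computation: each rollout's spatial reward is compared with the group mean and scaled by the group standard deviation. The policy‑gradient loss, clipping, and any KL regularization used in the paper are omitted, and the reward values are made up.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: compare each rollout's reward with the
    group mean and normalize by the group std (GRPO-style baseline)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A group of 6 candidate transformations sampled for one image/prompt,
# each scored with a spatial reward (illustrative values only).
rewards = [0.31, 0.62, 0.58, 0.12, 0.77, 0.45]
advantages = grpo_advantages(rewards)
# Above-average rollouts get positive advantages and are reinforced;
# below-average rollouts are discouraged.
```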
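
Active step sampling can be read as prioritized sampling over transformation stages. The sketch below picks steps with probability proportional to an assumed informativeness score, for instance the reward variance observed in earlier off‑policy rollouts; the paper's actual criterion is not reproduced here.

```python
import numpy as np

def sample_informative_steps(step_scores, k, temperature=1.0, rng=None):
    """Pick k step indices with probability given by a softmax over an
    estimated learning signal (generic prioritized-sampling sketch)."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(step_scores, dtype=np.float64)
    probs = np.exp((scores - scores.max()) / temperature)
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Assumed per-step reward variance from past rollouts (early large moves
# tend to be the most informative in this made-up example).
step_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.05]
chosen = sample_informative_steps(step_scores, k=3)
```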

Results & Findings

  • Spatial Accuracy: Talk2Move reduces average translation error by ~35 % and rotation error by ~28 % compared with the strongest text‑guided baselines (e.g., InstructPix2Pix, Text2LIVE).
  • Semantic Faithfulness: Human evaluators rated the edited images as “semantically correct” 92 % of the time, versus 71 % for competing methods.
  • Scene Coherence: The diffusion backbone preserves lighting, shadows, and occlusions, resulting in a 0.84 LPIPS similarity to ground‑truth edits (vs. 0.67 for baselines).
  • Efficiency: Thanks to off‑policy evaluation and active step sampling, the model converges in roughly half the training iterations required by a vanilla RL‑diffusion pipeline.

Practical Implications

  • Interactive Design Tools: UI/UX designers could embed Talk2Move into image editors, allowing rapid prototyping of layout changes (“move the sofa to the right”) without manual masking or 3D modeling.
  • Game Asset Adjustment: Game developers can programmatically reposition or resize objects in concept art or level mock‑ups via simple scripts that generate natural‑language commands.
  • AR/VR Scene Editing: Real‑time AR applications could let users verbally rearrange virtual furniture in a captured room, with the model handling occlusion and lighting consistency.
  • Data Augmentation: Synthetic geometric variations generated from textual descriptions can enrich training sets for downstream tasks like object detection or pose estimation (a purely illustrative command‑generation sketch follows this list).
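
As a purely hypothetical illustration of the augmentation idea, a few templated commands are enough to generate instructions in the style the model consumes; none of the templates, objects, or value ranges below come from the paper.

```python
import random

TEMPLATES = [
    "move the {obj} {dist} cm to the {direction}",
    "rotate the {obj} {angle} degrees {turn}",
    "make the {obj} {factor}x {size_word}",
]

def random_edit_command(obj, rng=random):
    """Generate one synthetic natural-language edit instruction for an object.
    Templates and value ranges are invented for illustration only."""
    factor, size_word = rng.choice([(0.5, "smaller"), (1.5, "larger"), (2, "larger")])
    return rng.choice(TEMPLATES).format(
        obj=obj,
        dist=rng.choice([10, 20, 30, 50]),
        direction=rng.choice(["left", "right"]),
        angle=rng.choice([15, 30, 45, 90]),
        turn=rng.choice(["clockwise", "counter-clockwise"]),
        factor=factor,
        size_word=size_word,
    )

# e.g. random_edit_command("chair") -> "move the chair 30 cm to the left"
```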

Limitations & Future Work

  • Object Detection Dependency: The quality of the spatial reward hinges on accurate bounding‑box and pose estimation; errors propagate to the RL loop.
  • Limited to Rigid Transformations: Current actions cover only translation, rotation, and uniform scaling; non‑rigid deformations (e.g., bending a lamp) remain out of scope.
  • Scalability to Complex Scenes: Performance degrades when many objects overlap heavily, as disentangling individual transformations becomes ambiguous.
  • Future Directions: The authors suggest integrating more expressive 3D‑aware representations, extending the action space to include deformation primitives, and exploring multimodal feedback (e.g., voice or gesture) to further reduce reliance on perfect object detection.

Authors

  • Jing Tan
  • Zhaoyang Zhang
  • Yantao Shen
  • Jiarui Cai
  • Shuo Yang
  • Jiajun Wu
  • Wei Xia
  • Zhuowen Tu
  • Stefano Soatto

Paper Information

  • arXiv ID: 2601.02356v1
  • Categories: cs.CV
  • Published: January 5, 2026