[Paper] TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
Source: arXiv - 2603.01714v1
Overview
The paper introduces TopoCurate, a new way to train AI agents that can use tools (e.g., virtual hands, APIs, or robotic manipulators). Instead of only looking at whether a rollout ends successfully, TopoCurate builds a topological map of how actions and observations interact across many attempts, allowing the trainer to pick the most informative experiences for both supervised fine‑tuning (SFT) and reinforcement learning (RL).
Key Contributions
- Interaction‑aware topology: Projects all rollouts of a task into a unified semantic graph that merges equivalent action‑observation states, turning scattered trajectories into a structured manifold.
- Dual‑selection curriculum:
  - SFT selector prefers trajectories that show error recovery, semantic efficiency, and strategic diversity, reducing covariate shift and mode collapse.
  - RL selector favors tasks with high “error‑branch” ratios and diverse strategies, boosting the gradient signal‑to‑noise ratio in sparse‑reward environments.
- Empirical gains: Demonstrates consistent improvements of +4.2% on SFT and +6.9% on RL benchmarks (BFCLv3, Tau2) compared with the strongest existing baselines.
- Open resources: Plans to release code, data, and the topology construction pipeline for community use.
Methodology
- Collect multi‑trial rollouts for each tool‑use task (e.g., “pick‑up‑cup”, “open‑door”).
- Semantic quotient projection:
  - Encode each (action, observation) pair with a pretrained language‑vision model.
  - Cluster pairs that are semantically equivalent (e.g., “grasp‑handle” vs. “grab‑handle”).
  - Merge clusters into nodes of a graph; edges represent temporal transitions.
  - The resulting graph captures how the agent’s interactions diverge into success or failure branches.
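The projection step can be sketched as follows. This is a minimal illustration, not the paper’s implementation: `toy_embed` is a character‑hash stand‑in for the pretrained language‑vision encoder, and the clustering is a simple greedy cosine‑threshold merge.

```python
import math
from collections import defaultdict

def toy_embed(step):
    """Character-hash stand-in for the paper's pretrained
    language-vision encoder (an assumption for illustration only)."""
    vec = [0.0] * 64
    for ch in f"{step[0]}|{step[1]}":
        vec[ord(ch) % 64] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def build_topology(rollouts, sim_threshold=0.9):
    """Greedily merge semantically similar (action, observation) steps
    into nodes; count temporal transitions as directed edges."""
    centroids = []  # one representative embedding per node

    def node_of(emb):
        for i, c in enumerate(centroids):
            if sum(a * b for a, b in zip(emb, c)) >= sim_threshold:
                return i            # close enough: reuse existing node
        centroids.append(emb)
        return len(centroids) - 1   # otherwise open a new node

    edges = defaultdict(int)
    for traj in rollouts:
        prev = None
        for step in traj:
            cur = node_of(toy_embed(step))
            if prev is not None:
                edges[(prev, cur)] += 1  # temporal transition
            prev = cur
    return len(centroids), dict(edges)

# Two identical rollouts of a door-opening task collapse onto the
# same two nodes, and the shared transition is counted twice.
traj = [("grasp-handle", "door closed"), ("pull", "door open")]
n_nodes, edges = build_topology([traj, list(traj)])
```

A real pipeline would replace the greedy merge with a proper clustering pass, but the quotient idea is the same: identical or near-identical steps share a node, so repeated trajectories fold onto one structure.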
- Dual‑selection mechanisms:
  - SFT selector traverses the graph to find paths that contain recovery loops (e.g., “failed‑grasp → adjust → retry”) and efficient sub‑paths (minimal redundant steps). It also enforces diversity by sampling from different strategic regions of the graph.
  - RL selector computes the proportion of edges that belong to failure branches (error‑branch ratio) and the entropy of the strategy distribution. Tasks with high ratios and entropy are chosen for RL updates, ensuring richer gradient signals.
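The two RL‑selection signals can be computed directly from the graph’s edge counts. The `alpha` weighting and the exact combination below are illustrative assumptions; the paper’s precise scoring rule may differ.

```python
import math

def error_branch_ratio(edges, failure_edges):
    """Fraction of graph transitions that lie on failure branches."""
    total = sum(edges.values())
    fail = sum(c for e, c in edges.items() if e in failure_edges)
    return fail / total if total else 0.0

def strategy_entropy(strategy_counts):
    """Shannon entropy (nats) of the per-task strategy distribution."""
    total = sum(strategy_counts.values())
    probs = [c / total for c in strategy_counts.values() if c]
    return -sum(p * math.log(p) for p in probs)

def rl_task_score(edges, failure_edges, strategy_counts, alpha=0.5):
    """Weighted mix of both signals; alpha is an illustrative knob,
    not a value taken from the paper."""
    return (alpha * error_branch_ratio(edges, failure_edges)
            + (1 - alpha) * strategy_entropy(strategy_counts))

edges = {(0, 1): 3, (1, 2): 1}           # transition counts
failures = {(1, 2)}                       # edge (1, 2) ends in failure
strategies = {"direct": 2, "retry": 2}    # two strategies, evenly used
score = rl_task_score(edges, failures, strategies)
```

Tasks with the highest scores (many failure branches, many distinct strategies) are the ones forwarded to the RL phase.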
- Training loop: Selected SFT trajectories fine‑tune the policy, then the RL selector supplies high‑signal tasks for policy‑gradient updates. The process iterates until convergence.
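In outline, one iteration of the curate‑then‑train loop might look like this. Every callable is a hypothetical stand‑in for the corresponding paper component, wired to trivial stubs purely to exercise the control flow.

```python
def topocurate_round(tasks, policy, *, collect, build, select_sft,
                     finetune, score, rl_update, k=2):
    """One curate-then-train iteration (sketch; all callables are
    hypothetical stand-ins, not the paper's actual components)."""
    graphs = {t: build(collect(policy, t)) for t in tasks}
    policy = finetune(policy, select_sft(graphs))   # SFT phase
    hard = sorted(tasks, key=lambda t: score(graphs[t]),
                  reverse=True)[:k]                 # highest-signal tasks
    return rl_update(policy, hard)                  # RL phase

# Trivial stubs: the "policy" is an int we bump per training phase,
# and a task's "graph" is just the length of its name.
picked = []
policy = topocurate_round(
    ["open-door", "pick-up-cup", "stack-blocks"], 0,
    collect=lambda pol, t: t,
    build=lambda rollout: len(rollout),
    select_sft=lambda graphs: sorted(graphs),
    finetune=lambda pol, trajs: pol + 1,
    score=lambda g: g,  # toy difficulty: longer task name
    rl_update=lambda pol, hard: (picked.extend(hard), pol + 1)[1],
)
```

The real system would iterate this round until convergence, re‑collecting rollouts and rebuilding the topology as the policy improves.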
Results & Findings
| Benchmark | Baseline SFT (pass rate) | TopoCurate SFT (pass rate) | Baseline RL (avg. reward) | TopoCurate RL (avg. reward) |
|---|---|---|---|---|
| BFCLv3 | 71.3% | 75.5% (+4.2) | 0.42 | 0.48 (+6.9) |
| Tau2 | 68.7% | 72.9% (+4.2) | 0.38 | 0.51 (+6.9) |
- Higher success rates on SFT indicate that the curated trajectories teach the model more robust recovery behaviors.
- Improved RL rewards show that the error‑branch‑rich tasks provide stronger learning signals, mitigating the classic sparse‑reward problem.
- Ablation studies confirm that both the topology projection and the dual‑selection criteria contribute additively to the gains.
Practical Implications
- More reliable tool‑use agents: Developers building virtual assistants, game AI, or robotic controllers can obtain policies that gracefully recover from mistakes instead of simply “getting lucky.”
- Reduced data waste: By automatically filtering out redundant or trivial rollouts, training pipelines become more sample‑efficient, cutting compute costs.
- Curriculum design for RL: The error‑branch ratio metric offers a simple, interpretable way to prioritize challenging tasks, which can be plugged into existing RL frameworks (e.g., OpenAI Gym, RLlib).
- Cross‑domain applicability: The topology construction works with any modality where actions and observations can be embedded (text, vision, proprioception), making it suitable for multimodal tool‑use scenarios such as code‑generation agents that invoke APIs.
Limitations & Future Work
- Scalability of graph construction: The clustering step can become expensive for very long horizons or massive datasets; approximate clustering or streaming graph updates are needed.
- Dependence on embedding quality: Semantic equivalence relies on pretrained encoders; domain‑specific vocabularies may require fine‑tuning of those encoders.
- Benchmarks limited to simulated environments: Real‑world robotic validation is still pending, and the authors note that sensor noise may affect topology stability.
- Future directions include extending TopoCurate to hierarchical tool‑use (nested sub‑tasks), integrating human‑in‑the‑loop feedback for topology refinement, and exploring continual‑learning setups where the topology evolves over time.
Authors
- Jinluan Yang
- Yuxin Liu
- Zhengyu Chen
- Chengcheng Han
- Yueqing Sun
- Qi Gu
- Hui Su
- Xunliang Cai
- Fei Wu
- Kun Kuang
Paper Information
- arXiv ID: 2603.01714v1
- Categories: cs.LG, cs.CL
- Published: March 2, 2026