[Paper] TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
Source: arXiv - 2603.01714v1
Overview
The paper introduces TopoCurate, a new way to train AI agents that can use tools (e.g., virtual hands, APIs, or robotic manipulators). Instead of only looking at whether a rollout ends successfully, TopoCurate builds a topological map of how actions and observations interact across many attempts, allowing the trainer to pick the most informative experiences for both supervised fine‑tuning (SFT) and reinforcement learning (RL).
Key Contributions
- Interaction‑aware topology: Projects all rollouts of a task into a unified semantic graph that merges equivalent action‑observation states, turning scattered trajectories into a structured manifold.
- Dual‑selection curriculum:
  - SFT selector prefers trajectories that show error recovery, semantic efficiency, and strategic diversity, reducing covariate shift and mode collapse.
  - RL selector favors tasks with high “error‑branch” ratios and diverse strategies, boosting the gradient signal‑to‑noise ratio in sparse‑reward environments.
- Empirical gains: Demonstrates consistent improvements of +4.2% on SFT and +6.9% on RL benchmarks (BFCLv3, Tau2) compared with the strongest existing baselines.
- Open resources: Plans to release code, data, and the topology construction pipeline for community use.
Methodology
- Collect multi‑trial rollouts for each tool‑use task (e.g., “pick‑up‑cup”, “open‑door”).
- Semantic quotient projection:
  - Encode each (action, observation) pair with a pretrained language‑vision model.
  - Cluster pairs that are semantically equivalent (e.g., “grasp‑handle” vs. “grab‑handle”).
  - Merge clusters into nodes of a graph; edges represent temporal transitions.
  - The resulting graph captures how the agent’s interactions diverge into success or failure branches.
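The projection step can be sketched as follows. This is a minimal illustration, not the paper’s implementation: `toy_embed` is a character‑hash stand‑in for the pretrained language‑vision encoder, and the clustering is a simple greedy cosine‑threshold merge.

```python
import math
from collections import defaultdict

def toy_embed(step):
    """Character-hash stand-in for the paper's pretrained
    language-vision encoder (an assumption for illustration only)."""
    vec = [0.0] * 64
    for ch in f"{step[0]}|{step[1]}":
        vec[ord(ch) % 64] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def build_topology(rollouts, sim_threshold=0.9):
    """Greedily merge semantically similar (action, observation) steps
    into nodes; count temporal transitions as directed edges."""
    centroids = []  # one representative embedding per node

    def node_of(emb):
        for i, c in enumerate(centroids):
            if sum(a * b for a, b in zip(emb, c)) >= sim_threshold:
                return i            # close enough: reuse existing node
        centroids.append(emb)
        return len(centroids) - 1   # otherwise open a new node

    edges = defaultdict(int)
    for traj in rollouts:
        prev = None
        for step in traj:
            cur = node_of(toy_embed(step))
            if prev is not None:
                edges[(prev, cur)] += 1  # temporal transition
            prev = cur
    return len(centroids), dict(edges)

# Two identical rollouts of a door-opening task collapse onto the
# same two nodes, and the shared transition is counted twice.
traj = [("grasp-handle", "door closed"), ("pull", "door open")]
n_nodes, edges = build_topology([traj, list(traj)])
```

A real pipeline would replace the greedy merge with a proper clustering pass, but the quotient idea is the same: identical or near-identical steps share a node, so repeated trajectories fold onto one structure.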
- Dual‑selection mechanisms:
  - SFT selector traverses the graph to find paths that contain recovery loops (e.g., “failed‑grasp → adjust → retry”) and efficient sub‑paths (minimal redundant steps). It also enforces diversity by sampling from different strategic regions of the graph.
  - RL selector computes the proportion of edges that belong to failure branches (error‑branch ratio) and the entropy of the strategy distribution. Tasks with high ratios and entropy are chosen for RL updates, ensuring richer gradient signals.
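The two RL‑selection signals can be computed directly from the graph’s edge counts. The `alpha` weighting and the exact combination below are illustrative assumptions; the paper’s precise scoring rule may differ.

```python
import math

def error_branch_ratio(edges, failure_edges):
    """Fraction of graph transitions that lie on failure branches."""
    total = sum(edges.values())
    fail = sum(c for e, c in edges.items() if e in failure_edges)
    return fail / total if total else 0.0

def strategy_entropy(strategy_counts):
    """Shannon entropy (nats) of the per-task strategy distribution."""
    total = sum(strategy_counts.values())
    probs = [c / total for c in strategy_counts.values() if c]
    return -sum(p * math.log(p) for p in probs)

def rl_task_score(edges, failure_edges, strategy_counts, alpha=0.5):
    """Weighted mix of both signals; alpha is an illustrative knob,
    not a value taken from the paper."""
    return (alpha * error_branch_ratio(edges, failure_edges)
            + (1 - alpha) * strategy_entropy(strategy_counts))

edges = {(0, 1): 3, (1, 2): 1}           # transition counts
failures = {(1, 2)}                       # edge (1, 2) ends in failure
strategies = {"direct": 2, "retry": 2}    # two strategies, evenly used
score = rl_task_score(edges, failures, strategies)
```

Tasks with the highest scores (many failure branches, many distinct strategies) are the ones forwarded to the RL phase.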
- Training loop: Selected SFT trajectories fine‑tune the policy, then the RL selector supplies high‑signal tasks for policy‑gradient updates. The process iterates until convergence.
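In outline, one iteration of the curate‑then‑train loop might look like this. Every callable is a hypothetical stand‑in for the corresponding paper component, wired to trivial stubs purely to exercise the control flow.

```python
def topocurate_round(tasks, policy, *, collect, build, select_sft,
                     finetune, score, rl_update, k=2):
    """One curate-then-train iteration (sketch; all callables are
    hypothetical stand-ins, not the paper's actual components)."""
    graphs = {t: build(collect(policy, t)) for t in tasks}
    policy = finetune(policy, select_sft(graphs))   # SFT phase
    hard = sorted(tasks, key=lambda t: score(graphs[t]),
                  reverse=True)[:k]                 # highest-signal tasks
    return rl_update(policy, hard)                  # RL phase

# Trivial stubs: the "policy" is an int we bump per training phase,
# and a task's "graph" is just the length of its name.
picked = []
policy = topocurate_round(
    ["open-door", "pick-up-cup", "stack-blocks"], 0,
    collect=lambda pol, t: t,
    build=lambda rollout: len(rollout),
    select_sft=lambda graphs: sorted(graphs),
    finetune=lambda pol, trajs: pol + 1,
    score=lambda g: g,  # toy difficulty: longer task name
    rl_update=lambda pol, hard: (picked.extend(hard), pol + 1)[1],
)
```

The real system would iterate this round until convergence, re‑collecting rollouts and rebuilding the topology as the policy improves.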
Results & Findings
| Benchmark | Baseline SFT (pass rate) | TopoCurate SFT (pass rate) | Baseline RL (avg. reward) | TopoCurate RL (avg. reward) |
|---|---|---|---|---|
| BFCLv3 | 71.3% | 75.5% (+4.2) | 0.42 | 0.48 (+6.9) |
| Tau2 | 68.7% | 72.9% (+4.2) | 0.38 | 0.51 (+6.9) |
- Higher success rates on SFT indicate that the curated trajectories teach the model more robust recovery behaviors.
- Improved RL rewards show that the error‑branch‑rich tasks provide stronger learning signals, mitigating the classic sparse‑reward problem.
- Ablation studies confirm that both the topology projection and the dual‑selection criteria contribute additively to the gains.
Practical Implications
- More reliable tool‑use agents: Developers building virtual assistants, game AI, or robotic controllers can obtain policies that gracefully recover from mistakes instead of simply “getting lucky.”
- Reduced data waste: By automatically filtering out redundant or trivial rollouts, training pipelines become more sample‑efficient, cutting compute costs.
- Curriculum design for RL: The error‑branch ratio metric offers a simple, interpretable way to prioritize challenging tasks, which can be plugged into existing RL frameworks (e.g., OpenAI Gym, RLlib).
- Cross‑domain applicability: The topology construction works with any modality where actions and observations can be embedded (text, vision, proprioception), making it suitable for multimodal tool‑use scenarios such as code‑generation agents that invoke APIs.
Limitations & Future Work
- Scalability of graph construction: The clustering step can become expensive for very long horizons or massive datasets; approximate clustering or streaming graph updates are needed.
- Dependence on embedding quality: Semantic equivalence relies on pretrained encoders; domain‑specific vocabularies may require fine‑tuning of those encoders.
- Benchmarks limited to simulated environments: Real‑world robotic validation is still pending, and the authors note that sensor noise may affect topology stability.
- Future directions include extending TopoCurate to hierarchical tool‑use (nested sub‑tasks), integrating human‑in‑the‑loop feedback for topology refinement, and exploring continual‑learning setups where the topology evolves over time.
Authors
- Jinluan Yang
- Yuxin Liu
- Zhengyu Chen
- Chengcheng Han
- Yueqing Sun
- Qi Gu
- Hui Su
- Xunliang Cai
- Fei Wu
- Kun Kuang
Paper Information
- arXiv ID: 2603.01714v1
- Categories: cs.LG, cs.CL
- Published: March 2, 2026