[Paper] Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence
Source: arXiv - 2605.06230v1
Overview
The paper introduces Safactory, a unified, scalable “agent factory” that ties together simulation, data management, and continuous learning for autonomous AI agents. By stitching these pieces into a single pipeline, the authors aim to make it easier to evaluate, improve, and trust large‑model agents that operate over long horizons and interact with real‑world tools.
Key Contributions
- Parallel Simulation Platform – Generates massive, diverse interaction trajectories in parallel, enabling high‑throughput testing of long‑horizon decision making.
- Trustworthy Data Platform – Stores raw trajectories, extracts structured experiences, and attaches provenance/quality metadata for systematic risk analysis.
- Autonomous Evolution Platform – Runs asynchronous reinforcement‑learning (RL) loops and on‑policy distillation, turning collected experiences into continuously upgraded models.
- Unified Evolutionary Pipeline – Presented by the authors as the first framework to couple simulation, data curation, and model evolution end‑to‑end, supporting closed‑loop improvement of trustworthy agents.
- Scalability Demonstration – Shows the system can handle millions of simulated episodes across heterogeneous compute clusters without manual orchestration.
Methodology
- Massively Parallel Simulations – Safactory launches thousands of sandboxed environments (e.g., web browsers, tool‑using APIs) on a distributed cluster. Each environment runs an autonomous agent that follows a policy and logs its full action‑state trajectory.
- Experience Extraction & Curation – The raw logs are ingested by the Trustworthy Data Platform, which parses them into “experiences” (state, action, reward, tool usage) and tags each with reliability signals (e.g., simulation fidelity, safety violations).
- Closed‑Loop Learning – The Autonomous Evolution Platform pulls curated experiences into an RL trainer and runs asynchronous updates in three parts:
  - Policy Gradient / PPO on the collected on‑policy data.
  - Distillation of the updated policy back into a smaller, more deployable model.
  - Safety Filters that reject updates that increase measured risk metrics.
- Iterative Feedback – Updated models are automatically redeployed to the simulation fleet, creating a continuous loop of generation → evaluation → improvement.
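The extraction and safety-filter steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Experience` schema, the tag names, and the `accept_update` rule with its `tolerance` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experience:
    """One curated step. Field names are illustrative, not the paper's schema."""
    state: str
    action: str
    reward: float
    tool: str
    tags: Dict[str, float] = field(default_factory=dict)


def extract_experiences(raw_log: List[dict], fidelity: float) -> List[Experience]:
    """Parse raw trajectory steps into experiences tagged with reliability signals."""
    experiences = []
    for step in raw_log:
        experiences.append(
            Experience(
                state=step["state"],
                action=step["action"],
                reward=step.get("reward", 0.0),
                tool=step.get("tool", "none"),
                tags={
                    "simulation_fidelity": fidelity,
                    "safety_violation": 1.0 if step.get("violation", False) else 0.0,
                },
            )
        )
    return experiences


def violation_rate(experiences: List[Experience]) -> float:
    """Fraction of experiences flagged as safety violations."""
    if not experiences:
        return 0.0
    return sum(e.tags["safety_violation"] for e in experiences) / len(experiences)


def accept_update(current_risk: float, candidate_risk: float, tolerance: float = 0.0) -> bool:
    """Hypothetical safety filter: reject a candidate whose measured risk rose."""
    return candidate_risk <= current_risk + tolerance


# Toy raw log standing in for one sandboxed episode.
raw_log = [
    {"state": "s0", "action": "open_page", "reward": 0.0, "tool": "browser"},
    {"state": "s1", "action": "submit_form", "reward": 1.0, "tool": "browser", "violation": True},
]
experiences = extract_experiences(raw_log, fidelity=0.9)
print(violation_rate(experiences))
print(accept_update(current_risk=0.018, candidate_risk=0.004))
```

Keeping the risk signal as a tag on each experience lets the same records feed both the trainer and the gate that decides whether an update is accepted.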
The whole stack is orchestrated via a lightweight task scheduler and containerized services, making it portable across cloud providers or on‑prem clusters.
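To make the fan-out pattern concrete, here is a rough sketch in which a thread pool stands in for the distributed fleet of sandboxed environments. The toy environment, `run_episode`, `run_fleet`, and `toy_policy` are hypothetical names for illustration only; the paper's scheduler and container layer are not reproduced here.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Trajectory = List[Tuple[int, str]]  # logged (state, action) pairs


def run_episode(policy: Callable[[int], str], seed: int, horizon: int = 5) -> Trajectory:
    """Roll out one episode in a toy environment and log every step."""
    rng = random.Random(seed)  # per-episode RNG keeps each run reproducible
    state = 0
    trajectory: Trajectory = []
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action))
        state += rng.randint(1, 3)  # stand-in for real environment dynamics
    return trajectory


def run_fleet(policy: Callable[[int], str], num_episodes: int, max_workers: int = 8) -> List[Trajectory]:
    """Fan episodes out over a worker pool; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda seed: run_episode(policy, seed), range(num_episodes)))


def toy_policy(state: int) -> str:
    """Placeholder policy: call a tool on even states, otherwise wait."""
    return "call_tool" if state % 2 == 0 else "wait"


trajectories = run_fleet(toy_policy, num_episodes=100)
print(len(trajectories), "trajectories collected")
```

In a real deployment each worker would be a container on a cluster node rather than a thread, but the shape is the same: a scheduler hands out episode specifications and gathers the logged trajectories.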
Results & Findings
| Metric | Baseline (single‑agent pipeline) | Safactory (full pipeline) |
|---|---|---|
| Episodes per day (≈) | 10 K | 2.3 M |
| Average task success rate (long‑horizon) | 62 % | 78 % |
| Detected safety violation rate (share of episodes) | 1.8 % | 0.4 % |
| Model improvement latency (days) | 7 | 1.2 |
- Throughput boost: Parallel simulation yielded roughly 230× more episodes per day (2.3 M vs. 10 K), dramatically accelerating RL updates.
- Performance lift: Agents trained in the closed loop solved more complex multi‑step tasks (e.g., multi‑tool workflows) than those trained on static datasets.
- Risk reduction: The Trustworthy Data Platform’s safety tags enabled the evolution engine to filter out harmful policy updates, cutting violation rates by ~78 %.
These numbers illustrate that a tightly coupled pipeline can both speed up learning and reduce measured safety violations.
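The headline ratios can be re-derived from the table with a short calculation. The snippet below uses only the figures reported above; the variable names are ours.

```python
# Values copied from the results table.
baseline_episodes, safactory_episodes = 10_000, 2_300_000      # episodes per day
baseline_violation, safactory_violation = 1.8, 0.4             # percent
baseline_latency, safactory_latency = 7.0, 1.2                 # days

throughput_ratio = safactory_episodes / baseline_episodes
violation_reduction = (baseline_violation - safactory_violation) / baseline_violation
latency_speedup = baseline_latency / safactory_latency

print(f"throughput: {throughput_ratio:.0f}x")               # 230x
print(f"violation reduction: {violation_reduction:.1%}")    # 77.8%
print(f"latency speedup: {latency_speedup:.1f}x")           # 5.8x
```

The 230× throughput ratio and the roughly 78 % relative drop in violations match the figures quoted in the bullets.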
Practical Implications
- Accelerated product development – Companies building AI assistants, autonomous bots, or tool‑using agents could shorten each improvement cycle from about a week to about a day, in line with the reported drop in model improvement latency from 7 days to 1.2.
- Continuous compliance – By embedding safety metrics into the data platform, organizations can maintain audit trails and automatically enforce regulatory constraints during model updates.
- Cost‑effective scaling – The modular, container‑based design lets teams spin up additional simulation workers on spot instances, achieving high throughput without massive capital expense.
- Plug‑and‑play for existing models – Safactory’s APIs accept any language model that can be wrapped as an “agent policy,” making it straightforward to retrofit legacy systems with a trustworthy evolution loop.
- Foundation for industry standards – A unified pipeline could become a reference implementation for benchmark suites (e.g., AgentBench) and for sharing reproducible evaluation data across firms.
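As a rough illustration of what wrapping a language model as an "agent policy" might look like, the sketch below defines a small interface and an adapter. `AgentPolicy`, `LanguageModelPolicy`, `rollout`, and the prompt format are hypothetical; the summary does not describe Safactory's actual API, so none of these names should be read as its real interface.

```python
from typing import Callable, List, Protocol


class AgentPolicy(Protocol):
    """Hypothetical interface: map an observation to the next action."""

    def act(self, observation: str) -> str: ...


class LanguageModelPolicy:
    """Adapts any text-in/text-out callable to the AgentPolicy shape."""

    def __init__(self, generate: Callable[[str], str], system_prompt: str = "") -> None:
        self._generate = generate
        self._system_prompt = system_prompt

    def act(self, observation: str) -> str:
        # Assumed prompt layout; a real system would use its own template.
        prompt = f"{self._system_prompt}\nObservation: {observation}\nAction:"
        return self._generate(prompt).strip()


def rollout(policy: AgentPolicy, observations: List[str]) -> List[str]:
    """Query the policy once per observation and collect the actions."""
    return [policy.act(obs) for obs in observations]


def fake_model(prompt: str) -> str:
    """Stand-in for a legacy model's generate function."""
    return " search " if "weather" in prompt else " noop "


policy = LanguageModelPolicy(fake_model, system_prompt="You are a tool-using agent.")
actions = rollout(policy, ["weather in Paris?", "hello"])
print(actions)
```

A structural interface like `Protocol` means an existing model only needs an `act` method to participate, which is one way the "plug-and-play" idea could be realized.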
Limitations & Future Work
- Simulation fidelity – The current sandbox environments are still approximations of the real world; gaps may cause “reality drift” when deploying to physical systems.
- Resource heterogeneity – While the scheduler handles mixed CPU/GPU clusters, extreme scale (hundreds of GPUs) can expose bottlenecks in the data ingestion layer.
- Safety metric design – The paper relies on handcrafted risk signals; learning more nuanced safety representations remains an open challenge.
- Generalization to non‑tool‑using agents – The framework is optimized for agents that invoke external tools; extending it to pure perception‑action loops (e.g., robotics) will require additional sensor simulators.
Future work outlined by the authors includes tighter integration with real‑world testbeds, automated safety metric discovery via meta‑learning, and open‑sourcing the platform to foster community‑driven extensions.
Authors
- Xinquan Chen
- Zhenyun Yin
- Shan He
- Bin Huang
- Shanzhe Lei
- Pengcheng Shi
- Kun Cai
- Bei Chen
- Bangwei Liu
- Zeyu Kang
- Chao Huang
- Yang Zhang
- Wenjie Li
- Ruijun Ge
- Yajie Wang
- Tianshun Fang
- Tianyang Xu
- Yiwen Cong
- Meng Jin
- Gaolei Li
- Xuansheng Wu
- Linhan Liu
- Zijing He
- An Li
- Yan Teng
- Xin Tan
- ChaoChao Lu
- Ji He
- Jie Li
- Chunfeng Song
- Jinya Xu
- Fan Song
- Shujie Wang
- Jianmin Qian
- Jie Hou
- Xuhong Wang
- Yingchun Wang
- Hui Wang
- Xia Hu
Paper Information
- arXiv ID: 2605.06230v1
- Categories: cs.AI, cs.DC
- Published: May 7, 2026