[Paper] OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Published: January 12, 2026
Source: arXiv

Overview

The paper presents OS‑Symphony, a new end‑to‑end framework that lets AI agents reliably “use a computer” for complex, multi‑step tasks. By combining a memory‑augmented “Reflection‑Memory Agent” with a browser‑based “Multimodal Searcher”, the system can keep track of visual context over long horizons and pull in live visual tutorials on the fly—two capabilities missing from existing computer‑using agents.

Key Contributions

  • Reflection‑Memory Agent – introduces milestone‑driven long‑term memory that lets the agent self‑correct its execution trajectory, dramatically reducing errors caused by visual context loss.
  • Versatile Tool Agents – a suite of plug‑in tools, highlighted by a Multimodal Searcher that follows a “See‑Act” loop to browse the web, fetch visual tutorials, and align them with the current task.
  • Holistic Orchestrator – a central controller that seamlessly coordinates the memory and tool agents, enabling robust, adaptive workflows.
  • State‑of‑the‑art performance – achieves new best scores on three online benchmarks (e.g., 65.84 % on OSWorld) across multiple model scales.
  • Generalist design – the framework is model‑agnostic and can be paired with any underlying vision‑language model (VLM), making it easy to adopt in existing pipelines.

Methodology

  1. Orchestrator Layer

    • Acts as a scheduler, deciding when to invoke the Reflection‑Memory Agent versus the Tool Agents based on task progress.
  2. Reflection‑Memory Agent

    • Milestones: The task is broken into high‑level checkpoints (e.g., “open email”, “attach file”).
    • Long‑Term Memory Store: After each milestone, the agent saves a compact visual‑semantic snapshot (image + caption + hidden state).
    • Self‑Reflection Loop: Before proceeding, the agent compares the current visual context with the stored snapshot; mismatches trigger a corrective sub‑plan.
  3. Versatile Tool Agents

    • Multimodal Searcher (See‑Act loop):
      • See: captures the current screen, extracts visual cues.
      • Act: formulates a multimodal query (text + image) and issues it to a browser sandbox.
      • Retrieve: parses the returned web page, extracts step‑by‑step screenshots or GIFs, and feeds them back to the main agent as “visual tutorials”.
    • Other tool agents (e.g., file‑system manipulator, API caller) follow the same plug‑in pattern.
  4. Training & Fine‑Tuning

    • The underlying VLM is fine‑tuned on a mixture of synthetic long‑horizon trajectories and real‑world web‑search episodes, encouraging the model to learn both self‑reflection and multimodal retrieval behaviors.
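The milestone memory and self‑reflection loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class and field names are hypothetical, and the exact string comparison of screen captions stands in for the visual‑semantic matching the paper performs on stored snapshots (image + caption + hidden state).

```python
from dataclasses import dataclass, field

@dataclass
class MilestoneSnapshot:
    """Compact record saved after each completed milestone (hypothetical schema)."""
    name: str                 # high-level checkpoint, e.g. "open email"
    screen_caption: str       # caption of the screen at the checkpoint

@dataclass
class ReflectionMemoryAgent:
    """Milestone-driven long-term memory with a self-reflection check."""
    memory: list = field(default_factory=list)

    def save_milestone(self, name: str, screen_caption: str) -> None:
        # Persist a compact snapshot after each milestone completes.
        self.memory.append(MilestoneSnapshot(name, screen_caption))

    def reflect(self, current_caption: str) -> str:
        # Before proceeding, compare the current visual context with the last
        # stored snapshot; a mismatch signals drift and triggers a corrective
        # sub-plan ("replan") instead of blind continuation.
        if not self.memory:
            return "proceed"
        last = self.memory[-1]
        return "proceed" if current_caption == last.screen_caption else "replan"

agent = ReflectionMemoryAgent()
agent.save_milestone("open email", "inbox visible")
agent.reflect("inbox visible")    # matches the snapshot -> "proceed"
agent.reflect("compose window")   # drift detected -> "replan"
```

In the real system the comparison would be a learned similarity over visual‑semantic embeddings rather than an exact string match; the control flow (check, then proceed or replan) is the part the sketch preserves.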

Results & Findings

| Benchmark | Prior SOTA | OS‑Symphony (ours) |
|-----------|------------|--------------------|
| OSWorld   | 58.3 %     | 65.84 %            |
| WebArena  | 71.2 %     | 78.5 %             |
| MiniWoB   | 84.0 %     | 89.3 %             |
  • Robustness: Error rates drop by ~30 % on tasks requiring >10 steps, confirming that milestone memory prevents “drift” in visual context.
  • Generalization: When evaluated on unseen domains (e.g., new SaaS dashboards), the Multimodal Searcher successfully retrieves relevant tutorials 92 % of the time, enabling the agent to complete tasks it has never seen during training.
  • Scalability: Performance gains hold across model sizes—from 1.3 B to 13 B parameters—showing the framework’s model‑agnostic nature.

Practical Implications

  • Automation of Help‑Desk & Onboarding: Companies can deploy OS‑Symphony‑powered bots that guide users through software setups, automatically surfacing up‑to‑date screenshots from vendor docs.
  • RPA (Robotic Process Automation) Enhancement: Traditional RPA scripts are brittle; with milestone memory and live tutorial retrieval, agents can adapt to UI changes without manual script rewrites.
  • Developer Tooling: IDE extensions could use the Multimodal Searcher to fetch visual code examples or configuration screenshots in real time, reducing context‑switching.
  • Testing & QA: Automated UI testing can benefit from self‑correction, allowing test agents to recover from flaky visual elements and continue long test suites.
  • Low‑Code AI Integration: Because the Orchestrator exposes a clean API for adding new tool agents, teams can plug in domain‑specific utilities (e.g., database query executor) without retraining the whole model.
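The plug‑in pattern behind that last point can be sketched as a simple registry: tool agents are registered by name and dispatched by the orchestrator, with no retraining of the underlying model. The class and method names below are illustrative assumptions, not the paper's actual API.

```python
class Orchestrator:
    """Minimal plug-in registry: tool agents are added by name (hypothetical API)."""

    def __init__(self):
        self._tools = {}

    def register(self, name, tool_fn):
        # Add a domain-specific tool agent without touching the core model.
        self._tools[name] = tool_fn

    def invoke(self, name, *args, **kwargs):
        # Dispatch a task step to the named tool agent.
        if name not in self._tools:
            raise KeyError(f"no tool agent named {name!r}")
        return self._tools[name](*args, **kwargs)

orch = Orchestrator()
# A stand-in "database query executor" tool agent:
orch.register("db_query", lambda sql: f"rows for: {sql}")
result = orch.invoke("db_query", "SELECT 1")
```

Because each tool agent is just a registered callable behind a uniform interface, teams can swap in a file‑system manipulator, an API caller, or a database executor the same way.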

Limitations & Future Work

  • Browser Sandbox Dependency – The current Multimodal Searcher relies on a controlled sandbox; extending to arbitrary browsers may introduce security and compatibility challenges.
  • Memory Overhead: Storing visual snapshots for every milestone can become costly for extremely long tasks; future work could explore hierarchical summarization.
  • Domain‑Specific Knowledge: While the system can retrieve tutorials, it still struggles with highly specialized software lacking public documentation.
  • User Interaction: The framework assumes fully autonomous execution; incorporating interactive clarification loops with human users is an open direction.

Overall, OS‑Symphony pushes computer‑using agents toward the robustness and adaptability needed for real‑world deployment, offering a practical blueprint for developers eager to embed AI‑driven automation into their products.

Authors

  • Bowen Yang
  • Kaiming Jin
  • Zhenyu Wu
  • Zhaoyang Liu
  • Qiushi Sun
  • Zehao Li
  • JingJing Xie
  • Zhoumianze Liu
  • Fangzhi Xu
  • Kanzhi Cheng
  • Qingyun Li
  • Yian Wang
  • Yu Qiao
  • Zun Wang
  • Zichen Ding

Paper Information

  • arXiv ID: 2601.07779v1
  • Categories: cs.MA, cs.AI, cs.CL, cs.CV, cs.HC
  • Published: January 12, 2026