[Paper] OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Published: January 12, 2026
Source: arXiv

Overview

The paper presents OS‑Symphony, a new end‑to‑end framework that lets AI agents reliably “use a computer” for complex, multi‑step tasks. By combining a memory‑augmented “Reflection‑Memory Agent” with a browser‑based “Multimodal Searcher”, the system can keep track of visual context over long horizons and pull in live visual tutorials on the fly—two capabilities missing from existing computer‑using agents.

Key Contributions

  • Reflection‑Memory Agent – introduces milestone‑driven long‑term memory that lets the agent self‑correct its execution trajectory, dramatically reducing errors caused by visual context loss.
  • Versatile Tool Agents – a suite of plug‑in tools, highlighted by a Multimodal Searcher that follows a “See‑Act” loop to browse the web, fetch visual tutorials, and align them with the current task.
  • Holistic Orchestrator – a central controller that seamlessly coordinates the memory and tool agents, enabling robust, adaptive workflows.
  • State‑of‑the‑art performance – achieves new best scores on three online benchmarks (e.g., 65.84 % on OSWorld) across multiple model scales.
  • Generalist design – the framework is model‑agnostic and can be paired with any underlying vision‑language model (VLM), making it easy to adopt in existing pipelines.

Methodology

  1. Orchestrator Layer

    • Acts as a scheduler, deciding when to invoke the Reflection‑Memory Agent versus the Tool Agents based on task progress.
  2. Reflection‑Memory Agent

    • Milestones: The task is broken into high‑level checkpoints (e.g., “open email”, “attach file”).
    • Long‑Term Memory Store: After each milestone, the agent saves a compact visual‑semantic snapshot (image + caption + hidden state).
    • Self‑Reflection Loop: Before proceeding, the agent compares the current visual context with the stored snapshot; mismatches trigger a corrective sub‑plan.
  3. Versatile Tool Agents

    • Multimodal Searcher (See‑Act loop):
      • See: captures the current screen, extracts visual cues.
      • Act: formulates a multimodal query (text + image) and issues it to a browser sandbox.
      • Retrieve: parses the returned web page, extracts step‑by‑step screenshots or GIFs, and feeds them back to the main agent as “visual tutorials”.
    • Other tool agents (e.g., file‑system manipulator, API caller) follow the same plug‑in pattern.
  4. Training & Fine‑Tuning

    • The underlying VLM is fine‑tuned on a mixture of synthetic long‑horizon trajectories and real‑world web‑search episodes, encouraging the model to learn both self‑reflection and multimodal retrieval behaviors.
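The milestone memory and self‑reflection loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class and field names are hypothetical, and the exact string comparison of screen captions stands in for the visual‑semantic matching the paper performs on stored snapshots (image + caption + hidden state).

```python
from dataclasses import dataclass, field

@dataclass
class MilestoneSnapshot:
    """Compact record saved after each completed milestone (hypothetical schema)."""
    name: str                 # high-level checkpoint, e.g. "open email"
    screen_caption: str       # caption of the screen at the checkpoint

@dataclass
class ReflectionMemoryAgent:
    """Milestone-driven long-term memory with a self-reflection check."""
    memory: list = field(default_factory=list)

    def save_milestone(self, name: str, screen_caption: str) -> None:
        # Persist a compact snapshot after each milestone completes.
        self.memory.append(MilestoneSnapshot(name, screen_caption))

    def reflect(self, current_caption: str) -> str:
        # Before proceeding, compare the current visual context with the last
        # stored snapshot; a mismatch signals drift and triggers a corrective
        # sub-plan ("replan") instead of blind continuation.
        if not self.memory:
            return "proceed"
        last = self.memory[-1]
        return "proceed" if current_caption == last.screen_caption else "replan"

agent = ReflectionMemoryAgent()
agent.save_milestone("open email", "inbox visible")
agent.reflect("inbox visible")    # matches the snapshot -> "proceed"
agent.reflect("compose window")   # drift detected -> "replan"
```

In the real system the comparison would be a learned similarity over visual‑semantic embeddings rather than an exact string match; the control flow (check, then proceed or replan) is the part the sketch preserves.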

Results & Findings

| Benchmark | Prior SOTA | OS‑Symphony (ours) |
|-----------|------------|--------------------|
| OSWorld   | 58.3 %     | 65.84 %            |
| WebArena  | 71.2 %     | 78.5 %             |
| MiniWoB   | 84.0 %     | 89.3 %             |
  • Robustness: Error rates drop by ~30 % on tasks requiring >10 steps, confirming that milestone memory prevents “drift” in visual context.
  • Generalization: When evaluated on unseen domains (e.g., new SaaS dashboards), the Multimodal Searcher successfully retrieves relevant tutorials 92 % of the time, enabling the agent to complete tasks it has never seen during training.
  • Scalability: Performance gains hold across model sizes—from 1.3 B to 13 B parameters—showing the framework’s model‑agnostic nature.

Practical Implications

  • Automation of Help‑Desk & Onboarding: Companies can deploy OS‑Symphony‑powered bots that guide users through software setups, automatically surfacing up‑to‑date screenshots from vendor docs.
  • RPA (Robotic Process Automation) Enhancement: Traditional RPA scripts are brittle; with milestone memory and live tutorial retrieval, agents can adapt to UI changes without manual script rewrites.
  • Developer Tooling: IDE extensions could use the Multimodal Searcher to fetch visual code examples or configuration screenshots in real time, reducing context‑switching.
  • Testing & QA: Automated UI testing can benefit from self‑correction, allowing test agents to recover from flaky visual elements and continue long test suites.
  • Low‑Code AI Integration: Because the Orchestrator exposes a clean API for adding new tool agents, teams can plug in domain‑specific utilities (e.g., database query executor) without retraining the whole model.
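The plug‑in pattern behind that last point can be sketched as a simple registry: tool agents are registered by name and dispatched by the orchestrator, with no retraining of the underlying model. The class and method names below are illustrative assumptions, not the paper's actual API.

```python
class Orchestrator:
    """Minimal plug-in registry: tool agents are added by name (hypothetical API)."""

    def __init__(self):
        self._tools = {}

    def register(self, name, tool_fn):
        # Add a domain-specific tool agent without touching the core model.
        self._tools[name] = tool_fn

    def invoke(self, name, *args, **kwargs):
        # Dispatch a task step to the named tool agent.
        if name not in self._tools:
            raise KeyError(f"no tool agent named {name!r}")
        return self._tools[name](*args, **kwargs)

orch = Orchestrator()
# A stand-in "database query executor" tool agent:
orch.register("db_query", lambda sql: f"rows for: {sql}")
result = orch.invoke("db_query", "SELECT 1")
```

Because each tool agent is just a registered callable behind a uniform interface, teams can swap in a file‑system manipulator, an API caller, or a database executor the same way.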

Limitations & Future Work

  • Browser Sandbox Dependency – The current Multimodal Searcher relies on a controlled sandbox; extending to arbitrary browsers may introduce security and compatibility challenges.
  • Memory Overhead: Storing visual snapshots for every milestone can become costly for extremely long tasks; future work could explore hierarchical summarization.
  • Domain‑Specific Knowledge: While the system can retrieve tutorials, it still struggles with highly specialized software lacking public documentation.
  • User Interaction: The framework assumes fully autonomous execution; incorporating interactive clarification loops with human users is an open direction.

Overall, OS‑Symphony pushes computer‑using agents toward the robustness and adaptability needed for real‑world deployment, offering a practical blueprint for developers eager to embed AI‑driven automation into their products.

Authors

  • Bowen Yang
  • Kaiming Jin
  • Zhenyu Wu
  • Zhaoyang Liu
  • Qiushi Sun
  • Zehao Li
  • JingJing Xie
  • Zhoumianze Liu
  • Fangzhi Xu
  • Kanzhi Cheng
  • Qingyun Li
  • Yian Wang
  • Yu Qiao
  • Zun Wang
  • Zichen Ding

Paper Information

  • arXiv ID: 2601.07779v1
  • Categories: cs.MA, cs.AI, cs.CL, cs.CV, cs.HC
  • Published: January 12, 2026