[Paper] Toward Autonomous Long-Horizon Engineering for ML Research
Source: arXiv - 2604.13018v1
Overview
The paper presents AiScientist, a new framework that lets autonomous agents carry out multi‑day, end‑to‑end machine‑learning research projects. By combining a hierarchical orchestrator with a “File‑as‑Bus” workspace that preserves durable artifacts (code, data, analysis, plans), the system keeps a coherent state across many subtasks, something prior agents struggled with.
Key Contributions
- Hierarchical orchestration: A top‑level Orchestrator guides the workflow while specialized agents handle concrete subtasks (data prep, model coding, experiment running, debugging).
- File‑as‑Bus workspace: All agents read/write a shared, permission‑scoped file system that acts as the single source of truth, ensuring state continuity over hours or days.
- State‑driven re‑grounding: Agents repeatedly re‑evaluate the latest artifacts instead of relying on fleeting conversational context, enabling “thin control over thick state.”
- Benchmark improvements: On the PaperBench suite, AiScientist lifts the average score by 10.54 points over the strongest baseline; on MLE‑Bench Lite it attains an Any‑Medal rate of 81.82%.
- Ablation evidence: Removing the File‑as‑Bus protocol drops performance by 6.41 points (PaperBench) and 31.82 percentage points (MLE‑Bench Lite), confirming its central role.
Methodology
- Orchestrator Layer – Maintains a high‑level roadmap (e.g., “understand problem → set up environment → implement model → run experiments → debug”). It produces concise summaries and a workspace map that tell downstream agents what files they may read/write.
- Specialized Agents – Each agent is a language‑model‑driven tool (e.g., a code generator, a data‑loader, a debugger). When invoked, an agent re‑grounds on the current workspace contents: it loads the latest analysis, plan, or experiment logs, then produces or updates files accordingly.
- File‑as‑Bus Protocol – The workspace is a hierarchical directory with explicit read/write permissions. Files are the only communication channel; there is no hidden “conversation memory.” This design forces every piece of knowledge to be persisted as a durable artifact.
- Iterative Loop – The orchestrator monitors progress, updates the roadmap, and triggers agents until a stopping condition (e.g., target metric reached or time budget exhausted) is met.
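The File‑as‑Bus idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `ScopedWorkspace` class, the directory names (`plans`, `code`), and the prefix-based permission check are all hypothetical, chosen only to show how agents might exchange information purely through files while each one is limited to its own read/write scope.

```python
import tempfile
from pathlib import Path


class ScopedWorkspace:
    """Hypothetical permission-scoped view of a shared workspace directory."""

    def __init__(self, root, readable, writable):
        self.root = Path(root).resolve()
        # Scopes are directory prefixes relative to the workspace root.
        self.writable = tuple(writable)
        # Anything an agent may write, it may also read back.
        self.readable = tuple(readable) + self.writable

    def _resolve(self, rel_path, scopes, action):
        path = (self.root / rel_path).resolve()
        rel = path.relative_to(self.root)  # ValueError if the path escapes the root
        if not any(rel == Path(s) or Path(s) in rel.parents for s in scopes):
            raise PermissionError(f"{action} not allowed for {rel_path}")
        return path

    def read(self, rel_path):
        return self._resolve(rel_path, self.readable, "read").read_text()

    def write(self, rel_path, text):
        path = self._resolve(rel_path, self.writable, "write")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)


def demo():
    with tempfile.TemporaryDirectory() as root:
        # The orchestrator reads everything but writes only plans.
        orchestrator = ScopedWorkspace(root, readable=["."], writable=["plans"])
        # The coding agent reads plans and writes only code.
        coder = ScopedWorkspace(root, readable=["plans"], writable=["code"])

        orchestrator.write("plans/roadmap.md", "1. implement model\n")
        plan = coder.read("plans/roadmap.md")
        coder.write("code/model.py", f"# implements: {plan.strip()}\n")

        try:
            coder.write("plans/roadmap.md", "overwritten")
            blocked = False
        except PermissionError:
            blocked = True
        return plan, blocked, orchestrator.read("code/model.py")


if __name__ == "__main__":
    print(demo())
```

The only way the coding agent learns about the plan here is by reading `plans/roadmap.md`, which is the property the protocol is described as enforcing.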
The whole pipeline is implemented with off‑the‑shelf LLM APIs and a lightweight file‑system wrapper, making it reproducible on standard cloud VMs.
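The iterative loop and the re-grounding step described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `state.json` and `experiments.log` file names, the function names, and the stub agent that bumps a fake metric in place of a real LLM call. The point is only that each iteration rebuilds its state from disk and persists the result before moving on.

```python
import json
import tempfile
from pathlib import Path


def load_state(workspace):
    """Re-ground: rebuild the working state from files, not from chat history."""
    state_file = workspace / "state.json"
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"done": [], "metric": 0.0}


def save_state(workspace, state):
    (workspace / "state.json").write_text(json.dumps(state, indent=2))


def run_experiment_agent(workspace, state):
    """Stand-in for an LLM-driven agent; it only bumps a fake metric and logs it."""
    state["metric"] = round(state["metric"] + 0.25, 2)
    with (workspace / "experiments.log").open("a") as log:
        log.write(f"run {len(state['done']) + 1}: metric={state['metric']}\n")
    return "run_experiment"


def orchestrate(workspace, target_metric, max_steps):
    workspace = Path(workspace)
    for _ in range(max_steps):                 # step budget stands in for a time budget
        state = load_state(workspace)          # fresh read on every iteration
        if state["metric"] >= target_metric:   # stopping condition: target reached
            break
        subtask = run_experiment_agent(workspace, state)
        state["done"].append(subtask)
        save_state(workspace, state)           # persist before the next step
    return load_state(workspace)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(orchestrate(tmp, target_metric=0.7, max_steps=10))
```

Because nothing is carried between iterations except what is on disk, killing the process and calling `orchestrate` again on the same directory would continue from the last saved step.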
Results & Findings
| Benchmark | Baseline (best) | AiScientist | Δ |
|---|---|---|---|
| PaperBench (score) | 68.3 | 78.8 | +10.54 points |
| MLE‑Bench Lite (Any Medal) | 50.0% | 81.82% | +31.82 percentage points |
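- Ablation: Turning off the File‑as‑Bus (agents communicate only via prompts) reduces PaperBench to 72.4 and MLE‑Bench Lite to 50%. The MLE‑Bench Lite figure falls back to the best baseline, while the PaperBench score stays above it, so durable state accounts for a large share of the gain but not all of it.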
- Error analysis showed that most failures after ablation stemmed from lost context (e.g., forgetting a hyper‑parameter tweak made earlier).
- Scalability test: Extending a single experiment from 2 hours to 24 hours showed linear growth in completed subtasks, suggesting that the orchestrator can sustain long‑horizon runs without drift.
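The reported numbers can be cross-checked with a few lines of arithmetic. The scores below are copied from this summary; the task count of 22 for MLE‑Bench Lite is an assumption that does not appear in the text above, included only because it makes the percentages come out as whole numbers of medals.

```python
# Scores as reported in this summary (rounded values from the table and ablation).
paperbench = {"baseline": 68.3, "full": 78.8, "ablated": 72.4}
mle_lite = {"baseline": 50.0, "full": 81.82, "ablated": 50.0}


def gap(scores, reference):
    """Difference between the full system and a reference configuration."""
    return round(scores["full"] - scores[reference], 2)


ASSUMED_TASKS = 22  # assumption: number of MLE-Bench Lite tasks, not stated above


def medal_count(rate_percent, n_tasks=ASSUMED_TASKS):
    """Whole number of medal-winning tasks implied by a percentage."""
    return round(rate_percent / 100 * n_tasks)


if __name__ == "__main__":
    print("PaperBench gain over baseline:", gap(paperbench, "baseline"))
    print("PaperBench loss from ablation:", gap(paperbench, "ablated"))
    print("MLE-Bench Lite gain (pp):", gap(mle_lite, "baseline"))
    print("Implied medals, full vs. ablated:",
          medal_count(mle_lite["full"]), medal_count(mle_lite["ablated"]))
```

The rounded table values give gaps of 10.5 and 6.4, which agree with the reported 10.54 and 6.41 up to rounding. Under the 22-task assumption, 81.82% and 50% would correspond to 18 and 11 medal-winning tasks.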
Practical Implications
- Accelerated prototyping – Teams can offload repetitive engineering (environment setup, boilerplate code, routine hyper‑parameter sweeps) to AiScientist, freeing researchers to focus on high‑level ideas.
- Continuous integration for research – The File‑as‑Bus model mirrors a CI pipeline: every change is versioned, reproducible, and auditable, easing collaboration across distributed labs.
- Cost‑effective cloud usage – By persisting state, the system can pause and resume jobs, allowing spot‑instance usage without losing progress.
- Educational tooling – New ML engineers can watch the generated workspace evolve, gaining insight into best‑practice research workflows.
- Foundation for autonomous AI labs – The hierarchical + durable‑state design can be plugged into larger “AI‑run‑AI” ecosystems, where one system designs experiments and another executes them reliably.
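The pause-and-resume point can be made concrete with a small sketch of a hyper-parameter sweep that survives interruption. The names (`run_sweep`, `sweep.json`), the `budget` parameter used to simulate a preempted spot instance, and the toy `train` function are all hypothetical; the paper summary does not describe how checkpoints are actually written.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(checkpoint):
    """Return results of already-completed configs, or an empty dict on first run."""
    return json.loads(checkpoint.read_text()) if checkpoint.exists() else {}


def save_done(checkpoint, done):
    # Write to a temporary file and rename it, so an interruption
    # never leaves a half-written checkpoint behind.
    tmp = checkpoint.with_suffix(".tmp")
    tmp.write_text(json.dumps(done))
    os.replace(tmp, checkpoint)


def run_sweep(checkpoint, learning_rates, train, budget=None):
    """Run pending configs; `budget` caps how many run before a simulated preemption."""
    done = load_done(checkpoint)
    ran = 0
    for lr in learning_rates:
        key = str(lr)
        if key in done:
            continue                      # finished in an earlier session
        if budget is not None and ran >= budget:
            break                         # simulated instance shutdown
        done[key] = train(lr)
        save_done(checkpoint, done)       # persist after every completed config
        ran += 1
    return done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "sweep.json"
        rates = [0.1, 0.01, 0.001]
        toy_train = lambda lr: lr * 2     # placeholder for a real training run
        print(run_sweep(ckpt, rates, toy_train, budget=2))  # interrupted session
        print(run_sweep(ckpt, rates, toy_train))            # resumed session
```

The second call skips the two finished configurations and runs only the remaining one, which is the behavior that would make interruptible instances practical.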
Limitations & Future Work
- Dependency on LLM reliability – The agents still inherit hallucination risks; occasional incorrect code requires human oversight.
- File‑system bottleneck – Large datasets or model checkpoints can strain the simple file‑as‑bus; future work could integrate object stores or version‑control backends.
- Domain specificity – Benchmarks focus on standard supervised learning tasks; extending to reinforcement learning, multimodal pipelines, or hardware‑specific optimizations remains open.
- Scalability of orchestration – While the current orchestrator handles a single project, coordinating dozens of concurrent projects will need more sophisticated scheduling and resource management.
The authors suggest exploring richer artifact types (e.g., notebooks, Docker images) and tighter integration with automated debugging tools as next steps.
Authors
- Guoxin Chen
- Jie Chen
- Lei Chen
- Jiale Zhao
- Fanzhe Meng
- Wayne Xin Zhao
- Ruihua Song
- Cheng Chen
- Ji‑Rong Wen
- Kai Jia
Paper Information
- arXiv ID: 2604.13018v1
- Categories: cs.CL
- Published: April 14, 2026