[Paper] Toward Autonomous Long-Horizon Engineering for ML Research

Published: April 14, 2026 at 01:55 PM EDT

Source: arXiv - 2604.13018v1

Overview

The paper presents AiScientist, a new framework that lets autonomous agents carry out multi‑day, end‑to‑end machine‑learning research projects. By combining a hierarchical orchestrator with a “File‑as‑Bus” workspace that preserves durable artifacts (code, data, analysis, plans), the system keeps a coherent state across many subtasks, which prior agents struggled to do.

Key Contributions

Hierarchical orchestration: A top‑level Orchestrator guides the workflow while specialized agents handle concrete subtasks (data prep, model coding, experiment running, debugging).
File‑as‑Bus workspace: All agents read/write a shared, permission‑scoped file system that acts as the single source of truth, ensuring state continuity over hours or days.
State‑driven re‑grounding: Agents repeatedly re‑evaluate the latest artifacts instead of relying on fleeting conversational context, enabling “thin control over thick state.”
Benchmark improvements: On the PaperBench suite, AiScientist lifts the average score by 10.54 points over the strongest baseline; on MLE‑Bench Lite it attains 81.82 % Any‑Medal.
Ablation evidence: Removing the File‑as‑Bus protocol drops performance by 6.41 points on PaperBench and by 31.82 percentage points (Any‑Medal rate) on MLE‑Bench Lite, confirming its central role.

Methodology

Orchestrator Layer – Maintains a high‑level roadmap (e.g., “understand problem → set up environment → implement model → run experiments → debug”). It produces concise summaries and a workspace map that tell downstream agents what files they may read/write.
Specialized Agents – Each agent is a language‑model‑driven tool (e.g., a code generator, a data‑loader, a debugger). When invoked, an agent re‑grounds on the current workspace contents: it loads the latest analysis, plan, or experiment logs, then produces or updates files accordingly.
File‑as‑Bus Protocol – The workspace is a hierarchical directory with explicit read/write permissions. Files are the only communication channel; there is no hidden “conversation memory.” This design forces every piece of knowledge to be persisted as a durable artifact.
Iterative Loop – The orchestrator monitors progress, updates the roadmap, and triggers agents until a stopping condition (e.g., target metric reached or time budget exhausted) is met.
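
The File‑as‑Bus idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Workspace` class, its `read`/`write` methods, and the directory names (`plans/`, `code/`, `logs/`) are hypothetical, chosen only to show how path‑prefix permissions and re‑grounding from disk might fit together.

```python
import tempfile
from pathlib import Path


class Workspace:
    """Hypothetical permission-scoped view of a shared directory."""

    def __init__(self, root, readable, writable):
        self.root = Path(root).resolve()
        # Permissions are relative path prefixes such as "plans/" (illustrative).
        self.readable = tuple(readable)
        self.writable = tuple(writable)

    def _resolve(self, rel_path, allowed):
        path = (self.root / rel_path).resolve()
        try:
            rel = path.relative_to(self.root).as_posix()
        except ValueError:
            # The path escaped the workspace root (e.g. via "..").
            raise PermissionError(f"{rel_path} is outside the workspace") from None
        if not any(rel.startswith(prefix) for prefix in allowed):
            raise PermissionError(f"{rel} is outside this agent's scope")
        return path

    def read(self, rel_path):
        return self._resolve(rel_path, self.readable).read_text(encoding="utf-8")

    def write(self, rel_path, text):
        path = self._resolve(rel_path, self.writable)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        orchestrator = Workspace(tmp, readable=["plans/", "logs/"], writable=["plans/"])
        coder = Workspace(tmp, readable=["plans/"], writable=["code/", "logs/"])

        orchestrator.write("plans/roadmap.md", "1. implement model\n")
        # Re-grounding: the agent loads the latest plan from disk,
        # not from any conversational memory.
        plan = coder.read("plans/roadmap.md")
        coder.write("logs/coder.md", f"read plan: {plan.strip()}\n")

        try:
            coder.write("plans/roadmap.md", "overwritten")
            blocked = False
        except PermissionError:
            blocked = True  # the coder may read plans but not change them
        return orchestrator.read("logs/coder.md"), blocked
```

In this sketch the two agents never exchange messages directly; everything one agent learns from the other arrives through a file it is allowed to read.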

The whole pipeline is implemented with off‑the‑shelf LLM APIs and a lightweight file‑system wrapper, making it reproducible on standard cloud VMs.
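
As a rough illustration of the orchestrator loop with its two stopping conditions (target metric reached or budget exhausted), the following sketch uses a stub in place of an LLM‑driven agent. The file name `state.json`, the function names, and the fixed +5 metric increment are assumptions made for the example, not details from the paper.

```python
import json
import tempfile
from pathlib import Path


def run_experiment(workspace: Path) -> None:
    """Stub agent: re-reads state from disk, then writes an updated metric."""
    state_file = workspace / "state.json"
    state = json.loads(state_file.read_text(encoding="utf-8"))
    state["metric"] += 5  # stand-in for a real training run improving the score
    state["history"].append("run_experiment")
    state_file.write_text(json.dumps(state), encoding="utf-8")


def orchestrate(workspace: Path, target: int, max_steps: int) -> dict:
    """Trigger the agent until the target is met or the step budget runs out."""
    state_file = workspace / "state.json"
    if not state_file.exists():
        state_file.write_text(json.dumps({"metric": 0, "history": []}), encoding="utf-8")

    for _ in range(max_steps):  # budget-based stopping condition
        state = json.loads(state_file.read_text(encoding="utf-8"))
        if state["metric"] >= target:  # metric-based stopping condition
            break
        run_experiment(workspace)

    return json.loads(state_file.read_text(encoding="utf-8"))


def demo(target: int = 15, max_steps: int = 10) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        return orchestrate(Path(tmp), target=target, max_steps=max_steps)
```

The loop itself holds no state between iterations: each pass reloads the file, so the control logic stays thin while the workspace carries the accumulated progress.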

Results & Findings

Benchmark	Best baseline	AiScientist	Δ
PaperBench (score)	68.3	78.8	+10.54 points
MLE‑Bench Lite (Any‑Medal rate)	50.0 %	81.82 %	+31.82 percentage points

Ablation: Turning off the File‑as‑Bus (agents communicate only via prompts) reduces PaperBench to 72.4 and the MLE‑Bench Lite Any‑Medal rate to 50 %, indicating that durable state is a key performance driver.
Error analysis showed that most failures after ablation stemmed from lost context (e.g., forgetting a hyper‑parameter tweak made earlier).
Scalability test: Extending a single experiment from 2 hours to 24 hours showed linear growth in completed subtasks, suggesting that the orchestrator can sustain long‑horizon runs without drift.
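
The ablation figures can be cross‑checked against the headline numbers with simple arithmetic. The sketch below uses only values quoted in this summary (the table values appear to be rounded to one decimal, so small discrepancies are expected); the dictionary layout and function name are illustrative.

```python
# Values as quoted in this summary; keys and structure are illustrative.
REPORTED = {
    "PaperBench": {"baseline": 68.3, "full": 78.8, "ablation_drop": 6.41},
    "MLE-Bench Lite": {"baseline": 50.0, "full": 81.82, "ablation_drop": 31.82},
}


def ablated_scores(rows):
    """Subtract the ablation drop from the full score and compare to baseline."""
    out = {}
    for name, r in rows.items():
        ablated = round(r["full"] - r["ablation_drop"], 2)
        out[name] = {
            "ablated": ablated,
            "margin_over_baseline": round(ablated - r["baseline"], 2),
        }
    return out


if __name__ == "__main__":
    for name, row in ablated_scores(REPORTED).items():
        print(name, row)
```

With these inputs, the ablated PaperBench score comes out at 72.39, matching the quoted 72.4 and still about 4 points above the best baseline, while the ablated MLE‑Bench Lite rate falls back to exactly the baseline's 50 %.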

Practical Implications

Accelerated prototyping – Teams can offload repetitive engineering (environment setup, boilerplate code, routine hyper‑parameter sweeps) to AiScientist, freeing researchers to focus on high‑level ideas.
Continuous integration for research – The File‑as‑Bus model mirrors a CI pipeline: every change is versioned, reproducible, and auditable, easing collaboration across distributed labs.
Cost‑effective cloud usage – By persisting state, the system can pause and resume jobs, allowing spot‑instance usage without losing progress.
Educational tooling – New ML engineers can watch the generated workspace evolve, gaining insight into best‑practice research workflows.
Foundation for autonomous AI labs – The hierarchical + durable‑state design can be plugged into larger “AI‑run‑AI” ecosystems, where one system designs experiments and another executes them reliably.
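
The pause‑and‑resume point can be made concrete with a small sketch: if progress is persisted after every subtask, an interrupted run (for example, a preempted spot instance) can continue from the last completed step. The subtask names, the `progress.json` file, and the `interrupt_after` parameter are hypothetical and exist only to simulate an interruption.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subtask list, not taken from the paper.
SUBTASKS = ["setup_env", "implement_model", "run_experiments", "write_report"]


def run(workspace: Path, interrupt_after=None) -> list:
    """Run pending subtasks, persisting progress after each one."""
    progress_file = workspace / "progress.json"
    if progress_file.exists():
        done = json.loads(progress_file.read_text(encoding="utf-8"))
    else:
        done = []

    executed = []
    for task in SUBTASKS:
        if task in done:
            continue  # completed in an earlier session, skip it
        if interrupt_after is not None and len(executed) >= interrupt_after:
            break  # simulate the machine being preempted
        executed.append(task)  # a real agent would do the work here
        done.append(task)
        # Write to a temporary file and rename it, so an interruption
        # cannot leave a half-written progress file behind.
        tmp_file = progress_file.with_suffix(".tmp")
        tmp_file.write_text(json.dumps(done), encoding="utf-8")
        tmp_file.replace(progress_file)
    return executed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        first = run(workspace, interrupt_after=2)  # session cut short
        second = run(workspace)                    # resumed session
        return first, second
```

Here the second session picks up only the subtasks the first one did not finish, because the list of completed work lives in the workspace rather than in process memory.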

Limitations & Future Work

Dependency on LLM reliability – The agents still inherit hallucination risks; occasional incorrect code requires human oversight.
File‑system bottleneck – Large datasets or model checkpoints can strain the simple file‑as‑bus; future work could integrate object stores or version‑control backends.
Domain specificity – Benchmarks focus on standard supervised learning tasks; extending to reinforcement learning, multimodal pipelines, or hardware‑specific optimizations remains open.
Scalability of orchestration – While the current orchestrator handles a single project, coordinating dozens of concurrent projects will need more sophisticated scheduling and resource management.

The authors suggest exploring richer artifact types (e.g., notebooks, Docker images) and tighter integration with automated debugging tools as next steps.

Authors

Guoxin Chen
Jie Chen
Lei Chen
Jiale Zhao
Fanzhe Meng
Wayne Xin Zhao
Ruihua Song
Cheng Chen
Ji‑Rong Wen
Kai Jia

Paper Information

arXiv ID: 2604.13018v1
Categories: cs.CL
Published: April 14, 2026
