[Paper] From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation
Source: arXiv:2604.21910v1
Overview
The paper introduces a three‑layer “agentic AI” architecture that lets scientists describe a research question in plain English and automatically receive a fully‑specified, reproducible workflow ready to run on modern orchestration platforms (e.g., Kubernetes). By separating the semantic interpretation of the query from the deterministic workflow generation, the system bridges the long‑standing gap between high‑level scientific intent and low‑level execution engines.
Key Contributions
- Agentic AI pipeline that splits the problem into a semantic LLM layer, a deterministic workflow‑generation layer, and a knowledge‑base “Skills” layer.
- Skills framework: markdown‑based, human‑authorable modules that encode domain vocabularies, parameter constraints, and optimization heuristics.
- Deterministic workflow DAGs: once the intent is extracted, the same input always yields the same reproducible workflow graph.
- Empirical validation on a real‑world population‑genetics pipeline (1000 Genomes) using Hyperflow on Kubernetes, showing near‑real‑time query handling.
- Ablation study on 150 natural‑language queries demonstrating a jump from 44 % to 83 % full‑match intent accuracy when Skills are used.
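To make the Skills idea concrete, here is a minimal sketch of intent validation against a Skill. In the paper, Skills are human-authored markdown files; here a parsed Skill is represented as a plain dictionary, and all field names, the `validate_intent` helper, and the constraint shapes are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch: checking an LLM-extracted intent against a parsed
# "Skill" module. The Skill's vocabulary and parameter ranges bound what
# the semantic layer is allowed to request.

# A Skill (markdown in the paper) parsed into a small constraint table:
gwas_skill = {
    "task": "gwas",
    "allowed_datasets": {"1000-genomes"},
    "parameters": {
        "chromosome": {"type": int, "min": 1, "max": 22},
    },
}

def validate_intent(intent: dict, skill: dict) -> list[str]:
    """Return a list of problems; an empty list means the intent passes."""
    errors = []
    if intent.get("dataset") not in skill["allowed_datasets"]:
        errors.append(f"unknown dataset: {intent.get('dataset')}")
    for name, rule in skill["parameters"].items():
        value = intent.get(name)
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif not (rule["min"] <= value <= rule["max"]):
            errors.append(f"{name}: {value} outside [{rule['min']}, {rule['max']}]")
    return errors

intent = {"task": "gwas", "dataset": "1000-genomes", "chromosome": 22}
print(validate_intent(intent, gwas_skill))  # []
```

An intent that names an unknown dataset or an out-of-range chromosome would return a non-empty error list, which is where the system's "correcting or rejecting ambiguous parts" behavior hooks in.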
Methodology
- Semantic Layer (LLM) – A large language model receives the user’s natural‑language question and produces a structured intent (e.g., “run a GWAS on chromosome 22 using the 1000 Genomes dataset”).
- Knowledge Layer (Skills) – Domain experts write markdown “Skill” files that map scientific terminology to concrete workflow components, define allowed parameter ranges, and suggest performance‑tuning strategies. The system validates the LLM’s intent against these Skills, correcting or rejecting ambiguous parts.
- Deterministic Layer – A rule‑based generator consumes the validated intent and the relevant Skills to emit a directed‑acyclic graph (DAG) that conforms to the Hyperflow workflow description language. Because this step is purely rule‑based, identical intents always produce identical DAGs.
- Execution – The generated DAG is submitted to Hyperflow, which schedules containers on a Kubernetes cluster. The pipeline measures total latency, LLM inference cost, and data‑movement overhead.
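The determinism property of the third layer can be sketched in a few lines: a purely rule-based generator maps a validated intent to a DAG, so equal intents always yield byte-identical graphs. The task names, DAG shape, and fingerprinting helper below are illustrative assumptions; the actual system emits Hyperflow workflow descriptions.

```python
import hashlib
import json

# Hypothetical sketch of the deterministic layer: no randomness, no LLM,
# just rules from intent to workflow graph.

def generate_dag(intent: dict) -> dict:
    """Rule-based mapping from a validated intent to a workflow DAG."""
    chrom = intent["chromosome"]
    return {
        "tasks": [
            {"name": "fetch", "args": [intent["dataset"], f"chr{chrom}"], "deps": []},
            {"name": "qc", "args": [], "deps": ["fetch"]},
            {"name": "gwas", "args": [f"chr{chrom}"], "deps": ["qc"]},
        ]
    }

def dag_fingerprint(dag: dict) -> str:
    # Canonical JSON (sorted keys) so identical DAGs hash identically.
    return hashlib.sha256(json.dumps(dag, sort_keys=True).encode()).hexdigest()

a = generate_dag({"dataset": "1000-genomes", "chromosome": 22})
b = generate_dag({"dataset": "1000-genomes", "chromosome": 22})
assert dag_fingerprint(a) == dag_fingerprint(b)  # same intent, same DAG
```

This is what allows labs to share a query (or its fingerprint) instead of a workflow script: regenerating from the same intent reproduces the exact same graph.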
Results & Findings
| Metric | Baseline (no Skills) | With Skills |
|---|---|---|
| Full‑match intent accuracy | 44 % | 83 % |
| Data transferred per query | – | 92 % reduction (deferred generation avoids unnecessary intermediate files) |
| End‑to‑end latency (incl. LLM) | – | < 15 s per query |
| Cost per query (LLM inference) | – | ≈ $0.001 |
The study shows that the Skills layer not only boosts semantic understanding but also yields substantial runtime savings by pruning unnecessary data movement. The overall system remains lightweight enough for on‑demand scientific queries without incurring prohibitive cloud costs.
Practical Implications
- Rapid prototyping: Researchers can spin up complex analyses (e.g., GWAS, RNA‑seq pipelines) by typing a single sentence, dramatically shortening the “idea‑to‑experiment” cycle.
- Reproducibility as a service: Because the deterministic layer guarantees identical DAGs for the same intent, labs can share queries instead of bulky workflow scripts, ensuring consistent results across sites.
- Cost‑effective cloud usage: The sub‑cent‑per‑query price point makes it feasible to expose scientific workflows as SaaS endpoints for internal platforms or public portals.
- Lowered expertise barrier: Non‑engineer scientists no longer need deep knowledge of Kubernetes, container orchestration, or workflow DSLs; the Skills layer encapsulates that expertise.
- Extensible ecosystem: New domains (e.g., climate modeling, drug discovery) can be onboarded simply by authoring additional Skills, enabling a plug‑and‑play expansion of the system.
Limitations & Future Work
- Skill authoring overhead: While markdown Skills are lightweight, creating and maintaining high‑quality Skill libraries still requires domain experts and can become a bottleneck for niche fields.
- LLM reliance for intent extraction: Errors in the semantic layer (e.g., ambiguous phrasing) can propagate downstream; the current system mitigates this with validation but does not eliminate it.
- Scalability to massive DAGs: The evaluation focused on a single‑node genomics workflow; future work should test the architecture on multi‑stage, multi‑petabyte pipelines.
- Security & provenance: Automated generation of workflows raises concerns about unintentional data leakage or misuse; integrating fine‑grained access controls and audit trails is an open research direction.
Overall, the paper demonstrates a promising path toward truly “natural‑language‑driven” scientific computing, turning research questions into reproducible, cloud‑native workflows with minimal human friction.
Authors
- Bartosz Balis
- Michal Orzechowski
- Piotr Kica
- Michal Dygas
- Michal Kuszewski
Paper Information
- arXiv ID: 2604.21910v1
- Categories: cs.AI
- Published: April 23, 2026