[Paper] From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation
Source: arXiv:2604.21910v1
Overview
The paper introduces a three‑layer “agentic AI” architecture that lets scientists describe a research question in plain English and automatically receive a fully‑specified, reproducible workflow ready to run on modern orchestration platforms (e.g., Kubernetes). By separating the semantic interpretation of the query from the deterministic workflow generation, the system bridges the long‑standing gap between high‑level scientific intent and low‑level execution engines.
Key Contributions
- Agentic AI pipeline that splits the problem into a semantic LLM layer, a deterministic workflow‑generation layer, and a knowledge‑base “Skills” layer.
- Skills framework: markdown‑based, human‑authorable modules that encode domain vocabularies, parameter constraints, and optimization heuristics.
- Deterministic workflow DAGs: once the intent is extracted, the same input always yields the same reproducible workflow graph.
- Empirical validation on a real‑world population‑genetics pipeline (1000 Genomes) using Hyperflow on Kubernetes, showing near‑real‑time query handling.
- Ablation study on 150 natural‑language queries demonstrating a jump from 44 % to 83 % full‑match intent accuracy when Skills are used.
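To make the Skills idea concrete, here is a minimal sketch of intent validation against a Skill. In the paper, Skills are human-authored markdown files; here a parsed Skill is represented as a plain dictionary, and all field names, the `validate_intent` helper, and the constraint shapes are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch: checking an LLM-extracted intent against a parsed
# "Skill" module. The Skill's vocabulary and parameter ranges bound what
# the semantic layer is allowed to request.

# A Skill (markdown in the paper) parsed into a small constraint table:
gwas_skill = {
    "task": "gwas",
    "allowed_datasets": {"1000-genomes"},
    "parameters": {
        "chromosome": {"type": int, "min": 1, "max": 22},
    },
}

def validate_intent(intent: dict, skill: dict) -> list[str]:
    """Return a list of problems; an empty list means the intent passes."""
    errors = []
    if intent.get("dataset") not in skill["allowed_datasets"]:
        errors.append(f"unknown dataset: {intent.get('dataset')}")
    for name, rule in skill["parameters"].items():
        value = intent.get(name)
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif not (rule["min"] <= value <= rule["max"]):
            errors.append(f"{name}: {value} outside [{rule['min']}, {rule['max']}]")
    return errors

intent = {"task": "gwas", "dataset": "1000-genomes", "chromosome": 22}
print(validate_intent(intent, gwas_skill))  # []
```

An intent that names an unknown dataset or an out-of-range chromosome would return a non-empty error list, which is where the system's "correcting or rejecting ambiguous parts" behavior hooks in.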
Methodology
- Semantic Layer (LLM) – A large language model receives the user’s natural‑language question and produces a structured intent (e.g., “run a GWAS on chromosome 22 using the 1000 Genomes dataset”).
- Knowledge Layer (Skills) – Domain experts write markdown “Skill” files that map scientific terminology to concrete workflow components, define allowed parameter ranges, and suggest performance‑tuning strategies. The system validates the LLM’s intent against these Skills, correcting or rejecting ambiguous parts.
- Deterministic Layer – A rule‑based generator consumes the validated intent and the relevant Skills to emit a directed‑acyclic graph (DAG) that conforms to the Hyperflow workflow description language. Because this step is purely rule‑based, identical intents always produce identical DAGs.
- Execution – The generated DAG is submitted to Hyperflow, which schedules containers on a Kubernetes cluster. The pipeline measures total latency, LLM inference cost, and data‑movement overhead.
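The determinism property of the third layer can be sketched in a few lines: a purely rule-based generator maps a validated intent to a DAG, so equal intents always yield byte-identical graphs. The task names, DAG shape, and fingerprinting helper below are illustrative assumptions; the actual system emits Hyperflow workflow descriptions.

```python
import hashlib
import json

# Hypothetical sketch of the deterministic layer: no randomness, no LLM,
# just rules from intent to workflow graph.

def generate_dag(intent: dict) -> dict:
    """Rule-based mapping from a validated intent to a workflow DAG."""
    chrom = intent["chromosome"]
    return {
        "tasks": [
            {"name": "fetch", "args": [intent["dataset"], f"chr{chrom}"], "deps": []},
            {"name": "qc", "args": [], "deps": ["fetch"]},
            {"name": "gwas", "args": [f"chr{chrom}"], "deps": ["qc"]},
        ]
    }

def dag_fingerprint(dag: dict) -> str:
    # Canonical JSON (sorted keys) so identical DAGs hash identically.
    return hashlib.sha256(json.dumps(dag, sort_keys=True).encode()).hexdigest()

a = generate_dag({"dataset": "1000-genomes", "chromosome": 22})
b = generate_dag({"dataset": "1000-genomes", "chromosome": 22})
assert dag_fingerprint(a) == dag_fingerprint(b)  # same intent, same DAG
```

This is what allows labs to share a query (or its fingerprint) instead of a workflow script: regenerating from the same intent reproduces the exact same graph.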
Results & Findings
| Metric | Baseline (no Skills) | With Skills |
|---|---|---|
| Full‑match intent accuracy | 44 % | 83 % |
| Data transferred per query | – | 92 % reduction (deferred generation avoids unnecessary intermediate files) |
| End‑to‑end latency (incl. LLM) | – | < 15 s per query |
| Cost per query (LLM inference) | – | ≈ $0.001 |
The study shows that the Skills layer not only boosts semantic understanding but also yields substantial runtime savings by pruning unnecessary data movement. The overall system remains lightweight enough for on‑demand scientific queries without incurring prohibitive cloud costs.
Practical Implications
- Rapid prototyping: Researchers can spin up complex analyses (e.g., GWAS, RNA‑seq pipelines) by typing a single sentence, dramatically shortening the “idea‑to‑experiment” cycle.
- Reproducibility as a service: Because the deterministic layer guarantees identical DAGs for the same intent, labs can share queries instead of bulky workflow scripts, ensuring consistent results across sites.
- Cost‑effective cloud usage: The sub‑cent‑per‑query price point makes it feasible to expose scientific workflows as SaaS endpoints for internal platforms or public portals.
- Lowered expertise barrier: Non‑engineer scientists no longer need deep knowledge of Kubernetes, container orchestration, or workflow DSLs; the Skills layer encapsulates that expertise.
- Extensible ecosystem: New domains (e.g., climate modeling, drug discovery) can be onboarded simply by authoring additional Skills, enabling a plug‑and‑play expansion of the system.
Limitations & Future Work
- Skill authoring overhead: While markdown Skills are lightweight, creating and maintaining high‑quality Skill libraries still requires domain experts and can become a bottleneck for niche fields.
- LLM reliance for intent extraction: Errors in the semantic layer (e.g., ambiguous phrasing) can propagate downstream; the current system mitigates this with validation but does not eliminate it.
- Scalability to massive DAGs: The evaluation focused on a single‑node genomics workflow; future work should test the architecture on multi‑stage, multi‑petabyte pipelines.
- Security & provenance: Automated generation of workflows raises concerns about unintentional data leakage or misuse; integrating fine‑grained access controls and audit trails is an open research direction.
Overall, the paper demonstrates a promising path toward truly “natural‑language‑driven” scientific computing, turning research questions into reproducible, cloud‑native workflows with minimal human friction.
Authors
- Bartosz Balis
- Michal Orzechowski
- Piotr Kica
- Michal Dygas
- Michal Kuszewski
Paper Information
- arXiv ID: 2604.21910v1
- Categories: cs.AI
- Published: April 23, 2026