[Paper] EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

Published: June 11, 2026 at 01:20 PM EDT
Source: arXiv - 2606.13602v1

Overview

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
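
The headline percentages can be reproduced from the reported pass counts. The sketch below is only an arithmetic check on the figures quoted above, not the authors' scoring code; the dictionary layout and the function name pass_rate_percent are illustrative assumptions. It does not attempt to reproduce the confidence intervals, since the overview does not say how they were computed.

```python
# Pass counts and attempt totals as quoted in the overview.
# The data layout and helper name are illustrative, not from the paper.
reported = {
    "GPT-5.5 / Pi": (143, 318),
    "GPT-5.5 / OpenAI Codex": (127, 318),
    "Claude Opus 4.8 Max / Pi": (124, 318),
    "GPT-5.4 / Pi": (124, 318),
}


def pass_rate_percent(passed: int, attempts: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100.0 * passed / attempts, 1)


rates = {name: pass_rate_percent(p, n) for name, (p, n) in reported.items()}

# Consistency check on the totals: 16 model-harness pairs with 318 valid
# attempts each account for all 5,088 valid trajectories.
total_trajectories = 16 * 318

# 318 attempts over 106 evaluations works out to 3 attempts per evaluation.
# This is an inference from the numbers; the overview does not state it.
attempts_per_evaluation = 318 / 106

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(f"Total trajectories: {total_trajectories}")
print(f"Attempts per evaluation: {attempts_per_evaluation}")
```

The computed rates match the quoted 45.0%, 39.9%, and 39.0%, and the per-pair attempt count is consistent with the 5,088 trajectory total.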

Key Contributions

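Based on the abstract, the paper contributes:

  • EpiBench, a verifiable benchmark of 106 short-horizon evaluations spanning CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows, each with a deterministically gradable answer.
  • An evaluation of 16 model-harness pairs over 5,088 valid trajectories, in which no system passed a majority of attempts.
  • The observation that agents often locate the right files and compute useful intermediate results, but fail when a task requires deeper, assay-specific scientific judgment.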

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

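The results suggest that current agents can handle file discovery and intermediate computation in epigenomics workflows, but are not yet reliable where assay-specific scientific judgment is required: even the best model-harness pair passed only 45.0% of attempts.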

Authors

  • Harihara Muralidharan
  • Reema Baskar
  • Soo Hee Lee
  • Tim Proctor
  • Kenny Workman

Paper Information

  • arXiv ID: 2606.13602v1
  • Categories: cs.AI
  • Published: June 11, 2026