[Paper] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Published: December 29, 2025 at 11:05 AM EST
3 min read

Source: arXiv - 2512.23565v1

Overview

A new benchmark called RxnBench puts multimodal large language models (LLMs that can read text and interpret images) to the test on real‑world chemistry papers. By focusing on how well these models understand reaction schemes, tables, and narrative text, the authors expose a hidden performance gap that matters for any AI‑driven chemistry workflow.

Key Contributions

  • RxnBench benchmark – a two‑level suite (single‑figure QA and full‑document QA) built from 305 reaction schematics and 108 peer‑reviewed articles.
  • 1,525 fine‑grained QA pairs that require visual parsing of molecular structures, recognition of arrows/mechanisms, and logical inference (a hypothetical record layout is sketched after this list).
  • Comprehensive evaluation of several state‑of‑the‑art multimodal LLMs (e.g., GPT‑4V, LLaVA, MiniGPT‑4) on both tasks.
  • Empirical insight that inference‑time reasoning modules boost performance, yet no model reaches 50 % accuracy on the full‑document task.
  • Clear call‑to‑action for domain‑specific visual encoders and stronger chemical reasoning components.
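
To make the benchmark's structure concrete, here is a minimal sketch of how a single RxnBench item could be represented in code. The field names, the `RxnBenchItem` class, and the example values are assumptions for illustration only; the paper's actual data schema is not reproduced in this summary.

```python
# Hypothetical sketch of a single RxnBench QA record.
# Field names (task_level, figure_id, question, choices, answer) are assumptions,
# not the benchmark's published schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RxnBenchItem:
    task_level: str              # "SF-QA" (single-figure) or "FD-QA" (full-document)
    source_doi: str              # peer-reviewed article the item was curated from
    figure_id: Optional[str]     # reaction schematic referenced by the question
    question: str                # e.g. "What is the product's functional group?"
    choices: list[str] = field(default_factory=list)  # empty for short-answer items
    answer: str = ""             # gold-standard answer used for automatic scoring

# Illustrative example item (values invented for this sketch)
item = RxnBenchItem(
    task_level="SF-QA",
    source_doi="10.1000/example",
    figure_id="scheme-2",
    question="Which reagent acts as the oxidant in this scheme?",
    choices=["DMP", "NaBH4", "Pd/C", "TFA"],
    answer="DMP",
)
```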

Methodology

  1. Data Curation – The team mined open‑access chemistry journals, extracted PDF pages, and manually selected reaction schemes that contain rich visual cues (structures, reagents, conditions).
  2. Task Design
    • SF‑QA (Single‑Figure QA): Each reaction diagram is paired with multiple‑choice or short‑answer questions that probe visual perception (e.g., “What is the product’s functional group?”) and mechanistic reasoning (e.g., “Which step is rate‑determining?”).
    • FD‑QA (Full‑Document QA): Models receive the entire article (text + all figures + tables) and must answer higher‑level questions that require stitching information across modalities (e.g., “What catalyst was used in the most efficient pathway described?”).
  3. Model Evaluation – Prompts were standardized across models; outputs were automatically scored against a gold‑standard answer key (a minimal scoring sketch follows this list). For models that support chain‑of‑thought or tool‑use, the authors enabled those features to measure the impact of inference‑time reasoning.
  4. Analysis – Accuracy, error typology (visual mis‑recognition vs. logical fallacy), and runtime were logged to pinpoint failure modes.
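
The standardized prompting and automatic scoring in steps 2–3 can be pictured with a short sketch. It assumes a generic `ask_model(prompt, images)` callable and reuses the hypothetical `RxnBenchItem` fields from the earlier sketch; the authors' actual prompts, scoring script, and error typology are not reproduced here.

```python
# Minimal sketch of a standardized-prompt evaluation loop, assuming a generic
# `ask_model(prompt, images)` callable supplied by whichever model is under test.
from typing import Callable, Iterable

PROMPT_TEMPLATE = (
    "You are given a reaction scheme from a chemistry paper.\n"
    "Question: {question}\n"
    "Options: {choices}\n"
    "Answer with the single best option."
)

def evaluate(items: Iterable, ask_model: Callable[[str, list], str]) -> float:
    """Score model answers against gold answers and return raw accuracy."""
    items = list(items)
    correct = 0
    for it in items:
        prompt = PROMPT_TEMPLATE.format(
            question=it.question, choices=", ".join(it.choices)
        )
        prediction = ask_model(prompt, [it.figure_id])   # image(s) passed alongside text
        if prediction.strip().lower() == it.answer.strip().lower():
            correct += 1
    return correct / max(len(items), 1)                  # raw accuracy, as in the results table
```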

Results & Findings

| Task | Best Model (with reasoning) | Raw Accuracy | Main Failure Mode |
| --- | --- | --- | --- |
| SF‑QA | GPT‑4V (reasoning) | 38 % | Mis‑identifying stereochemistry, confusing similar substructures |
| FD‑QA | LLaVA‑13B (reasoning) | 27 % | Inability to link figure captions to narrative, missing table values |
| Text‑only extraction (baseline) | All models | > 80 % | |

  • Visual perception is the bottleneck: models often read the surrounding caption correctly but mis‑read the actual molecular diagram.
  • Reasoning modules (chain‑of‑thought, tool‑use) give a ~10‑15 % boost, confirming that “thinking” helps but does not close the gap.
  • Cross‑modal integration remains weak; none of the evaluated systems can reliably combine data from a table, a figure, and a paragraph to answer a composite question.

Practical Implications

  • Automated literature mining: Current multimodal LLMs can reliably extract textual metadata (titles, abstracts, captions) but cannot yet replace chemists for extracting reaction conditions or mechanistic insights.
  • AI‑assisted synthesis planning: Tools that rely on LLM‑driven reaction extraction will need a specialized visual front‑end (e.g., a chemistry‑trained image encoder) to avoid propagating structural errors.
  • Knowledge‑graph construction: Building searchable reaction databases from PDFs will still require human‑in‑the‑loop validation for the structural components (see the routing sketch after this list).
  • Productivity plugins: IDE‑style extensions for chemists (e.g., “highlight reagents in this PDF”) can be built today, but deeper question answering will need the next generation of models.
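
As a rough illustration of the human‑in‑the‑loop point above, an extraction pipeline could route LLM‑extracted reactions to a chemist based on modality and model confidence. The `route_extraction` function, its field names, and the 0.9 threshold are hypothetical choices for this sketch, not part of the paper.

```python
# Illustrative human-in-the-loop gating for LLM-extracted reaction records.
# The record fields ("modality", "confidence") and the threshold are assumptions.
REVIEW_THRESHOLD = 0.9   # structural extractions below this confidence go to a chemist

def route_extraction(extracted: dict) -> str:
    """Decide whether an extracted record can enter the knowledge graph directly."""
    if extracted.get("modality") == "text":
        return "auto-accept"        # text-only metadata extraction scored > 80 % in the benchmark
    if extracted.get("confidence", 0.0) >= REVIEW_THRESHOLD:
        return "auto-accept"
    return "human-review"           # structural/visual extractions still need validation
```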

Limitations & Future Work

  • Domain coverage: RxnBench focuses on organic synthesis papers; other sub‑fields (materials, biochemistry) are not represented.
  • Scale of evaluation: Only a handful of publicly available multimodal LLMs were tested; proprietary models may behave differently.
  • Human annotation bias: The QA pairs were authored by a small team of chemists, which could limit diversity of question styles.
  • Future directions suggested by the authors include: training visual encoders on large reaction‑scheme datasets, integrating symbolic chemistry reasoning engines (e.g., rule‑based retrosynthesis), and expanding the benchmark to cover multi‑step synthetic routes and kinetic data.

Authors

  • Hanzheng Li
  • Xi Fang
  • Yixuan Li
  • Chaozheng Huang
  • Junjie Wang
  • Xi Wang
  • Hongzhe Bai
  • Bojun Hao
  • Shenyu Lin
  • Huiqi Liang
  • Linfeng Zhang
  • Guolin Ke

Paper Information

  • arXiv ID: 2512.23565v1
  • Categories: cs.CV, cs.AI
  • Published: December 29, 2025