[Paper] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Published: November 26, 2025 at 07:44 AM EST
3 min read
Source: arXiv - 2511.21339v1

Overview

The paper introduces SurgMLLMBench, a new benchmark that brings together pixel‑level instrument segmentation and structured visual question answering (VQA) data across laparoscopic, robot‑assisted, and micro‑surgical procedures. By unifying these modalities under a single taxonomy, the authors give researchers a consistent way to train and evaluate multimodal large language models (LLMs) that can “see” and “talk” about surgical scenes.

Key Contributions

  • Unified multimodal dataset that combines high‑resolution video frames, pixel‑wise instrument masks, and VQA pairs for three surgical domains (laparoscopy, robot‑assisted surgery, and micro‑surgery).
  • MAVIS sub‑dataset (Micro‑surgical Artificial Vascular anastomosIS) – the first publicly available micro‑surgical video set with detailed segmentation and reasoning annotations.
  • Standardized taxonomy for instruments, actions, and anatomical structures, eliminating the taxonomy drift that plagued earlier surgical VQA corpora.
  • Baseline experiments showing that a single multimodal LLM trained on the whole benchmark performs competitively in each domain and generalizes well to unseen surgical datasets.
  • Open‑source release plan to foster reproducibility and accelerate research on interactive surgical AI.

Methodology

  1. Data collection & annotation – The authors gathered thousands of video frames from real laparoscopic and robot‑assisted surgeries, plus newly recorded micro‑surgical footage. Trained annotators produced:
    • Segmentation masks for every visible instrument pixel.
    • VQA pairs (question, answer) covering instrument identification, procedural steps, and anatomical context.
  2. Taxonomy design – A hierarchical label schema (e.g., Instrument → Type → Tip; Action → Grasp → Cut) was defined and applied uniformly across all domains.
  3. Model training – A multimodal LLM (vision encoder + language decoder) was fine‑tuned on the combined dataset using a joint loss that balances segmentation (pixel‑wise cross‑entropy) and VQA (cross‑entropy on answer tokens); a minimal sketch of such a loss follows this list.
  4. Evaluation protocol – The benchmark reports the following (the core metrics are sketched in code after this list):
    • Segmentation IoU (intersection‑over‑union) per instrument class.
    • VQA accuracy (exact match) and BLEU/ROUGE for free‑form answers.
    • Cross‑domain transfer tests where the model is evaluated on a domain it was not trained on.
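
The summary does not spell out the exact form of the authors' joint objective, so the snippet below is only a minimal PyTorch‑style sketch of a combined segmentation + VQA loss as described in step 3; the balancing weight lambda_seg, the tensor shapes, and the padding handling are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a joint segmentation + VQA objective (assumed form,
# not the authors' exact implementation); shapes and the balancing
# weight lambda_seg are illustrative choices.
import torch.nn.functional as F

def joint_loss(seg_logits, seg_masks, answer_logits, answer_tokens,
               lambda_seg=1.0, pad_token_id=0):
    """Balance pixel-wise cross-entropy (segmentation) against
    token-level cross-entropy (VQA answers)."""
    # seg_logits: (B, C, H, W) per-pixel class scores
    # seg_masks:  (B, H, W) integer instrument labels
    seg_loss = F.cross_entropy(seg_logits, seg_masks)

    # answer_logits: (B, T, V) decoder scores over the answer vocabulary
    # answer_tokens: (B, T) ground-truth answer token ids
    vqa_loss = F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        answer_tokens.reshape(-1),
        ignore_index=pad_token_id,  # do not penalize padding positions
    )
    return lambda_seg * seg_loss + vqa_loss
```

In practice the weight between the two terms would be tuned on a validation split so that neither mask quality nor answer quality dominates.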
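
The core metrics in step 4, per‑class IoU and exact‑match accuracy, are standard; the sketch below shows one common way to compute them with NumPy, assuming integer class IDs per pixel and plain string answers.

```python
# Standard per-class IoU and exact-match accuracy, sketched with NumPy;
# the label layout (integer class ids per pixel) is an assumption.
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Intersection-over-union for each class present in pred or gt."""
    ious = {}
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:               # skip classes absent from both masks
            ious[c] = inter / union
    return ious

def exact_match(pred_answers, gt_answers):
    """Fraction of VQA answers that match the reference exactly,
    after simple lowercase/whitespace normalization."""
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(p) == norm(g) for p, g in zip(pred_answers, gt_answers))
    return hits / max(len(gt_answers), 1)
```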

Results & Findings

  • The unified model achieved ≈78 % mean IoU on instrument segmentation across all three domains, matching or surpassing domain‑specific baselines.
  • VQA performance reached ≈71 % exact‑match accuracy, with notable gains on reasoning questions (e.g., “Why is the surgeon switching tools?”).
  • When tested on an external laparoscopic dataset (not seen during training), the model retained ≈75 % IoU and ≈68 % VQA accuracy, demonstrating robust generalization.
  • Ablation studies confirmed that joint training on segmentation + VQA yields better VQA scores than training on VQA alone, highlighting the benefit of visual grounding.

Practical Implications

  • Assistive intra‑operative tools: Surgeons could query a real‑time AI assistant (“What instrument is currently in view?” or “Is the vessel fully clipped?”) and receive both textual explanations and highlighted masks; a hypothetical query interface is sketched after this list.
  • Training simulators: Medical educators can embed the model into VR/AR platforms to provide instant feedback on instrument handling and procedural steps.
  • Automated documentation: Post‑operative reports could be auto‑generated by extracting key actions and instrument usage from recorded footage.
  • Cross‑platform AI development: Because the benchmark spans laparoscopy, robotics, and micro‑surgery, developers can build a single model that works on diverse hardware setups, reducing engineering overhead.
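
To make the assistive‑tool scenario concrete, the sketch below shows one hypothetical way such a model could be wrapped behind a single query call that returns both an answer string and an instrument mask; the class and method names (segment, answer) are invented for illustration and are not an API described in the paper.

```python
# Hypothetical interface for an intra-operative assistant built around a
# SurgMLLMBench-style model; all names and signatures below are invented
# for illustration and are not an API from the paper.
from dataclasses import dataclass
import numpy as np

@dataclass
class AssistantReply:
    answer: str        # free-form textual explanation
    mask: np.ndarray   # (H, W) instrument mask to overlay on the video feed

class SurgicalAssistant:
    def __init__(self, model):
        # `model` is assumed to expose separate segmentation and VQA heads.
        self.model = model

    def query(self, frame: np.ndarray, question: str) -> AssistantReply:
        mask = self.model.segment(frame)              # pixel-wise instrument mask
        answer = self.model.answer(frame, question)   # grounded textual answer
        return AssistantReply(answer=answer, mask=mask)

# Usage with a hypothetical loaded model:
# assistant = SurgicalAssistant(model)
# reply = assistant.query(current_frame, "What instrument is currently in view?")
```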

Limitations & Future Work

  • Dataset diversity: While the benchmark covers three domains, it draws on a limited number of hospitals and surgical teams, so variation in institutions, workflows, and equipment may be under‑represented.
  • Real‑time constraints: The baseline models were evaluated offline; latency and hardware requirements for intra‑operative deployment remain open questions.
  • Annotation granularity: Current VQA pairs focus on high‑level reasoning; finer‑grained questions (e.g., force estimation, tissue deformation) are not covered.
  • Future directions proposed by the authors include expanding the benchmark to more specialties (e.g., ENT, orthopedics), adding further modalities such as audio or haptic data, and exploring lightweight model architectures for on‑device inference.

Authors

  • Tae-Min Choi
  • Tae Kyeong Jeong
  • Garam Kim
  • Jaemin Lee
  • Yeongyoon Koh
  • In Cheul Choi
  • Jae-Ho Chung
  • Jong Woong Park
  • Juyoun Park

Paper Information

  • arXiv ID: 2511.21339v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025