[Paper] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Published: December 16, 2025 at 12:33 PM EST
4 min read
Source: arXiv - 2512.14620v1

Overview

The paper presents JMMMU‑Pro, a new benchmark that tests how well vision‑language models (VLMs) understand Japanese content when the question is embedded directly in an image. By merging the visual scene and the textual prompt, the benchmark forces models to perform true multimodal reasoning rather than treating text and image as separate inputs. The authors also introduce Vibe Benchmark Construction, a low‑cost pipeline that uses a state‑of‑the‑art image generator (Nano Banana Pro) together with human verification to create high‑quality, diverse visual‑question pairs at scale.

Key Contributions

  • JMMMU‑Pro dataset: Extends the earlier JMMMU benchmark by embedding Japanese question text into the image, creating a more challenging visual‑textual integration task.
  • Vibe Benchmark Construction pipeline: A scalable, human‑in‑the‑loop workflow that leverages generative AI to produce candidate images, then refines them via prompt tweaking and manual validation.
  • Comprehensive evaluation: Shows that current open‑source large multimodal models (LMMs) perform poorly on JMMMU‑Pro, highlighting a gap in Japanese multimodal understanding.
  • Open‑source resources: Releases the dataset, generation scripts, and prompt templates to enable the community to reproduce and extend the benchmark.

Methodology

  1. Prompt‑driven image generation: The authors craft Japanese‑language prompts that describe a visual scene and embed a question (e.g., “What is the color of the car?”) directly into the image. Nano Banana Pro, an image‑generation model capable of rendering crisp Japanese text, generates multiple candidate images per prompt.
  2. Human verification loop: Annotators inspect each generated image for visual fidelity, legibility of the embedded text, and relevance of the question to the scene. If an image fails, the prompt is adjusted (e.g., changing font size, layout, or scene details) and regenerated.
  3. Dataset assembly: Verified images are paired with the original question and a set of answer choices, forming a classic VQA format but with the twist that the model must first locate and read the question within the picture before answering.
  4. Benchmarking: A suite of open‑source LMMs (e.g., LLaVA and MiniGPT‑4) is evaluated on JMMMU‑Pro using standard VQA accuracy metrics; a minimal data‑record and scoring sketch follows this list.
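
To make steps 3 and 4 concrete, here is a minimal sketch, in Python, of what a single benchmark item and the standard multiple‑choice accuracy computation might look like. The field names and the `predict` callback are illustrative assumptions, not the paper's actual data schema or evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class VQAItem:
    """One JMMMU-Pro-style item: the Japanese question is rendered inside the
    image, so only the image and the answer choices reach the model."""
    question_id: str
    image_path: str          # image with the question text embedded in the scene
    choices: Sequence[str]   # e.g. ["赤", "青", "緑", "白"]
    answer: str              # gold choice

def accuracy(items: Sequence[VQAItem],
             predict: Callable[[str, Sequence[str]], str]) -> float:
    """Standard multiple-choice VQA accuracy: exact match on the selected option.
    `predict` takes (image_path, choices) and returns one of the choices."""
    if not items:
        return 0.0
    correct = sum(predict(item.image_path, item.choices) == item.answer
                  for item in items)
    return correct / len(items)
```

Because the question lives inside the image, `predict` receives only the picture and the answer choices; there is no separate question string to fall back on, which is what distinguishes the task from conventional VQA.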

The pipeline is deliberately modular: any image generator that can embed clean Japanese text can replace Nano Banana Pro, and the verification step can be crowdsourced or semi‑automated.
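
A minimal sketch of that modularity, assuming a hypothetical `ImageGenerator` protocol and a human‑review callback (neither is the paper's actual interface), might look like this:

```python
from typing import Callable, Optional, Protocol

class ImageGenerator(Protocol):
    """Any backend able to render legible Japanese text into a scene
    (Nano Banana Pro in the paper) can stand behind this interface."""
    def render(self, prompt: str) -> bytes: ...

def build_item(prompt: str,
               generator: ImageGenerator,
               human_approves: Callable[[bytes], bool],
               revise_prompt: Callable[[str], str],
               max_rounds: int = 3) -> Optional[bytes]:
    """Generate-verify loop: regenerate with a tweaked prompt until a reviewer
    accepts the image or the retry budget runs out."""
    for _ in range(max_rounds):
        image = generator.render(prompt)
        if human_approves(image):       # fidelity, text legibility, question relevance
            return image
        prompt = revise_prompt(prompt)  # e.g. adjust font size, layout, scene details
    return None                         # prompts that never pass review are discarded
```

Swapping Nano Banana Pro for another text‑capable generator, or replacing `human_approves` with a crowdsourced or semi‑automated check, would leave the surrounding loop unchanged.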

Results & Findings

  • Performance gap: All tested open‑source LMMs scored below 30 % accuracy, far lower than their results on English‑centric VQA benchmarks.
  • Error analysis: The biggest failure modes were (a) missing or misreading the embedded Japanese question, and (b) lacking cultural or domain knowledge needed to answer discipline‑specific questions (e.g., history, science).
  • Cost efficiency: Using Vibe Benchmark Construction, the authors built a 10k‑item benchmark for roughly US$2,000, about $0.20 per item and a fraction of traditional data‑collection costs.

These findings confirm that current models are not yet ready for real‑world Japanese multimodal applications and that the benchmark is a useful stress test for future research.

Practical Implications

  • Product localization: Companies building AI assistants for Japanese markets need to ensure their VLMs can read and reason about on‑screen text—a capability that JMMMU‑Pro directly measures.
  • Document AI: Applications such as automated form processing, receipt scanning, or educational tools often involve mixed visual and textual cues; the benchmark highlights the importance of joint perception.
  • Open‑source model development: Researchers can use the Vibe pipeline to quickly spin up new multimodal datasets in other languages or domains, accelerating the creation of niche benchmarks without massive annotation budgets.
  • Evaluation standard: JMMMU‑Pro can become a go‑to sanity check before deploying a VLM in any Japanese‑centric product, similar to how ImageNet is used for vision models.

Limitations & Future Work

  • Scope of disciplines: While the dataset covers many subjects, it still leans toward academic‑style questions; real‑world UI or street‑sign scenarios are under‑represented.
  • Human verification bottleneck: The current pipeline relies on manual checks, which may limit scalability for truly massive benchmarks.
  • Model diversity: Evaluation focused on open‑source LMMs; proprietary models (e.g., GPT‑4V) were not tested, leaving open the question of how close the state of the art truly is.
  • Future directions: The authors suggest extending Vibe to generate dynamic multimodal tasks (e.g., video‑based VQA), incorporating automated OCR‑based validation (a rough sketch follows this list), and exploring cross‑lingual transfer, where a model trained on English VQA is fine‑tuned on JMMMU‑Pro.
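
As a purely illustrative sketch of the proposed OCR‑based validation, one could compare OCR output against the question text that was meant to be rendered and flag low‑similarity images for regeneration. The use of pytesseract and the 0.8 similarity threshold below are assumptions, not details from the paper.

```python
import difflib

import pytesseract            # requires Tesseract with the Japanese language pack
from PIL import Image

def embedded_question_legible(image_path: str, expected_question: str,
                              threshold: float = 0.8) -> bool:
    """Heuristic legibility check: OCR the generated image in Japanese and
    compare the result against the question text that was supposed to be rendered."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path), lang="jpn")
    ocr_text = "".join(ocr_text.split())              # drop OCR whitespace/newlines
    expected = "".join(expected_question.split())
    similarity = difflib.SequenceMatcher(None, ocr_text, expected).ratio()
    return similarity >= threshold                    # threshold is an assumed value
```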

Authors

  • Atsuyuki Miyai
  • Shota Onohara
  • Jeonghun Baek
  • Kiyoharu Aizawa

Paper Information

  • arXiv ID: 2512.14620v1
  • Categories: cs.CL, cs.AI, cs.CV
  • Published: December 16, 2025