[Paper] DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
Source: arXiv - 2604.25914v1
Overview
The paper introduces DV‑World, a new benchmark that puts data‑visualization (DV) agents through the kind of messy, multi‑step tasks they would face in real enterprise environments. By covering spreadsheet editing, code‑driven visual evolution, and interactive intent clarification, the authors expose gaps in current large‑language‑model (LLM)‑based visualizers that traditional sandbox‑only tests miss.
Key Contributions
- A 260‑task suite spanning three realistic DV domains:
- DV‑Sheet – native spreadsheet manipulation, chart/dashboard creation, and error‑repair.
- DV‑Evolution – adapting existing visual artifacts to new data across multiple programming languages/frameworks.
- DV‑Interact – proactive intent alignment with a user‑simulator that generates ambiguous, evolving requirements.
- Hybrid evaluation framework that combines:
- Table‑value Alignment for strict numerical correctness, and
- MLLM‑as‑a‑Judge with rubric‑based scoring for semantic‑visual quality.
- Comprehensive baseline study showing that even the strongest publicly available LLMs (e.g., GPT‑4‑Turbo, Claude‑3) achieve < 50 % overall success, highlighting a substantial performance gap.
- Open‑source release of the dataset, evaluation scripts, and user‑simulator, enabling reproducible research and industry‑focused development.
Methodology
- Task Design – Each of the 260 tasks mirrors a step in a professional DV workflow (e.g., “add a trend line to an existing Excel chart”, “port a Python‑Matplotlib plot to a D3.js interactive dashboard”, “clarify a vague user request for a sales funnel visualization”).
- Agent Interaction – Agents receive a textual prompt plus any necessary artefacts (spreadsheets, code snippets, prior visualizations). For DV‑Interact, a turn‑based dialogue with a simulated user is required.
- Execution Environment – Unlike sandbox‑only benchmarks, DV‑World runs agents in a real‑world toolchain (Excel via COM, Jupyter notebooks for Python/R, Node.js for JavaScript). This forces agents to handle file I/O, library imports, and platform‑specific quirks; a sketch of the Excel‑over‑COM style of interaction appears after this list.
- Evaluation –
- Numerical Alignment: The generated visual’s underlying data table is compared element‑wise to the ground truth using a tolerance‑based metric.
- Semantic‑Visual Scoring: An LLM judge reads the prompt, the produced visualization (or code), and a rubric (e.g., “chart type matches intent, axes labeled correctly, legend present”) and returns a score from 0 to 5.
- Final performance is the average of the two components across all tasks; a minimal sketch of this scoring scheme also appears below.
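As a concrete illustration of the grounding the Execution Environment demands, here is a minimal sketch (not from the paper) that drives a real Excel instance over COM using pywin32 on Windows; the workbook path, data range, and chart settings are invented for the example.

```python
# Minimal sketch of Excel automation via COM (pywin32, Windows-only).
# The workbook path, range, and chart details below are hypothetical.
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\tasks\sales.xlsx")  # hypothetical task artefact
ws = wb.Worksheets(1)

# Add a clustered-column chart over the task's data range.
chart = ws.Shapes.AddChart2(-1, 51).Chart  # 51 = xlColumnClustered
chart.SetSourceData(ws.Range("A1:B13"))
chart.HasTitle = True
chart.ChartTitle.Text = "Monthly Sales"

wb.Save()
excel.Quit()
```

An agent operating at this level must cope with real failure modes (locked files, missing worksheets, COM errors) that a sandboxed code interpreter never surfaces.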
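The following is a minimal sketch of how the two evaluation components could compose, under our own reading of the description above; the function names, the normalization of the judge score to [0, 1], and the equal weighting are assumptions, not the paper's released code.

```python
# Sketch of DV-World's two-part scoring as described above.
# Normalization and equal weighting are our assumptions.
import numpy as np

def table_value_alignment(pred: np.ndarray, truth: np.ndarray,
                          rtol: float = 1e-3) -> float:
    """Fraction of cells matching the ground truth within a relative tolerance."""
    if pred.shape != truth.shape:
        return 0.0
    return float(np.mean(np.isclose(pred, truth, rtol=rtol)))

def semantic_visual_score(judge_score: int) -> float:
    """Normalize the MLLM judge's 0-5 rubric score to [0, 1]."""
    return judge_score / 5.0

def task_score(pred: np.ndarray, truth: np.ndarray, judge_score: int) -> float:
    """Final task score: average of the numerical and semantic components."""
    return 0.5 * (table_value_alignment(pred, truth)
                  + semantic_visual_score(judge_score))

# Example: data reproduced exactly, judge awarded 4 out of 5.
pred = truth = np.array([[1.0, 2.0], [3.0, 4.0]])
print(task_score(pred, truth, 4))  # 0.5 * (1.0 + 0.8) = 0.9
```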
Results & Findings
| Model | DV‑Sheet | DV‑Evolution | DV‑Interact | Overall |
|---|---|---|---|---|
| GPT‑4‑Turbo | 48 % | 42 % | 35 % | 45 % |
| Claude‑3 | 45 % | 38 % | 33 % | 42 % |
| LLaMA‑2‑70B | 31 % | 27 % | 22 % | 27 % |
- Numerical precision fares better than semantic‑visual quality: roughly 70 % of successful cases pass the alignment check, while only about 30 % satisfy the visual rubric, i.e., agents get the underlying data right more often than the presentation.
- Agents struggle most with DV‑Evolution, where they must understand and rewrite code across languages (Python → R, JavaScript → Vega‑Lite).
- In DV‑Interact, the simulated user’s ambiguous requests cause a sharp drop in success, exposing weak intent‑clarification and dialogue management.
- Error analysis shows frequent failures in (a) handling spreadsheet formulas, (b) installing or importing the correct visualization libraries, and (c) asking clarifying questions when the user intent is underspecified.
Practical Implications
- Tooling for Developers – The benchmark highlights that current LLM‑based assistants cannot be trusted to autonomously generate production‑grade dashboards. Teams should treat them as co‑pilots that need human oversight, especially for cross‑language refactoring and ambiguous requirements.
- Enterprise Automation – Companies looking to automate report generation will need to invest in domain‑specific fine‑tuning or hybrid pipelines (LLM + rule‑based validators, sketched after this list) to meet the precision demanded by finance or operations teams.
- Product Roadmaps – Visualization platforms (e.g., Tableau, Power BI) can use DV‑World to benchmark and improve their AI‑assist features, focusing on better intent disambiguation and environmental grounding (e.g., direct Excel API calls).
- Developer Education – The tasks serve as realistic practice problems for engineers learning to integrate LLMs with data‑science toolchains, encouraging a mindset that blends natural‑language reasoning with concrete API usage.
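To make the hybrid‑pipeline idea concrete, here is a hypothetical rule‑based validator that could gate an LLM‑generated chart before it reaches a report. It assumes a Vega‑Lite‑style JSON spec layout, and the rules themselves are illustrative, not part of DV‑World.

```python
# Hypothetical rule-based validator for an LLM-generated chart spec.
# Assumes a Vega-Lite-style layout; the rules are illustrative only.
def validate_chart_spec(spec: dict) -> list[str]:
    """Return human-readable problems; an empty list means the spec passed."""
    problems = []
    if "mark" not in spec:
        problems.append("no mark (chart type) declared")
    encoding = spec.get("encoding", {})
    for channel in ("x", "y"):
        enc = encoding.get(channel)
        if enc is None:
            problems.append(f"missing {channel} encoding")
        elif "field" not in enc:
            problems.append(f"{channel} encoding has no data field")
        elif "title" not in enc:
            problems.append(f"{channel} axis is unlabeled")
    return problems

# Example: the y axis lacks an explicit label, so the validator flags it.
spec = {"mark": "bar",
        "encoding": {"x": {"field": "month", "title": "Month"},
                     "y": {"field": "sales"}}}
print(validate_chart_spec(spec))  # ['y axis is unlabeled']
```

Cheap deterministic checks like these catch exactly the rubric failures (missing labels, wrong chart type) that the benchmark shows LLMs get wrong most often.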
Limitations & Future Work
- Scope of Domains – DV‑World currently covers spreadsheets, code‑based visualizations, and a simulated dialogue; it does not yet include GIS‑style maps, real‑time streaming dashboards, or VR/AR visualizations.
- Simulator Realism – The user simulator follows scripted ambiguity patterns; real users may exhibit richer conversational behaviours, which could affect agent performance. An invented example of such a scripted turn appears after this list.
- Evaluation Bias – Relying on an LLM judge introduces its own biases; future work could incorporate human expert ratings for a subset of tasks to calibrate the rubric scores.
- Scalability – Running agents in full toolchains is computationally expensive; optimizing the benchmark for large‑scale evaluation (e.g., containerized micro‑environments) is an open engineering challenge.
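To make the simulator‑realism point concrete, a scripted‑ambiguity turn might look like the following; every detail here is invented, since the summary does not show the simulator's internals.

```python
# Invented example of a scripted-ambiguity user turn; nothing here
# comes from DV-World's actual simulator.
import random

AMBIGUOUS_OPENERS = [
    "Show me how sales are doing.",               # no metric, no chart type
    "Make the dashboard pop a bit more.",         # vague aesthetic request
    "Compare the regions, you know which ones.",  # unresolved entities
]

def simulated_user_turn(agent_question: str | None) -> str:
    """Reveal a concrete requirement only if the agent asked for one."""
    if agent_question and "?" in agent_question:
        return "Monthly revenue for the EMEA and APAC regions, as a line chart."
    return random.choice(AMBIGUOUS_OPENERS)
```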
By exposing these gaps, DV‑World sets a concrete target for the next generation of data‑visualization agents that can truly operate in the messy, multi‑tool ecosystems of modern enterprises.
Authors
- Jinxiang Meng
- Shaoping Huang
- Fangyu Lei
- Jingyu Guo
- Haoxiang Liu
- Jiahao Su
- Sihan Wang
- Yao Wang
- Enrui Wang
- Ye Yang
- Hongze Chai
- Jinming Lv
- Anbang Yu
- Huangjing Zhang
- Yitong Zhang
- Yiming Huang
- Zeyao Ma
- Shizhu He
- Jun Zhao
- Kang Liu
Paper Information
- arXiv ID: 2604.25914v1
- Categories: cs.CL
- Published: April 28, 2026