[Paper] DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
Source: arXiv - 2604.25914v1
Overview
The paper introduces DV‑World, a new benchmark that puts data‑visualization (DV) agents through the kind of messy, multi‑step tasks they would face in real enterprise environments. By covering spreadsheet editing, code‑driven visual evolution, and interactive intent clarification, the authors expose gaps in current large‑language‑model (LLM)‑based visualizers that traditional sandbox‑only tests miss.
Key Contributions
- A 260‑task suite spanning three realistic DV domains:
- DV‑Sheet – native spreadsheet manipulation, chart/dashboard creation, and error‑repair.
- DV‑Evolution – adapting existing visual artifacts to new data across multiple programming languages/frameworks.
- DV‑Interact – proactive intent alignment with a user‑simulator that generates ambiguous, evolving requirements.
- Hybrid evaluation framework that combines:
- Table‑value Alignment for strict numerical correctness, and
- MLLM‑as‑a‑Judge with rubric‑based scoring for semantic‑visual quality.
- Comprehensive baseline study showing that even the strongest publicly available LLMs (e.g., GPT‑4‑Turbo, Claude‑3) achieve < 50 % overall success, highlighting a substantial performance gap.
- Open‑source release of the dataset, evaluation scripts, and user‑simulator, enabling reproducible research and industry‑focused development.
Methodology
- Task Design – Each of the 260 tasks mirrors a step in a professional DV workflow (e.g., “add a trend line to an existing Excel chart”, “port a Python‑Matplotlib plot to a D3.js interactive dashboard”, “clarify a vague user request for a sales funnel visualization”).
- Agent Interaction – Agents receive a textual prompt plus any necessary artefacts (spreadsheets, code snippets, prior visualizations). For DV‑Interact, a turn‑based dialogue with a simulated user is required.
- Execution Environment – Unlike sandbox‑only benchmarks, DV‑World runs agents in a real‑world toolchain (Excel via COM, Jupyter notebooks for Python/R, Node.js for JavaScript). This forces agents to handle file I/O, library imports, and platform‑specific quirks; a sketch of the Excel‑over‑COM style of interaction appears after this list.
- Evaluation –
- Numerical Alignment: The generated visual’s underlying data table is compared element‑wise to the ground truth using a tolerance‑based metric.
- Semantic‑Visual Scoring: An LLM judge reads the prompt, the produced visualization (or code), and a rubric (e.g., “chart type matches intent, axes labeled correctly, legend present”) and returns a score from 0 to 5.
- Final performance is the average of the two components across all tasks; a minimal sketch of this scoring scheme also appears below.
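As a concrete illustration of the grounding the Execution Environment demands, here is a minimal sketch (not from the paper) that drives a real Excel instance over COM using pywin32 on Windows; the workbook path, data range, and chart settings are invented for the example.

```python
# Minimal sketch of Excel automation via COM (pywin32, Windows-only).
# The workbook path, range, and chart details below are hypothetical.
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\tasks\sales.xlsx")  # hypothetical task artefact
ws = wb.Worksheets(1)

# Add a clustered-column chart over the task's data range.
chart = ws.Shapes.AddChart2(-1, 51).Chart  # 51 = xlColumnClustered
chart.SetSourceData(ws.Range("A1:B13"))
chart.HasTitle = True
chart.ChartTitle.Text = "Monthly Sales"

wb.Save()
excel.Quit()
```

An agent operating at this level must cope with real failure modes (locked files, missing worksheets, COM errors) that a sandboxed code interpreter never surfaces.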
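The following is a minimal sketch of how the two evaluation components could compose, under our own reading of the description above; the function names, the normalization of the judge score to [0, 1], and the equal weighting are assumptions, not the paper's released code.

```python
# Sketch of DV-World's two-part scoring as described above.
# Normalization and equal weighting are our assumptions.
import numpy as np

def table_value_alignment(pred: np.ndarray, truth: np.ndarray,
                          rtol: float = 1e-3) -> float:
    """Fraction of cells matching the ground truth within a relative tolerance."""
    if pred.shape != truth.shape:
        return 0.0
    return float(np.mean(np.isclose(pred, truth, rtol=rtol)))

def semantic_visual_score(judge_score: int) -> float:
    """Normalize the MLLM judge's 0-5 rubric score to [0, 1]."""
    return judge_score / 5.0

def task_score(pred: np.ndarray, truth: np.ndarray, judge_score: int) -> float:
    """Final task score: average of the numerical and semantic components."""
    return 0.5 * (table_value_alignment(pred, truth)
                  + semantic_visual_score(judge_score))

# Example: data reproduced exactly, judge awarded 4 out of 5.
pred = truth = np.array([[1.0, 2.0], [3.0, 4.0]])
print(task_score(pred, truth, 4))  # 0.5 * (1.0 + 0.8) = 0.9
```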
Results & Findings
| Model | DV‑Sheet | DV‑Evolution | DV‑Interact | Overall |
|---|---|---|---|---|
| GPT‑4‑Turbo | 48 % | 42 % | 35 % | 45 % |
| Claude‑3 | 45 % | 38 % | 33 % | 42 % |
| LLaMA‑2‑70B | 31 % | 27 % | 22 % | 27 % |
- Numerical precision fares better than semantic‑visual quality: roughly 70 % of successful cases pass the alignment check, while only about 30 % satisfy the visual rubric, i.e., agents get the underlying data right more often than the presentation.
- Agents struggle most with DV‑Evolution, where they must understand and rewrite code across languages (Python → R, JavaScript → Vega‑Lite).
- In DV‑Interact, the simulated user’s ambiguous requests cause a sharp drop in success, exposing weak intent‑clarification and dialogue management.
- Error analysis shows frequent failures in (a) handling spreadsheet formulas, (b) installing or importing the correct visualization libraries, and (c) asking clarifying questions when the user intent is underspecified.
Practical Implications
- Tooling for Developers – The benchmark highlights that current LLM‑based assistants cannot be trusted to autonomously generate production‑grade dashboards. Teams should treat them as co‑pilots that need human oversight, especially for cross‑language refactoring and ambiguous requirements.
- Enterprise Automation – Companies looking to automate report generation will need to invest in domain‑specific fine‑tuning or hybrid pipelines (LLM + rule‑based validators, sketched after this list) to meet the precision demanded by finance or operations teams.
- Product Roadmaps – Visualization platforms (e.g., Tableau, Power BI) can use DV‑World to benchmark and improve their AI‑assist features, focusing on better intent disambiguation and environmental grounding (e.g., direct Excel API calls).
- Developer Education – The tasks serve as realistic practice problems for engineers learning to integrate LLMs with data‑science toolchains, encouraging a mindset that blends natural‑language reasoning with concrete API usage.
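To make the hybrid‑pipeline idea concrete, here is a hypothetical rule‑based validator that could gate an LLM‑generated chart before it reaches a report. It assumes a Vega‑Lite‑style JSON spec layout, and the rules themselves are illustrative, not part of DV‑World.

```python
# Hypothetical rule-based validator for an LLM-generated chart spec.
# Assumes a Vega-Lite-style layout; the rules are illustrative only.
def validate_chart_spec(spec: dict) -> list[str]:
    """Return human-readable problems; an empty list means the spec passed."""
    problems = []
    if "mark" not in spec:
        problems.append("no mark (chart type) declared")
    encoding = spec.get("encoding", {})
    for channel in ("x", "y"):
        enc = encoding.get(channel)
        if enc is None:
            problems.append(f"missing {channel} encoding")
        elif "field" not in enc:
            problems.append(f"{channel} encoding has no data field")
        elif "title" not in enc:
            problems.append(f"{channel} axis is unlabeled")
    return problems

# Example: the y axis lacks an explicit label, so the validator flags it.
spec = {"mark": "bar",
        "encoding": {"x": {"field": "month", "title": "Month"},
                     "y": {"field": "sales"}}}
print(validate_chart_spec(spec))  # ['y axis is unlabeled']
```

Cheap deterministic checks like these catch exactly the rubric failures (missing labels, wrong chart type) that the benchmark shows LLMs get wrong most often.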
Limitations & Future Work
- Scope of Domains – DV‑World currently covers spreadsheets, code‑based visualizations, and a simulated dialogue; it does not yet include GIS‑style maps, real‑time streaming dashboards, or VR/AR visualizations.
- Simulator Realism – The user simulator follows scripted ambiguity patterns; real users may exhibit richer conversational behaviours, which could affect agent performance. An invented example of such a scripted turn appears after this list.
- Evaluation Bias – Relying on an LLM judge introduces its own biases; future work could incorporate human expert ratings for a subset of tasks to calibrate the rubric scores.
- Scalability – Running agents in full toolchains is computationally expensive; optimizing the benchmark for large‑scale evaluation (e.g., containerized micro‑environments) is an open engineering challenge.
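To make the simulator‑realism point concrete, a scripted‑ambiguity turn might look like the following; every detail here is invented, since the summary does not show the simulator's internals.

```python
# Invented example of a scripted-ambiguity user turn; nothing here
# comes from DV-World's actual simulator.
import random

AMBIGUOUS_OPENERS = [
    "Show me how sales are doing.",               # no metric, no chart type
    "Make the dashboard pop a bit more.",         # vague aesthetic request
    "Compare the regions, you know which ones.",  # unresolved entities
]

def simulated_user_turn(agent_question: str | None) -> str:
    """Reveal a concrete requirement only if the agent asked for one."""
    if agent_question and "?" in agent_question:
        return "Monthly revenue for the EMEA and APAC regions, as a line chart."
    return random.choice(AMBIGUOUS_OPENERS)
```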
By exposing these gaps, DV‑World sets a concrete target for the next generation of data‑visualization agents that can truly operate in the messy, multi‑tool ecosystems of modern enterprises.
Authors
- Jinxiang Meng
- Shaoping Huang
- Fangyu Lei
- Jingyu Guo
- Haoxiang Liu
- Jiahao Su
- Sihan Wang
- Yao Wang
- Enrui Wang
- Ye Yang
- Hongze Chai
- Jinming Lv
- Anbang Yu
- Huangjing Zhang
- Yitong Zhang
- Yiming Huang
- Zeyao Ma
- Shizhu He
- Jun Zhao
- Kang Liu
Paper Information
- arXiv ID: 2604.25914v1
- Categories: cs.CL
- Published: April 28, 2026