[Paper] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Source: arXiv - 2602.15758v1
Overview
The paper introduces ChartEditBench, a new benchmark that tests how well multimodal large language models (MLLMs) can handle incremental chart editing—think of a data analyst who repeatedly tweaks a visualization until it tells the right story. While existing models excel at generating a chart in a single shot, this work probes their ability to maintain context across multiple turns, a capability crucial for real‑world exploratory data analysis.
Key Contributions
- ChartEditBench dataset: 5,000 curated, difficulty‑controlled edit chains (code‑based modifications to charts) with a human‑verified subset for high‑quality evaluation.
- Grounded multi‑turn evaluation framework: Combines execution‑based fidelity checks, pixel‑level visual similarity metrics, and logical code verification to overcome the shortcomings of “LLM‑as‑judge” scoring.
- Comprehensive empirical study: Benchmarks several state‑of‑the‑art MLLMs, revealing how performance drops as the number of editing turns grows.
- Error taxonomy: Identifies where models fail most—data‑centric transformations vs. purely stylistic tweaks—providing a roadmap for future research.
- Open‑source release: Dataset, evaluation scripts, and baseline results are publicly available to spur community progress.
Methodology
Dataset construction
- Start from a base set of charts (e.g., bar, line, scatter) generated from synthetic and real‑world tabular data.
- Define a series of edit intents (e.g., “change X‑axis to log scale”, “filter out rows where sales < 1000”).
- Chain 3–7 edits together to form a modification sequence, controlling difficulty by varying the amount of code change required.
- Human annotators verify that each edit is feasible, correctly expressed, and that the final chart matches the intended description.
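The construction steps above suggest a simple record per edit chain. A minimal sketch follows; the field names and the size-based difficulty heuristic are illustrative assumptions, not the authors' actual schema:

```python
from dataclasses import dataclass

@dataclass
class EditStep:
    intent: str       # natural-language edit request, e.g. "change X-axis to log scale"
    code_patch: str   # the code change that realizes the intent

@dataclass
class EditChain:
    base_code: str                # code for the starting chart
    steps: list                   # 3-7 EditStep objects, applied in order
    human_verified: bool = False  # only a subset is fully human-validated

    @property
    def difficulty(self) -> str:
        # Crude proxy: total size of the required code changes (assumption).
        changed = sum(len(s.code_patch) for s in self.steps)
        return "easy" if changed < 200 else "medium" if changed < 600 else "hard"
```

Chains built this way make it straightforward to bucket the benchmark by difficulty and to filter for the human-verified subset.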
Model interaction protocol
- Each turn the model receives the current chart image, the underlying code (e.g., Python/Matplotlib or Vega‑Lite JSON), and a natural‑language edit request.
- The model must output revised code; the system then executes it to produce the next chart image.
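The turn-by-turn protocol can be sketched as a driver loop. Here `model` and `render` are placeholder callables (any model wrapper and any renderer that executes the code to produce an image would do); this is a sketch of the loop structure, not the paper's harness:

```python
def run_edit_chain(model, render, base_code, edit_requests):
    """Drive one multi-turn episode.

    Each turn the model sees the current code, the most recent chart
    image, and the next natural-language edit request; it must return
    revised code, which is then executed to produce the next image.
    """
    code, images = base_code, []
    for request in edit_requests:
        current_image = images[-1] if images else render(base_code)
        code = model(code=code, image=current_image, request=request)
        images.append(render(code))
    return code, images
```

Because each turn's output becomes the next turn's input, any mistake the model makes propagates forward, which is exactly the error-accumulation effect the benchmark measures.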
Evaluation pipeline
- Execution fidelity: Does the generated code run without errors?
- Visual similarity: Compute pixel‑level metrics (SSIM, LPIPS) between the model‑produced chart and a reference chart.
- Logical verification: Parse both the reference and model code to check that the intended data transformation (filter, aggregation, axis change) actually occurred.
- Scores from these three axes are aggregated into a single benchmark metric, providing a more reliable “ground truth” than pure LLM‑based judgment.
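The three axes can be approximated in a few lines. The sketch below is illustrative, not the paper's exact metric: the execution check only parses the code rather than running it, the visual metric is mean pixel agreement standing in for SSIM/LPIPS, and the aggregation weights are assumptions:

```python
import ast

def execution_fidelity(code: str) -> float:
    """1.0 if the code parses (a stand-in for running it without errors)."""
    try:
        ast.parse(code)
        return 1.0
    except SyntaxError:
        return 0.0

def visual_similarity(img_a, img_b) -> float:
    """Mean pixel agreement on same-sized grayscale grids (toy stand-in for SSIM/LPIPS)."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    diff = sum(abs(a - b) / 255 for a, b in zip(flat_a, flat_b))
    return 1.0 - diff / len(flat_a)

def logical_verification(ref_code: str, model_code: str) -> float:
    """Check that every method call in the reference edit (e.g. set_xscale)
    also appears in the model's code."""
    def calls(code):
        return {n.func.attr for n in ast.walk(ast.parse(code))
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)}
    return 1.0 if calls(ref_code) <= calls(model_code) else 0.0

def benchmark_score(code, ref_code, img, ref_img, weights=(0.4, 0.3, 0.3)):
    parts = (execution_fidelity(code),
             visual_similarity(img, ref_img),
             logical_verification(ref_code, code))
    return sum(w * p for w, p in zip(weights, parts))
```

The key design point survives even in this toy version: a model can produce a chart that looks right (high visual score) while failing the logical check, or vice versa, which is what a single LLM-as-judge score tends to blur.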
Results & Findings
| Model | Avg. Success (single‑turn) | Avg. Success (multi‑turn, 5 steps) | Main failure mode |
|---|---|---|---|
| GPT‑4‑V (vision‑enabled) | 87 % | 58 % | Execution crashes on data filters |
| LLaVA‑13B | 71 % | 42 % | Mis‑interpreting prior edits, losing context |
| MiniGPT‑4 | 65 % | 35 % | Syntax errors in generated code |
- Stylistic edits (color, font, layout) remain relatively robust across turns, with >80 % success even after five interactions.
- Data‑centric edits (filtering, re‑aggregating, changing data source) suffer steep degradation; error accumulation leads to broken pipelines in >40 % of multi‑turn runs.
- The integrated evaluation framework shows that pure LLM‑as‑judge scores overestimate performance by ~15 % because they miss execution failures.
Practical Implications
- Developer tooling: Building IDE‑like assistants for data visualization (e.g., “Chat‑with‑your‑chart”) will need mechanisms for state tracking and error recovery to avoid cascading failures.
- Business intelligence platforms: Embedding MLLMs for conversational chart tweaking can accelerate exploratory analysis, but teams should guard against silent execution errors—automated sanity checks become essential.
- Low‑code/no‑code environments: The benchmark highlights that while MLLMs can handle visual polish automatically, they still need tighter integration with data processing back‑ends to reliably perform substantive data transformations.
- Model training: Future pre‑training or fine‑tuning pipelines should incorporate multi‑turn interaction data and explicit code execution feedback loops to improve robustness.
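The automated sanity checks mentioned above can be as simple as running generated code in an isolated subprocess and surfacing failures instead of swallowing them. A minimal sketch, assuming the generated code is a self-contained Python script:

```python
import pathlib
import subprocess
import sys
import tempfile

def safe_render(chart_code: str, timeout_s: int = 10):
    """Run model-generated chart code in an isolated subprocess.

    Returns (ok, stderr) so the caller can detect crashes or hangs
    rather than silently accepting a broken edit.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = pathlib.Path(tmp) / "chart.py"
        script.write_text(chart_code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stderr
```

A conversational charting assistant could call this after every model turn and roll back to the last known-good code whenever `ok` is false, preventing the cascading failures the benchmark documents.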
Limitations & Future Work
- Dataset scope: ChartEditBench focuses on a limited set of chart types and uses primarily Python/Matplotlib and Vega‑Lite; extending to other libraries (e.g., D3.js, Plotly) would broaden applicability.
- Synthetic bias: Although real‑world tables are included, a large portion of the data is synthetic, which may not capture all quirks of messy production datasets.
- Human verification size: Only a subset of edit chains is fully human‑validated, leaving a small risk of annotation noise in the larger set.
- Future directions: The authors suggest exploring interactive debugging where the model can ask clarification questions, integrating retrieval‑augmented data context, and expanding the benchmark to cover dashboard‑level multi‑chart editing.
Authors
- Manav Nitin Kapadnis
- Lawanya Baghel
- Atharva Naik
- Carolyn Rosé
Paper Information
- arXiv ID: 2602.15758v1
- Categories: cs.CL, cs.AI
- Published: February 17, 2026