[Paper] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Source: arXiv - 2602.15758v1
Overview
The paper introduces ChartEditBench, a new benchmark that tests how well multimodal large language models (MLLMs) can handle incremental chart editing—think of a data analyst who repeatedly tweaks a visualization until it tells the right story. While existing models excel at generating a chart in a single shot, this work probes their ability to maintain context across multiple turns, a capability crucial for real‑world exploratory data analysis.
Key Contributions
- ChartEditBench dataset: 5,000 curated, difficulty‑controlled edit chains (code‑based modifications to charts) with a human‑verified subset for high‑quality evaluation.
- Grounded multi‑turn evaluation framework: Combines execution‑based fidelity checks, pixel‑level visual similarity metrics, and logical code verification to overcome the shortcomings of “LLM‑as‑judge” scoring.
- Comprehensive empirical study: Benchmarks several state‑of‑the‑art MLLMs, revealing how performance drops as the number of editing turns grows.
- Error taxonomy: Identifies where models fail most—data‑centric transformations vs. purely stylistic tweaks—providing a roadmap for future research.
- Open‑source release: Dataset, evaluation scripts, and baseline results are publicly available to spur community progress.
Methodology
Dataset construction
- Start from a base set of charts (e.g., bar, line, scatter) generated from synthetic and real‑world tabular data.
- Define a series of edit intents (e.g., “change X‑axis to log scale”, “filter out rows where sales < 1000”).
- Chain 3–7 edits together to form a modification sequence, controlling difficulty by varying the amount of code change required.
- Human annotators verify that each edit is feasible, correctly expressed, and that the final chart matches the intended description.
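The construction steps above suggest a simple record per edit chain. A minimal sketch follows; the field names and the size-based difficulty heuristic are illustrative assumptions, not the authors' actual schema:

```python
from dataclasses import dataclass

@dataclass
class EditStep:
    intent: str       # natural-language edit request, e.g. "change X-axis to log scale"
    code_patch: str   # the code change that realizes the intent

@dataclass
class EditChain:
    base_code: str                # code for the starting chart
    steps: list                   # 3-7 EditStep objects, applied in order
    human_verified: bool = False  # only a subset is fully human-validated

    @property
    def difficulty(self) -> str:
        # Crude proxy: total size of the required code changes (assumption).
        changed = sum(len(s.code_patch) for s in self.steps)
        return "easy" if changed < 200 else "medium" if changed < 600 else "hard"
```

Chains built this way make it straightforward to bucket the benchmark by difficulty and to filter for the human-verified subset.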
Model interaction protocol
- Each turn the model receives the current chart image, the underlying code (e.g., Python/Matplotlib or Vega‑Lite JSON), and a natural‑language edit request.
- The model must output revised code; the system then executes it to produce the next chart image.
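The turn-by-turn protocol can be sketched as a driver loop. Here `model` and `render` are placeholder callables (any model wrapper and any renderer that executes the code to produce an image would do); this is a sketch of the loop structure, not the paper's harness:

```python
def run_edit_chain(model, render, base_code, edit_requests):
    """Drive one multi-turn episode.

    Each turn the model sees the current code, the most recent chart
    image, and the next natural-language edit request; it must return
    revised code, which is then executed to produce the next image.
    """
    code, images = base_code, []
    for request in edit_requests:
        current_image = images[-1] if images else render(base_code)
        code = model(code=code, image=current_image, request=request)
        images.append(render(code))
    return code, images
```

Because each turn's output becomes the next turn's input, any mistake the model makes propagates forward, which is exactly the error-accumulation effect the benchmark measures.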
Evaluation pipeline
- Execution fidelity: Does the generated code run without errors?
- Visual similarity: Compute pixel‑level metrics (SSIM, LPIPS) between the model‑produced chart and a reference chart.
- Logical verification: Parse both the reference and model code to check that the intended data transformation (filter, aggregation, axis change) actually occurred.
- Scores from these three axes are aggregated into a single benchmark metric, providing a more reliable “ground truth” than pure LLM‑based judgment.
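The three axes can be approximated in a few lines. The sketch below is illustrative, not the paper's exact metric: the execution check only parses the code rather than running it, the visual metric is mean pixel agreement standing in for SSIM/LPIPS, and the aggregation weights are assumptions:

```python
import ast

def execution_fidelity(code: str) -> float:
    """1.0 if the code parses (a stand-in for running it without errors)."""
    try:
        ast.parse(code)
        return 1.0
    except SyntaxError:
        return 0.0

def visual_similarity(img_a, img_b) -> float:
    """Mean pixel agreement on same-sized grayscale grids (toy stand-in for SSIM/LPIPS)."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    diff = sum(abs(a - b) / 255 for a, b in zip(flat_a, flat_b))
    return 1.0 - diff / len(flat_a)

def logical_verification(ref_code: str, model_code: str) -> float:
    """Check that every method call in the reference edit (e.g. set_xscale)
    also appears in the model's code."""
    def calls(code):
        return {n.func.attr for n in ast.walk(ast.parse(code))
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)}
    return 1.0 if calls(ref_code) <= calls(model_code) else 0.0

def benchmark_score(code, ref_code, img, ref_img, weights=(0.4, 0.3, 0.3)):
    parts = (execution_fidelity(code),
             visual_similarity(img, ref_img),
             logical_verification(ref_code, code))
    return sum(w * p for w, p in zip(weights, parts))
```

The key design point survives even in this toy version: a model can produce a chart that looks right (high visual score) while failing the logical check, or vice versa, which is what a single LLM-as-judge score tends to blur.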
Results & Findings
| Model | Avg. Success (single‑turn) | Avg. Success (multi‑turn, 5 steps) | Main failure mode |
|---|---|---|---|
| GPT‑4‑V (vision‑enabled) | 87 % | 58 % | Execution crashes on data filters |
| LLaVA‑13B | 71 % | 42 % | Mis‑interpreting prior edits, losing context |
| MiniGPT‑4 | 65 % | 35 % | Syntax errors in generated code |
- Stylistic edits (color, font, layout) remain relatively robust across turns, with >80 % success even after five interactions.
- Data‑centric edits (filtering, re‑aggregating, changing data source) suffer steep degradation; error accumulation leads to broken pipelines in >40 % of multi‑turn runs.
- The integrated evaluation framework shows that pure LLM‑as‑judge scores overestimate performance by ~15 % because they miss execution failures.
Practical Implications
- Developer tooling: Building IDE‑like assistants for data visualization (e.g., “Chat‑with‑your‑chart”) will need mechanisms for state tracking and error recovery to avoid cascading failures.
- Business intelligence platforms: Embedding MLLMs for conversational chart tweaking can accelerate exploratory analysis, but teams should guard against silent execution errors—automated sanity checks become essential.
- Low‑code/no‑code environments: The benchmark highlights that while MLLMs can handle visual polish automatically, they still need tighter integration with data processing back‑ends to reliably perform substantive data transformations.
- Model training: Future pre‑training or fine‑tuning pipelines should incorporate multi‑turn interaction data and explicit code execution feedback loops to improve robustness.
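The automated sanity checks mentioned above can be as simple as running generated code in an isolated subprocess and surfacing failures instead of swallowing them. A minimal sketch, assuming the generated code is a self-contained Python script:

```python
import pathlib
import subprocess
import sys
import tempfile

def safe_render(chart_code: str, timeout_s: int = 10):
    """Run model-generated chart code in an isolated subprocess.

    Returns (ok, stderr) so the caller can detect crashes or hangs
    rather than silently accepting a broken edit.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = pathlib.Path(tmp) / "chart.py"
        script.write_text(chart_code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stderr
```

A conversational charting assistant could call this after every model turn and roll back to the last known-good code whenever `ok` is false, preventing the cascading failures the benchmark documents.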
Limitations & Future Work
- Dataset scope: ChartEditBench focuses on a limited set of chart types and uses primarily Python/Matplotlib and Vega‑Lite; extending to other libraries (e.g., D3.js, Plotly) would broaden applicability.
- Synthetic bias: Although real‑world tables are included, a large portion of the data is synthetic, which may not capture all quirks of messy production datasets.
- Human verification size: Only a subset of edit chains is fully human‑validated, leaving a small risk of annotation noise in the larger set.
- Future directions: The authors suggest exploring interactive debugging where the model can ask clarification questions, integrating retrieval‑augmented data context, and expanding the benchmark to cover dashboard‑level multi‑chart editing.
Authors
- Manav Nitin Kapadnis
- Lawanya Baghel
- Atharva Naik
- Carolyn Rosé
Paper Information
- arXiv ID: 2602.15758v1
- Categories: cs.CL, cs.AI
- Published: February 17, 2026