[Paper] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

Published: February 17, 2026 at 12:45 PM EST
4 min read
Source: arXiv


Overview

The paper introduces ChartEditBench, a new benchmark that tests how well multimodal large language models (MLLMs) can handle incremental chart editing—think of a data analyst who repeatedly tweaks a visualization until it tells the right story. While existing models excel at generating a chart in a single shot, this work probes their ability to maintain context across multiple turns, a capability crucial for real‑world exploratory data analysis.

Key Contributions

  • ChartEditBench dataset: 5,000 curated, difficulty‑controlled edit chains (code‑based modifications to charts) with a human‑verified subset for high‑quality evaluation.
  • Grounded multi‑turn evaluation framework: Combines execution‑based fidelity checks, pixel‑level visual similarity metrics, and logical code verification to overcome the shortcomings of “LLM‑as‑judge” scoring.
  • Comprehensive empirical study: Benchmarks several state‑of‑the‑art MLLMs, revealing how performance drops as the number of editing turns grows.
  • Error taxonomy: Identifies where models fail most—data‑centric transformations vs. purely stylistic tweaks—providing a roadmap for future research.
  • Open‑source release: Dataset, evaluation scripts, and baseline results are publicly available to spur community progress.

Methodology

  1. Dataset construction

    • Start from a base set of charts (e.g., bar, line, scatter) generated from synthetic and real‑world tabular data.
    • Define a series of edit intents (e.g., “change X‑axis to log scale”, “filter out rows where sales < 1000”).
    • Chain 3–7 edits together to form a modification sequence, controlling difficulty by varying the amount of code change required.
    • Human annotators verify that each edit is feasible, correctly expressed, and that the final chart matches the intended description.
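The edit chains described above can be sketched as a small data structure. The class and field names here (`EditChain`, `EditStep`, the `difficulty` proxy) are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EditStep:
    """One natural-language edit intent and the code change it implies."""
    intent: str          # e.g. "change X-axis to log scale"
    reference_code: str  # ground-truth code fragment after this edit

@dataclass
class EditChain:
    """A difficulty-controlled sequence of chained edits (3-7 in the paper)."""
    chart_type: str                 # "bar", "line", "scatter", ...
    base_code: str                  # code that renders the starting chart
    steps: list[EditStep] = field(default_factory=list)
    human_verified: bool = False    # only a subset is fully human-validated

    @property
    def difficulty(self) -> int:
        # Crude proxy: longer chains accumulate more required code changes.
        return len(self.steps)

chain = EditChain(
    chart_type="bar",
    base_code="ax.bar(df['month'], df['sales'])",
    steps=[
        EditStep("change Y-axis to log scale", "ax.set_yscale('log')"),
        EditStep("filter out rows where sales < 1000",
                 "df = df[df['sales'] >= 1000]"),
    ],
)
print(chain.difficulty)  # -> 2
```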
  2. Model interaction protocol

    • Each turn the model receives the current chart image, the underlying code (e.g., Python/Matplotlib or Vega‑Lite JSON), and a natural‑language edit request.
    • The model must output revised code; the system then executes it to produce the next chart image.
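The turn-by-turn protocol can be sketched as a simple loop. Here `model` and `execute` are hypothetical callables standing in for the MLLM under test and the chart-execution harness; the chart-image input is omitted for brevity:

```python
from typing import Callable

def edit_loop(
    model: Callable[[str, str], str],   # (current_code, edit_request) -> revised code
    execute: Callable[[str], bool],     # runs code, True if it renders cleanly
    base_code: str,
    requests: list[str],
) -> list[bool]:
    """One benchmark episode: apply each edit request in turn.

    Note that after a failed turn, later turns build on broken code,
    which is how errors accumulate across a multi-turn chain.
    """
    code, outcomes = base_code, []
    for request in requests:
        code = model(code, request)     # model proposes revised code
        outcomes.append(execute(code))  # harness executes and records success
    return outcomes

# Toy stand-ins for illustration only: the "model" appends each request
# as a comment, and the "harness" just checks the comment is present.
toy_model = lambda code, req: code + f"\n# TODO: {req}"
toy_exec = lambda code: "TODO" in code
print(edit_loop(toy_model, toy_exec, "plot()", ["log scale", "filter"]))
# -> [True, True]
```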
  3. Evaluation pipeline

    • Execution fidelity: Does the generated code run without errors?
    • Visual similarity: Compute pixel‑level metrics (SSIM, LPIPS) between the model‑produced chart and a reference chart.
    • Logical verification: Parse both the reference and model code to check that the intended data transformation (filter, aggregation, axis change) actually occurred.
    • Scores from these three axes are aggregated into a single benchmark metric, providing a more reliable “ground truth” than pure LLM‑based judgment.
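A minimal sketch of how the three axes might be combined into one score; the weights and the hard gate on execution failure are illustrative assumptions, not the paper's actual aggregation formula:

```python
def benchmark_score(
    executed_ok: bool,
    visual_similarity: float,   # e.g. SSIM in [0, 1] vs. the reference chart
    logic_ok: bool,             # intended transformation found in the code
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),
) -> float:
    """Aggregate the three evaluation axes into a single score in [0, 1].

    A crashing program scores 0 outright: visual and logical checks are
    meaningless if the code never produced a chart.
    """
    if not executed_ok:
        return 0.0
    w_exec, w_vis, w_logic = weights
    return w_exec * 1.0 + w_vis * visual_similarity + w_logic * float(logic_ok)

print(round(benchmark_score(True, 0.9, True), 2))   # -> 0.97
print(benchmark_score(False, 0.99, True))           # -> 0.0
```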

Results & Findings

| Model | Avg. success (single‑turn) | Avg. success (multi‑turn, 5 steps) | Main failure mode |
|---|---|---|---|
| GPT‑4‑V (vision‑enabled) | 87 % | 58 % | Execution crashes on data filters |
| LLaVA‑13B | 71 % | 42 % | Mis‑interpreting prior edits, losing context |
| MiniGPT‑4 | 65 % | 35 % | Syntax errors in generated code |
  • Stylistic edits (color, font, layout) remain relatively robust across turns, with >80 % success even after five interactions.
  • Data‑centric edits (filtering, re‑aggregating, changing data source) suffer steep degradation; error accumulation leads to broken pipelines in >40 % of multi‑turn runs.
  • The integrated evaluation framework shows that pure LLM‑as‑judge scores overestimate performance by ~15 % because they miss execution failures.

Practical Implications

  • Developer tooling: Building IDE‑like assistants for data visualization (e.g., “Chat‑with‑your‑chart”) will need mechanisms for state tracking and error recovery to avoid cascading failures.
  • Business intelligence platforms: Embedding MLLMs for conversational chart tweaking can accelerate exploratory analysis, but teams should guard against silent execution errors—automated sanity checks become essential.
  • Low‑code/no‑code environments: The benchmark highlights that while MLLMs can handle visual polish automatically, they still need tighter integration with data processing back‑ends to reliably perform substantive data transformations.
  • Model training: Future pre‑training or fine‑tuning pipelines should incorporate multi‑turn interaction data and explicit code execution feedback loops to improve robustness.
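As one concrete form of the automated sanity check suggested above, generated chart code can be run in an isolated subprocess so failures surface to the user instead of passing silently. This is a minimal sketch, not tooling from the paper:

```python
import os
import subprocess
import sys
import tempfile

def sanity_check(chart_code: str, timeout_s: int = 30) -> tuple[bool, str]:
    """Run model-generated chart code in an isolated subprocess.

    Returns (ok, stderr) so a conversational tool can surface the error
    message rather than silently showing a stale chart.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(chart_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    finally:
        os.unlink(path)

ok, err = sanity_check("print('chart rendered')")
print(ok)  # -> True
```

Running in a subprocess (rather than `exec` in-process) isolates crashes and infinite loops from the host application, at the cost of some startup latency per check.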

Limitations & Future Work

  • Dataset scope: ChartEditBench focuses on a limited set of chart types and uses primarily Python/Matplotlib and Vega‑Lite; extending to other libraries (e.g., D3.js, Plotly) would broaden applicability.
  • Synthetic bias: Although real‑world tables are included, a large portion of the data is synthetic, which may not capture all quirks of messy production datasets.
  • Human verification size: Only a subset of edit chains is fully human‑validated, leaving a small risk of annotation noise in the larger set.
  • Future directions: The authors suggest exploring interactive debugging where the model can ask clarification questions, integrating retrieval‑augmented data context, and expanding the benchmark to cover dashboard‑level multi‑chart editing.

Authors

  • Manav Nitin Kapadnis
  • Lawanya Baghel
  • Atharva Naik
  • Carolyn Rosé

Paper Information

  • arXiv ID: 2602.15758v1
  • Categories: cs.CL, cs.AI
  • Published: February 17, 2026