[Paper] Do Large Language Models Understand Data Visualization Rules?

Published: February 23, 2026
4 min read
Source: arXiv


Overview

The paper investigates whether large language models (LLMs) can understand and enforce the design rules that make data visualizations clear and trustworthy. By comparing LLMs against a rule‑checking system called Draco, the authors provide the first systematic, “hard‑verification” benchmark for LLM‑based visualization validation.

Key Contributions

  • Benchmark creation – 2,000 Vega‑Lite chart specifications annotated with explicit rule violations, derived from Draco’s constraint set.
  • Natural‑language translation pipeline – Converted formal ASP (Answer Set Programming) constraints into plain English prompts, enabling LLMs to reason about rules.
  • Comprehensive evaluation – Measured both accuracy (detecting violations) and prompt adherence (producing output in the required structured format) across several frontier models (Gemma‑3 4B/27B, GPT‑oss 20B).
  • Insightful performance analysis – Showed strong results on syntactic/semantic rules (F1 ≈ 0.82) but severe drops on subtle perceptual rules (F1 < 0.15).
  • Guidelines for model‑prompt design – Demonstrated that natural‑language phrasing of constraints can boost smaller models’ performance by up to 150 %.

Methodology

  1. Rule selection & formalization – The authors took a subset of Draco’s 150+ constraints (covering axis labeling, color encoding, mark selection, etc.) and expressed each as an ASP rule, which serves as a gold‑standard verifier.
  2. Dataset generation – Starting from a pool of valid Vega‑Lite specs, they programmatically introduced single‑rule violations (e.g., missing axis title, using a non‑perceptually‑distinct color palette). Each spec was labeled with the exact rule(s) broken.
  3. Prompt design – For each rule, a natural‑language description was crafted (e.g., “The x‑axis must have a descriptive title”). Two prompt styles were tested: a direct translation of the ASP clause vs. a more conversational phrasing.
  4. Model evaluation – LLMs received the Vega‑Lite JSON and the rule description, then were asked to output a JSON object indicating “valid”/“invalid” and, if invalid, the violated rule(s). Accuracy (precision/recall) and adherence (whether the output matched the JSON schema) were recorded.
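The four steps above can be sketched end to end. This is a minimal illustration, not the authors' released code: the function names (`inject_violation`, `build_prompt`, `parse_verdict`), the rule ID, and the sample spec are our own assumptions about how such a pipeline could look.

```python
import copy
import json

# Hypothetical sketch of the paper's pipeline; all names and the sample
# spec are illustrative, not taken from the released benchmark.
VALID_SPEC = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal",
              "axis": {"title": "Product category"}},
        "y": {"field": "sales", "type": "quantitative",
              "axis": {"title": "Sales (USD)"}},
    },
}

def inject_violation(spec, rule_id="missing_x_axis_title"):
    """Introduce a single-rule violation into a valid Vega-Lite spec."""
    broken = copy.deepcopy(spec)
    if rule_id == "missing_x_axis_title":
        broken["encoding"]["x"].pop("axis", None)
    return broken, [rule_id]  # spec plus gold-standard label

def build_prompt(spec, rule_text):
    """Pair the spec JSON with a natural-language rule description."""
    return (
        f"Rule: {rule_text}\n"
        f"Chart spec:\n{json.dumps(spec, indent=2)}\n"
        'Answer with JSON: {"valid": true/false, "violations": [...]}'
    )

def parse_verdict(model_output):
    """Parse the model's structured reply; None if it broke the schema."""
    try:
        obj = json.loads(model_output)
        assert isinstance(obj["valid"], bool)
        return obj
    except (ValueError, KeyError, AssertionError):
        return None  # counts against prompt adherence

broken_spec, gold = inject_violation(VALID_SPEC)
prompt = build_prompt(broken_spec, "The x-axis must have a descriptive title")
# The actual LLM call would go here; we simulate a correct reply:
reply = '{"valid": false, "violations": ["missing_x_axis_title"]}'
verdict = parse_verdict(reply)
```

A real run would replace the simulated `reply` with the model's response, then compare `verdict["violations"]` against the gold label to score accuracy.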

Results & Findings

| Model | Prompt adherence | Best F1 (syntactic rules) | Worst F1 (perceptual rules) |
|---|---|---|---|
| Gemma‑3 27B | 100 % | 0.82 | 0.12 |
| Gemma‑3 4B | 100 % | 0.78 | 0.09 |
| GPT‑oss 20B | 98 % | 0.80 | 0.15 |
  • High adherence: All models reliably produced correctly‑structured JSON responses, confirming that LLMs can follow strict output formats when prompted.
  • Rule‑type disparity: Models excelled at syntactic constraints (e.g., presence of axis titles, correct data types) but struggled with perceptual constraints that require visual reasoning (e.g., “avoid using red‑green color pairs for categorical data”).
  • Prompt impact: Translating ASP constraints into plain English boosted the smaller 4B model’s F1 by ~150 % for several rule categories, indicating that prompt clarity matters more for limited‑capacity models.
  • ASP‑derived vs. natural language: When the prompt directly echoed the ASP formulation, performance dropped across the board, suggesting that LLMs are better at reasoning over human‑readable descriptions than formal logic strings.
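The two metrics reported above (adherence and F1 over violated-rule sets) can be computed with a few lines. This is a plausible sketch of the scoring, not the paper's exact evaluation script; micro-averaging is our assumption.

```python
# Illustrative metric functions; names and the micro-averaging choice
# are assumptions, not confirmed details of the paper's evaluation.

def adherence_rate(outputs):
    """Fraction of model replies that parsed into the required schema."""
    return sum(o is not None for o in outputs) / len(outputs)

def f1(gold, pred):
    """Micro-averaged F1 over per-spec sets of violated-rule IDs."""
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold, pred))
    fp = sum(len(set(p) - set(g)) for g, p in zip(gold, pred))
    fn = sum(len(set(g) - set(p)) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [["missing_x_axis_title"], [], ["bad_color_palette"]]
pred = [["missing_x_axis_title"], [], []]
score = f1(gold, pred)  # tp=1, fp=0, fn=1 -> precision 1.0, recall 0.5
```

With one true positive and one miss, the score comes out to 2/3, illustrating how a single undetected perceptual violation already pulls F1 down sharply on small rule sets.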

Practical Implications

  • LLM‑driven chart validators – Developers can embed an LLM (e.g., a locally‑run Gemma‑3) into data‑pipeline tooling to automatically flag obvious design violations before charts are rendered, reducing the need for hand‑crafted rule engines.
  • Rapid prototyping – Because LLMs require only natural‑language prompts, teams can extend validation to new design guidelines without writing new symbolic constraints, accelerating UI/UX iteration cycles.
  • Hybrid systems – The stark contrast between syntactic and perceptual performance suggests a practical architecture: use an LLM for quick, high‑recall checks of structural rules, and fall back to a symbolic solver (like Draco) for the more nuanced perceptual checks.
  • Developer tooling – IDE extensions or CI/CD hooks could automatically scan Vega‑Lite (or Altair, Plotly) specifications, returning JSON reports that integrate seamlessly with existing linting workflows.
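As a concrete shape for such a hook, the sketch below scans Vega-Lite spec files and emits a JSON-serializable report of structural violations. The file layout (`*.vl.json`), rule set, and function names are assumptions for illustration; in the hybrid architecture described above, an LLM or a solver like Draco would replace the hand-written check.

```python
# Illustrative CI lint hook; rule set and file conventions are assumed,
# not prescribed by the paper.
import json
from pathlib import Path

def lint_spec(spec):
    """Flag simple structural violations in one Vega-Lite spec."""
    violations = []
    for channel in ("x", "y"):
        enc = spec.get("encoding", {}).get(channel)
        if enc is not None and "title" not in enc.get("axis", {}):
            violations.append(f"missing_{channel}_axis_title")
    return violations

def lint_directory(root):
    """Build a {filename: [violations]} report for every spec under root."""
    report = {}
    for path in Path(root).glob("*.vl.json"):
        spec = json.loads(path.read_text())
        report[path.name] = lint_spec(spec)
    return report
```

A CI job could fail the build whenever any report entry is non-empty, giving chart specs the same gate that code linters already provide.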

Limitations & Future Work

  • Scope of rules – Only a subset of Draco’s constraints was evaluated; many advanced perceptual rules remain untested.
  • Model size vs. cost – While 27B‑parameter models performed best, they may be prohibitive for on‑device or low‑latency use cases.
  • Visual reasoning gap – LLMs lack direct access to rendered images, limiting their ability to assess visual properties that depend on pixel‑level perception.
  • Future directions – The authors propose (1) coupling LLMs with image‑based perception models, (2) expanding the benchmark to cover multi‑rule violations, and (3) exploring few‑shot prompting strategies to improve perceptual rule detection without increasing model size.

Authors

  • Martin Sinnona
  • Valentin Bonas
  • Emmanuel Iarussi
  • Viviana Siless

Paper Information

  • arXiv ID: 2602.20137v1
  • Categories: cs.CV
  • Published: February 23, 2026
