[Paper] Visual-ERM: Reward Modeling for Visual Equivalence
Source: arXiv - 2603.13224v1
Overview
The paper introduces Visual‑ERM, a reward‑modeling framework that evaluates the quality of vision‑to‑code systems (e.g., turning a chart image into executable code) directly in the rendered visual space. By providing fine‑grained, interpretable feedback, Visual‑ERM enables effective reinforcement learning (RL) for large vision‑language models, closing the gap between visual fidelity and code correctness.
Key Contributions
- Visual Equivalence Reward Model (Visual‑ERM): a multimodal generative reward model that judges output code by rendering it and comparing the resulting image to the ground‑truth visual input.
- Task‑agnostic fine‑grained feedback: unlike prior text‑only or coarse embedding rewards, Visual‑ERM captures pixel‑level discrepancies and offers interpretable error signals.
- RL integration with LVLMs: applied to Qwen3‑VL‑8B‑Instruct, yielding +8.4 points on chart‑to‑code, and +2.7 / +4.1 average gains on table and SVG parsing.
- Reflection & revision at inference: the model can self‑critique and iteratively improve its outputs without extra training.
- VC‑RewardBench: a new benchmark for measuring fine‑grained visual equivalence across structured visual data, on which Visual‑ERM (8B) surpasses the far larger Qwen3‑VL‑235B‑Instruct.
Methodology
- Data Preparation – Collect paired datasets of structured visual inputs (charts, tables, SVGs) and their corresponding source code (e.g., Matplotlib, HTML/CSS).
- Reward Model Architecture – Visual‑ERM combines a vision encoder (to process rendered images) with a language decoder (to generate a scalar reward). It is trained to predict a high reward when the rendered output matches the reference image and a low reward otherwise.
- Fine‑grained Supervision – The loss incorporates pixel‑level similarity (e.g., SSIM), perceptual features (e.g., CLIP embeddings), and a learned “visual equivalence” head that highlights specific mismatches (missing axis labels, wrong colors, mis‑aligned cells).
- RL Loop – The LVLM (Qwen3‑VL‑8B‑Instruct) generates code conditioned on an input image. The code is rendered, fed to Visual‑ERM, and the predicted reward guides policy gradient updates (PPO).
- Reflection & Revision – At test time, the model queries Visual‑ERM for a “critique” of its own output, then revises the code iteratively until the reward plateaus.
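The pixel‑level similarity term mentioned under Fine‑grained Supervision can be illustrated with a small sketch. This is not the paper's implementation: it assumes grayscale images given as nested lists of pixel values, uses a single‑window (global) SSIM rather than the usual sliding‑window variant, and the names `global_ssim` and `pixel_reward` are hypothetical.

```python
from statistics import fmean


def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM for two equal-sized grayscale images (lists of rows).

    Assumption: one global window instead of the sliding Gaussian window
    used by standard SSIM implementations.
    """
    a = [float(p) for row in img_a for p in row]
    b = [float(p) for row in img_b for p in row]
    if not a or len(a) != len(b):
        raise ValueError("images must be non-empty and the same size")

    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean([(x - mu_a) ** 2 for x in a])
    var_b = fmean([(y - mu_b) ** 2 for y in b])
    cov = fmean([(x - mu_a) * (y - mu_b) for x, y in zip(a, b)])

    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def pixel_reward(reference, rendered):
    """Hypothetical mapping from SSIM to a reward in [0, 1]."""
    return max(0.0, global_ssim(reference, rendered))


# Toy 2x2 "renders": an exact match and a colour-inverted mismatch.
reference = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
match_reward = pixel_reward(reference, reference)
mismatch_reward = pixel_reward(reference, inverted)
```

In this toy case the exact match scores close to 1 and the inverted image is clamped to 0. A scalar like this gives no indication of which element is wrong, which is the gap the learned, interpretable feedback is meant to fill.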
Results & Findings
| Task | Baseline (Supervised) | Visual‑ERM RL | Δ (pts) |
|---|---|---|---|
| Chart‑to‑code | 71.2 | 79.6 | +8.4 |
| Table parsing | 68.5 | 71.2 | +2.7 |
| SVG parsing | 63.8 | 67.9 | +4.1 |
- On VC‑RewardBench, Visual‑ERM (8B) outperformed Qwen3‑VL‑235B‑Instruct by a 12‑point margin and approached the performance of leading closed‑source models (e.g., GPT‑4V).
- Ablation studies show that removing the pixel‑level loss cuts the RL gains by more than 50%, indicating that fine‑grained visual signals are essential.
- The reflection/revision step adds a further 1.5 to 2.0% improvement without retraining.
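The test‑time reflection and revision procedure (critique, revise, stop when the reward plateaus) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' code: `critique` stands in for a call to Visual‑ERM, `revise` stands in for the LVLM, and `max_rounds` and `min_gain` are hypothetical stopping parameters.

```python
from typing import Callable, Tuple


def revise_until_plateau(
    code: str,
    critique: Callable[[str], Tuple[float, str]],  # stand-in for the reward model
    revise: Callable[[str, str], str],             # stand-in for the LVLM
    max_rounds: int = 5,
    min_gain: float = 1e-3,
) -> Tuple[str, float]:
    """Revise `code` until the reward stops improving by at least `min_gain`."""
    best_code = code
    best_reward, feedback = critique(code)
    for _ in range(max_rounds):
        candidate = revise(best_code, feedback)
        reward, candidate_feedback = critique(candidate)
        if reward - best_reward < min_gain:
            break  # plateau: keep the best version seen so far
        best_code, best_reward, feedback = candidate, reward, candidate_feedback
    return best_code, best_reward


# Toy stand-ins: each "TODO" marker counts as one visual mismatch.
def toy_critique(code: str) -> Tuple[float, str]:
    mismatches = code.count("TODO")
    return 1.0 / (1 + mismatches), f"{mismatches} mismatch(es) remaining"


def toy_revise(code: str, feedback: str) -> str:
    return code.replace("TODO", "", 1)  # fix one mismatch per round


final_code, final_reward = revise_until_plateau(
    "TODO TODO plot", toy_critique, toy_revise
)
```

Keeping the best‑scoring candidate, rather than the latest one, means a revision that lowers the reward is discarded instead of replacing a better earlier output.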
Practical Implications
- Developer tools: IDE plugins that auto‑generate chart or UI code from screenshots can build on RL‑fine‑tuned models optimized for visual fidelity, reducing manual tweaking.
- Data pipelines: Automated extraction of tables or SVGs from PDFs can be made more reliable, cutting downstream cleaning costs.
- Low‑resource deployment: Visual‑ERM achieves strong results with an 8B model, making it feasible to run on commodity GPUs for SaaS products.
- Iterative design assistants: The reflection/revision capability enables “design‑in‑the‑loop” assistants that suggest improvements until the visual output matches the designer’s intent.
Limitations & Future Work
- Render dependency: The reward requires a deterministic rendering engine; variations across browsers or graphics libraries could affect consistency.
- Computation overhead: Rendering and evaluating each candidate during RL adds latency, which may be prohibitive for real‑time applications.
- Scope of visual structures: The current benchmark focuses on charts, tables, and SVGs; extending to more complex layouts (e.g., dashboards) remains open.
- Generalization: While task‑agnostic, Visual‑ERM still benefits from domain‑specific fine‑tuning; future work could explore zero‑shot visual equivalence across unseen visual domains.
Authors
- Ziyu Liu
- Shengyuan Ding
- Xinyu Fang
- Xuanlang Dai
- Penghui Yang
- Jianze Liang
- Jiaqi Wang
- Kai Chen
- Dahua Lin
- Yuhang Zang
Paper Information
- arXiv ID: 2603.13224v1
- Categories: cs.CV, cs.AI
- Published: March 13, 2026