[Paper] Visual-ERM: Reward Modeling for Visual Equivalence

Published: March 13, 2026 at 01:58 PM EDT

Source: arXiv - 2603.13224v1

Overview

The paper introduces Visual‑ERM, a reward‑modeling framework that evaluates the output of vision‑to‑code systems (e.g., turning a chart image into executable code) directly in the rendered visual space. By providing fine‑grained, interpretable feedback, Visual‑ERM enables effective reinforcement learning (RL) for large vision‑language models (LVLMs) and narrows the gap between visual fidelity and code correctness.

Key Contributions

Visual Equivalence Reward Model (Visual‑ERM): a multimodal generative reward model that judges output code by rendering it and comparing the resulting image to the ground‑truth visual input.
Task‑agnostic fine‑grained feedback: unlike prior text‑only or coarse embedding rewards, Visual‑ERM captures pixel‑level discrepancies and offers interpretable error signals.
RL integration with LVLMs: applied to Qwen3‑VL‑8B‑Instruct, yielding +8.4 points on chart‑to‑code and average gains of +2.7 and +4.1 points on table and SVG parsing, respectively.
Reflection & revision at inference: the model can self‑critique and iteratively improve its outputs without extra training.
VC‑RewardBench: a new benchmark for measuring fine‑grained visual equivalence across structured visual data, on which Visual‑ERM (8B) surpasses the much larger Qwen3‑VL‑235B‑Instruct.
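
To make "pixel‑level discrepancies" and "interpretable error signals" concrete, here is a minimal hand‑written sketch. It is not the paper's learned reward model: it assumes both images have already been rendered to same‑size grayscale grids, and `tile_mismatches`, `tile`, and `threshold` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels in [0, 255], row-major


def tile_mismatches(
    rendered: Image,
    reference: Image,
    tile: int = 2,
    threshold: float = 10.0,
) -> List[Tuple[int, int, float]]:
    """Return (tile_row, tile_col, mean_abs_diff) for tiles whose error exceeds threshold."""
    if len(rendered) != len(reference) or any(
        len(a) != len(b) for a, b in zip(rendered, reference)
    ):
        raise ValueError("images must have identical dimensions")

    height = len(reference)
    width = len(reference[0]) if height else 0
    flagged = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Absolute per-pixel differences inside the current tile.
            diffs = [
                abs(rendered[y][x] - reference[y][x])
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            ]
            mean_diff = sum(diffs) / len(diffs)
            if mean_diff > threshold:
                flagged.append((top // tile, left // tile, mean_diff))
    return flagged


# Toy 4x4 "reference render" with a bright block in the top-right corner.
reference = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Toy "model output render" with two wrong pixels in that block.
rendered = [row[:] for row in reference]
rendered[0][2] = 0
rendered[1][3] = 55

print(tile_mismatches(rendered, reference))  # [(0, 1, 113.75)]
```

The returned tile coordinates localize where the two renders disagree, which is the kind of signal a scalar text‑similarity score cannot provide.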

Methodology

Data Preparation – Collect paired datasets of structured visual inputs (charts, tables, SVGs) and their corresponding source code (e.g., Matplotlib, HTML/CSS).
Reward Model Architecture – Visual‑ERM combines a vision encoder (to process rendered images) with a language decoder (to generate a scalar reward). It is trained to predict a high reward when the rendered output matches the reference image and a low reward otherwise.
Fine‑grained Supervision – The loss incorporates pixel‑level similarity (e.g., SSIM), perceptual features (e.g., CLIP embeddings), and a learned “visual equivalence” head that highlights specific mismatches (missing axis labels, wrong colors, mis‑aligned cells).
RL Loop – The LVLM (Qwen3‑VL‑8B‑Instruct) generates code conditioned on an input image. The code is rendered, fed to Visual‑ERM, and the predicted reward guides policy gradient updates (PPO).
Reflection & Revision – At test time, the model queries Visual‑ERM for a “critique” of its own output, then revises the code iteratively until the reward plateaus.
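
The pixel‑level similarity term named in the supervision step (SSIM) can be illustrated with a short sketch. This is a simplified single‑window ("global") variant computed over flattened grayscale pixels, whereas standard SSIM averages the same statistic over local windows. How the paper weights or combines this term is not specified in this summary; `global_ssim` is a hypothetical helper, and the constants 0.01 and 0.03 are the conventional SSIM defaults.

```python
from typing import Sequence


def global_ssim(
    x: Sequence[float], y: Sequence[float], dynamic_range: float = 255.0
) -> float:
    """Single-window SSIM between two equally sized, flattened grayscale images."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Stabilizing constants from the usual SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    luminance_and_contrast = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    normalizer = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return luminance_and_contrast / normalizer


reference_pixels = [0, 255, 0, 255]
identical = global_ssim(reference_pixels, reference_pixels)
inverted = global_ssim(reference_pixels, [255, 0, 255, 0])
print(identical, inverted)
```

An identical render scores 1, while a render with inverted structure scores below zero, so the value can serve as one dense component of a visual reward.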

Results & Findings

Task	Baseline (Supervised)	Visual‑ERM RL	Δ (pts)
Chart‑to‑code	71.2	79.6	+8.4
Table parsing	68.5	71.2	+2.7
SVG parsing	63.8	67.9	+4.1
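
The Δ column can be checked directly from the two score columns. The short sketch below recomputes the absolute gains and, as an added convenience not reported in the table, the gain relative to the baseline; the dictionary keys and the `gains` helper are illustrative names.

```python
# (baseline, Visual-ERM RL) scores copied from the table above.
results = {
    "chart_to_code": (71.2, 79.6),
    "table": (68.5, 71.2),
    "svg": (63.8, 67.9),
}


def gains(baseline: float, rl: float) -> tuple:
    """Return (absolute gain in points, gain relative to baseline in percent)."""
    delta = round(rl - baseline, 1)
    relative = round(100 * (rl - baseline) / baseline, 1)
    return delta, relative


for task, (baseline, rl) in results.items():
    delta, relative = gains(baseline, rl)
    print(f"{task}: +{delta} pts ({relative}% relative to baseline)")
```

The absolute gains reproduce the table (+8.4, +2.7, +4.1), which correspond to roughly 11.8%, 3.9%, and 6.4% relative improvement.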

On VC‑RewardBench, Visual‑ERM (8B) outperformed Qwen3‑VL‑235B‑Instruct by a 12‑point margin and approached the performance of leading closed‑source models (e.g., GPT‑4V).
Ablation studies show that removing the pixel‑level loss drops RL gains by more than 50%, confirming the necessity of fine‑grained visual signals.
The reflection/revision step adds an extra 1.5–2.0% boost without retraining.
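
The reflection/revision step described under Methodology (revise until the reward plateaus) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's procedure: `score` stands in for a Visual‑ERM reward query, `revise` stands in for "critique, then rewrite the code", and `min_gain` and `max_rounds` are hypothetical stopping parameters.

```python
from typing import Callable, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def revise_until_plateau(
    initial: Candidate,
    score: Callable[[Candidate], float],
    revise: Callable[[Candidate], Candidate],
    min_gain: float = 1e-3,
    max_rounds: int = 5,
) -> Tuple[Candidate, float, int]:
    """Keep revising while each round improves the reward by at least min_gain."""
    best, best_reward = initial, score(initial)
    rounds = 0
    for _ in range(max_rounds):
        candidate = revise(best)
        reward = score(candidate)
        rounds += 1
        if reward - best_reward < min_gain:
            break  # reward has plateaued; keep the previous best
        best, best_reward = candidate, reward
    return best, best_reward, rounds


# Toy stand-ins: the "code" is an integer and the ideal output is 10.
TARGET = 10


def toy_score(value: int) -> float:
    return -abs(TARGET - value)  # higher is better, 0 is a perfect match


def toy_revise(value: int) -> int:
    return value + min(4, TARGET - value)  # move at most 4 steps toward the target


result = revise_until_plateau(0, toy_score, toy_revise)
print(result)  # (10, 0, 4)
```

The loop stops on the fourth round because the last revision no longer improves the reward, which mirrors the "until the reward plateaus" criterion while bounding the extra inference cost.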

Practical Implications

Developer tools: IDE plugins that auto‑generate chart or UI code from screenshots can rely on RL‑fine‑tuned models that are optimized for visual fidelity, reducing manual tweaking.
Data pipelines: Automated extraction of tables or SVGs from PDFs can be made more reliable, cutting downstream cleaning costs.
Low‑resource deployment: Visual‑ERM achieves strong results with an 8B model, making it feasible to run on commodity GPUs for SaaS products.
Iterative design assistants: The reflection/revision capability enables “design‑in‑the‑loop” assistants that suggest improvements until the visual output matches the designer’s intent.

Limitations & Future Work

Render dependency: The reward requires a deterministic rendering engine; variations across browsers or graphics libraries could affect consistency.
Computation overhead: Rendering and evaluating each candidate during RL adds latency, which may be prohibitive for real‑time applications.
Scope of visual structures: The current benchmark focuses on charts, tables, and SVGs; extending to more complex layouts (e.g., dashboards) remains open.
Generalization: While task‑agnostic, Visual‑ERM still benefits from domain‑specific fine‑tuning; future work could explore zero‑shot visual equivalence across unseen visual domains.

Authors

Ziyu Liu
Shengyuan Ding
Xinyu Fang
Xuanlang Dai
Penghui Yang
Jianze Liang
Jiaqi Wang
Kai Chen
Dahua Lin
Yuhang Zang

Paper Information

arXiv ID: 2603.13224v1
Categories: cs.CV, cs.AI
Published: March 13, 2026