[Paper] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Published: February 19, 2026 at 12:11 PM EST
4 min read
Source: arXiv

Overview

The paper introduces RetouchIQ, a system that lets multimodal large language models (MLLMs) act as intelligent assistants for professional image‑retouching tools. By combining instruction‑following language capabilities with a novel “generalist” reward model, the framework can translate high‑level user requests (e.g., “make the portrait look softer”) into concrete, executable edits inside real photo‑editing software, while learning to improve its actions through reinforcement learning.

Key Contributions

  • Instruction‑to‑action pipeline: Converts natural‑language editing intents into precise parameter settings for standard image‑editing operations (exposure, contrast, hue, etc.).
  • Generalist reward model: An RL‑fine‑tuned MLLM that generates case‑specific evaluation metrics and produces scalar feedback, moving beyond brittle pixel‑wise similarity scores.
  • Large curated dataset: 190K instruction‑reasoning pairs covering diverse retouching scenarios, released as a new benchmark for instruction‑based image editing.
  • RL‑driven fine‑tuning: Uses the reward model to provide high‑quality gradients, enabling the MLLM agent to learn optimal tool‑use plans without needing explicit ground‑truth edit parameters.
  • Empirical gains: Demonstrates substantial improvements in semantic consistency (the edit matches the instruction) and perceptual quality over prior MLLM‑based and diffusion‑based editing approaches.
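
To make the instruction‑to‑action idea concrete, here is a minimal sketch of what a structured edit plan and its flattening into executable commands might look like. The JSON schema and field names (`tool`, `value`, `rationale`) are invented for this illustration, not the paper's actual format.

```python
import json

# A hypothetical reasoning output from the agent: each step names a
# standard editing tool, a parameter value, and the rationale behind it.
plan_json = """
[
  {"tool": "exposure", "value": 0.3,  "rationale": "brighten the underexposed face"},
  {"tool": "contrast", "value": -0.1, "rationale": "soften harsh shadows"},
  {"tool": "hue",      "value": 5,    "rationale": "warm the skin tones slightly"}
]
"""

def to_commands(plan: str) -> list:
    """Flatten a reasoning plan into executable (tool, value) commands,
    dropping the rationale, which is kept only for explainability."""
    return [(step["tool"], step["value"]) for step in json.loads(plan)]

commands = to_commands(plan_json)
```

Keeping the rationale alongside each parameter is what makes the reasoning trace auditable later, even though only the `(tool, value)` pairs are executed.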

Methodology

  1. Instruction Parsing – A base MLLM reads the user’s textual command and produces a structured “reasoning” output that outlines which editing tools are needed and why.
  2. Action Generation – The reasoning is fed into a lightweight controller that maps each suggested tool to concrete parameter values (e.g., Brightness +0.12). These commands are directly executable in Photoshop‑like APIs.
  3. Generalist Reward Model – A separate MLLM, fine‑tuned with reinforcement learning, looks at the original image, the edited result, and the original instruction. It synthesizes a set of evaluation metrics (color fidelity, style adherence, artifact detection) and collapses them into a single scalar reward.
  4. RL Fine‑Tuning Loop – The primary MLLM agent receives the reward signal and updates its policy to produce better reasoning/action sequences. The loop runs entirely on synthetic data from the curated dataset, avoiding costly human annotations.
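
The training loop can be caricatured in a few lines, under heavy simplification: a single scalar edit parameter stands in for the agent's policy, the generalist reward is a stub that prefers edits near a hidden optimum, and a naive accept/reject search replaces the actual policy‑gradient update. This is a toy of the signal flow, not the paper's algorithm.

```python
import random

random.seed(0)

def reward_model(action: float, target: float = 0.4) -> float:
    """Stub for the generalist reward: scores a proposed edit without
    ground-truth parameters, here just (negative) distance to an ideal."""
    return -(action - target) ** 2

def train(steps: int = 200, step_size: float = 0.05) -> float:
    theta = 0.0  # the agent's current edit parameter (e.g., an exposure offset)
    for _ in range(steps):
        candidate = theta + random.choice([-step_size, step_size])
        # keep the candidate only if the reward model scores it higher
        if reward_model(candidate) > reward_model(theta):
            theta = candidate
    return theta

theta = train()
```

The key property the sketch preserves is that the agent never sees the "correct" parameters, only scalar feedback on its proposals, which is exactly why the reward model's quality matters so much in the ablations below.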

The whole pipeline is end‑to‑end trainable yet remains modular: the reward model can be swapped out or extended without retraining the core agent.
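
That modularity claim amounts to the agent depending only on a reward interface. A sketch of such an interface, with names invented here rather than taken from the paper, might look like this; `PixelSimilarityReward` plays the role of the brittle baseline the ablation swaps in.

```python
from typing import Protocol

class RewardModel(Protocol):
    """Anything that scores (original, edited, instruction) triples."""
    def score(self, original: bytes, edited: bytes, instruction: str) -> float:
        ...

class PixelSimilarityReward:
    """Brittle pixel-wise baseline: mean absolute difference, negated."""
    def score(self, original, edited, instruction):
        diff = sum(abs(a - b) for a, b in zip(original, edited))
        return -diff / max(len(original), 1)

class GeneralistReward:
    """Placeholder for the MLLM-based reward; returns a canned score.
    A real implementation would reason over the images and instruction."""
    def score(self, original, edited, instruction):
        return 0.8

def evaluate(reward: RewardModel, original: bytes, edited: bytes, instr: str) -> float:
    # The agent's training code only calls this interface, so reward
    # models can be swapped without touching the agent itself.
    return reward.score(original, edited, instr)
```

Because both reward classes satisfy the same protocol, exchanging one for the other is a one-line change at the call site.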

Results & Findings

  • Semantic Consistency: On the new benchmark, RetouchIQ achieves a 23 % higher instruction‑match score than the strongest diffusion‑based baseline.
  • Perceptual Quality: Human raters preferred RetouchIQ’s outputs 68 % of the time over competing MLLM editors, citing fewer artifacts and more natural tones.
  • Reward Model Effectiveness: Ablation studies show that replacing the generalist reward with a traditional pixel‑wise similarity metric drops performance by ~15 % in both consistency and visual quality, confirming the value of case‑specific reasoning.
  • Execution Fidelity: The generated parameter sets successfully run in Adobe Lightroom/Photoshop APIs with a 99 % success rate, demonstrating that the system produces truly executable edits rather than just image‑to‑image transformations.

Practical Implications

  • Developer‑friendly SDK: Because the output is a list of standard tool commands, developers can embed RetouchIQ into existing photo‑editing pipelines, plugins, or cloud services without re‑implementing the editing engine.
  • Creative Assistants: UI/UX teams can build “smart retouch” buttons that interpret vague user prompts (“make this skin smoother”) and automatically apply the right combination of adjustments, speeding up workflows for photographers, marketers, and social‑media creators.
  • Explainable Automation: The reasoning trace (which tool, why, with what parameter) offers transparency—useful for compliance, audit trails, or teaching novice editors how professional retouching works.
  • Cross‑domain Extensibility: The generalist reward concept can be transferred to other tool‑heavy domains (video color grading, CAD modeling, audio mixing), where subjective quality is hard to capture with fixed metrics.
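
Since the agent's output is just a list of standard tool commands, the embedding an SDK consumer needs is a thin adapter that routes each command to a local editing function. The tool names and adjustment formulas below are invented for illustration; a real integration would call Lightroom/Photoshop APIs instead, and operate on full images rather than a flat list of 8‑bit pixel values.

```python
def adjust_exposure(pixels, value):
    # multiplicative exposure change, clamped to the 8-bit range
    return [min(255, max(0, round(p * (1 + value)))) for p in pixels]

def adjust_contrast(pixels, value):
    # scale each pixel's distance from mid-gray (128)
    return [min(255, max(0, round(128 + (p - 128) * (1 + value)))) for p in pixels]

# Adapter table: maps the agent's tool names onto the local engine.
TOOLS = {"exposure": adjust_exposure, "contrast": adjust_contrast}

def apply_commands(pixels, commands):
    """Run an agent-emitted (tool, value) command list through the engine."""
    for tool, value in commands:
        pixels = TOOLS[tool](pixels, value)
    return pixels

edited = apply_commands([100, 200], [("exposure", 0.1), ("contrast", -0.2)])
```

Unknown tool names would raise a `KeyError` here; a production adapter would validate the command list against the engine's capabilities before executing it.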

Limitations & Future Work

  • Subjectivity of Rewards: Although the reward model learns to generate case‑specific metrics, it still reflects the biases present in its training data and may struggle with highly artistic or culturally nuanced edits.
  • Dataset Coverage: The 190K instruction‑reasoning pairs focus mainly on portrait and landscape retouching; extending to niche domains (e.g., medical imaging, scientific visualization) will require additional data.
  • Real‑Time Constraints: The RL fine‑tuning loop is computationally intensive; deploying the model for on‑device, low‑latency editing remains an open challenge.
  • User Interaction Loop: Current experiments assume a single instruction; future work could explore iterative dialogues where the user refines edits based on intermediate results.

RetouchIQ showcases how marrying large multimodal language models with a flexible, reasoning‑driven reward system can turn vague creative intent into concrete, high‑quality edits—opening a path toward truly intelligent, explainable assistants for professional visual content creation.

Authors

  • Qiucheng Wu
  • Jing Shi
  • Simon Jenni
  • Kushal Kafle
  • Tianyu Wang
  • Shiyu Chang
  • Handong Zhao

Paper Information

  • arXiv ID: 2602.17558v1
  • Categories: cs.CV
  • Published: February 19, 2026