[Paper] AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Published: April 17, 2026
4 min read
Source: arXiv (2604.16158v1)

Overview

The paper AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency tackles a subtle but critical problem in modern large‑language‑model (LLM) pipelines: the chain‑of‑thought (CoT) explanations that accompany a model’s answer often look convincing without actually influencing the final prediction. The authors propose a reinforcement‑learning (RL) framework that teaches the model to generate faithful reasoning traces—i.e., explanations whose tokens truly matter for the answer.

Key Contributions

  • Differentiable attention mask that learns to highlight the CoT tokens most responsible for the model’s final answer.
  • Saliency‑based reward derived from the attention mask, encouraging the model to produce reasoning that genuinely drives the outcome.
  • Integration with GRPO (Group Relative Policy Optimization) to jointly optimize for answer correctness and explanation faithfulness.
  • Empirical validation on two benchmark suites (GSM8K math problems and MMLU knowledge tasks) using Llama‑3.2‑3B‑Instruct, showing measurable gains in both accuracy and interpretability.
  • Open‑source implementation (released with the paper) that can be plugged into existing instruction‑tuned LLM pipelines.

Methodology

  1. Baseline CoT Generation – The model first produces a standard chain‑of‑thought (a sequence of reasoning steps) followed by a final answer token.
  2. Additive Attention Mask – An auxiliary network learns a soft mask over the CoT tokens. The mask is added to the model’s internal attention scores, effectively “turning up” the influence of selected tokens.
  3. Saliency Reward – After a forward pass, the authors compute how much the masked attention changes the probability of the correct answer. Tokens that cause a larger positive shift receive higher saliency scores, which are summed into a reward signal.
  4. Outcome Reward – A conventional correctness reward (e.g., +1 for a right answer, 0 otherwise) is also computed.
  5. Joint Optimization with GRPO – Both rewards are fed into the GRPO algorithm, a policy‑gradient method that normalizes rewards within a group of sampled completions (so no learned value function is needed) and can balance multiple, possibly competing, objectives. The model’s parameters and the attention‑mask network are updated simultaneously.
  6. Training Loop – The process repeats across many examples, gradually shaping the model to prefer reasoning steps that are both correct and causally linked to the answer.

The whole pipeline stays fully differentiable, so it can be trained end‑to‑end without needing external supervision for the saliency map.
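Steps 2–3 can be sketched with a toy numpy example. Everything here is illustrative, not the paper’s implementation: the attention scores, the learned mask values, and the scalar “answer readout” are made‑up numbers standing in for the model’s internals. The key mechanics are real, though: the mask is *added* to pre‑softmax attention scores, and the saliency reward is the resulting shift in the correct‑answer probability.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: raw attention scores from the answer position over 5 CoT tokens,
# and a hypothetical learned additive mask over the same tokens.
attn_scores = np.array([0.2, 1.5, 0.1, 0.9, 0.3])   # pre-softmax scores
learned_mask = np.array([0.0, 2.0, 0.0, 0.0, 0.0])  # mask boosts token 1

# Step 2: the mask is added to the attention scores before the softmax.
base_attn = softmax(attn_scores)
masked_attn = softmax(attn_scores + learned_mask)

# Toy readout: pretend the correct-answer logit is an attention-weighted sum
# of per-token values (a stand-in for the rest of the forward pass).
token_values = np.array([0.1, 0.8, -0.2, 0.3, 0.0])  # hypothetical
p_base = 1 / (1 + np.exp(-(base_attn @ token_values)))
p_masked = 1 / (1 + np.exp(-(masked_attn @ token_values)))

# Step 3: saliency reward = how much the masked attention shifts the
# probability of the correct answer. A positive shift means the boosted
# token genuinely drives the outcome.
saliency_reward = p_masked - p_base
print(round(float(saliency_reward), 4))
```

Because every operation above (addition, softmax, dot product) is differentiable, gradients flow from the reward back into the mask network, which is what lets the pipeline train end‑to‑end.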

Results & Findings

| Dataset | Base Llama‑3.2‑3B‑Instruct | AtManRL (ours) |
| --- | --- | --- |
| GSM8K (math) | 45.2 % exact match | 48.9 % (+3.7 pts) |
| MMLU (multi‑subject) | 38.5 % | 41.2 % (+2.7 pts) |

  • Saliency detection: Visualizations show the learned mask consistently highlights the pivotal arithmetic operation or factual statement that determines the answer.
  • Interpretability boost: Human evaluators rated AtManRL’s explanations as more “trustworthy” (average Likert score 4.2/5 vs. 3.5/5 for the baseline).
  • Training stability: The combined reward does not degrade convergence; training time increases by roughly 15 % because of the extra mask network, a modest overhead for a 3B‑parameter model.

Practical Implications

  • Debuggable AI services: Developers can surface the saliency mask to users or internal auditors, offering a concrete “why this answer?” that is backed by the model’s own attention dynamics.
  • Safety & compliance: In regulated domains (finance, healthcare), being able to prove that a decision was driven by specific reasoning steps can satisfy audit requirements and reduce liability.
  • Improved prompt engineering: Knowing which tokens the model deems influential helps engineers craft better CoT prompts or fine‑tune downstream models for tasks like automated tutoring or code generation.
  • Plug‑and‑play RL layer: Since AtManRL builds on GRPO, teams already using RLHF pipelines can add the saliency reward with minimal code changes, gaining interpretability without sacrificing performance.
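To make the “plug‑and‑play” point concrete, here is a minimal sketch of how a saliency reward could be mixed into a GRPO‑style update. The per‑sample rewards and the weighting scheme (a simple weighted sum with coefficient `lam`) are assumptions for illustration, not the paper’s exact recipe; the group‑relative standardization is the core GRPO idea.

```python
import numpy as np

# Hypothetical rewards for a group of G = 4 sampled completions of one prompt:
# a 0/1 outcome (correctness) reward plus a saliency reward per sample.
correctness = np.array([1.0, 0.0, 1.0, 0.0])  # outcome reward
saliency = np.array([0.6, 0.1, 0.2, 0.4])     # saliency reward
lam = 0.5                                     # illustrative mixing weight

combined = correctness + lam * saliency

# GRPO-style group-relative advantage: standardize rewards within the group,
# so no learned value function (critic) is required. Samples that beat the
# group average get positive advantage; the rest get negative.
adv = (combined - combined.mean()) / (combined.std() + 1e-8)
print(adv)
```

In an existing RLHF/GRPO pipeline, this amounts to adding one term to the reward computation; the policy‑gradient machinery downstream is unchanged.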

Limitations & Future Work

  • Scale: Experiments are limited to a 3B‑parameter model; it remains unclear how the approach scales to 30B or larger LLMs, where attention patterns are more diffuse.
  • Reward balance: Tuning the weighting between correctness and saliency rewards is still heuristic; an automated curriculum could make the method more robust.
  • Domain specificity: The saliency mask works well for tasks with a clear causal chain (math, factual QA) but may struggle on open‑ended generation where “influence” is harder to quantify.
  • Future directions suggested by the authors include extending the mask to multi‑head attention, exploring hierarchical saliency (sentence‑level vs. token‑level), and integrating with human‑in‑the‑loop feedback to further align explanations with user expectations.

Authors

  • Max Henning Höth
  • Kristian Kersting
  • Björn Deiseroth
  • Letitia Parcalabescu

Paper Information

  • arXiv ID: 2604.16158v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: April 17, 2026