[Paper] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

Published: December 4, 2025

Source: arXiv - 2512.05012v1

Overview

The paper proposes Self‑Explaining Contrastive Evidence Re‑ranking (CER), a new way to make Retrieval‑Augmented Generation (RAG) systems both more factual and transparent. By reshaping the embedding space with contrastive learning and attaching token‑level attribution rationales to each retrieved passage, CER forces the retriever to surface truly evidential content while pushing away subjective or misleading text. The authors demonstrate the approach on clinical‑trial reports, showing gains in retrieval accuracy and a reduction in hallucinations—an especially valuable advance for safety‑critical AI applications.

Key Contributions

  • Contrastive fine‑tuning of retriever embeddings using automatically mined hard negatives that are subjective rather than factual.
  • Token‑level attribution rationales generated for every retrieved passage, giving developers a clear, interpretable “why” behind each result.
  • Evidence‑aligned embedding space that clusters factual explanations together and separates misleading ones, improving downstream RAG generation quality.
  • Empirical validation on a clinical‑trial corpus, showing measurable improvements in retrieval precision and a drop in hallucinated outputs.
  • A lightweight, plug‑and‑play pipeline that can be added to existing retriever‑generator stacks without massive architectural changes.

Methodology

  1. Data Preparation & Hard Negative Mining

    • The authors start with a collection of documents (e.g., clinical‑trial reports).
    • For each query, they automatically select subjective passages (e.g., opinionated language, hedging) as hard negatives, using a simple subjectivity classifier.
  2. Contrastive Learning Objective

    • The retriever’s dense embeddings are fine‑tuned with a contrastive loss:
      • Positive pairs = query ↔ factual passage (high‑quality evidence).
      • Negative pairs = query ↔ subjective passage.
    • This pushes factual evidence closer to the query in the vector space while pulling subjective text away (a minimal sketch of this objective follows the list below).
  3. Self‑Explaining Attribution

    • After retrieval, each passage is passed through a lightweight attribution model (e.g., gradient‑based or attention‑based) that highlights the exact tokens responsible for the relevance score.
    • The resulting token‑level heatmap is stored alongside the passage, giving a human‑readable explanation.
  4. Integration with RAG

    • The re‑ranked, annotated passages are fed to the generator component. Because the retrieved context is now evidence‑rich and transparent, the generator’s output is less prone to hallucination.
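
To make steps 1 and 2 concrete, the following minimal sketch fine-tunes a toy encoder with a contrastive objective of the kind described above. It is an illustration rather than the authors' code: the ToyEncoder class, the hyperparameters, and the random token-id batches are hypothetical stand-ins, and the subjective hard negatives are assumed to have already been flagged by the subjectivity classifier from step 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    # Stand-in dense retriever: embeds token ids and mean-pools into a unit vector.
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        return F.normalize(self.emb(token_ids).mean(dim=1), dim=-1)

def contrastive_loss(q, pos, neg, temperature=0.05):
    # InfoNCE-style loss: pull the query toward the factual passage
    # and push it away from the mined subjective hard negative.
    pos_sim = (q * pos).sum(dim=-1, keepdim=True)       # (batch, 1)
    neg_sim = (q * neg).sum(dim=-1, keepdim=True)       # (batch, 1)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # the positive is always index 0
    return F.cross_entropy(logits, labels)

encoder = ToyEncoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

# Dummy token-id batches standing in for (query, factual passage, subjective hard negative)
# triples; in practice the negatives come from the subjectivity classifier of step 1.
queries   = torch.randint(0, 30522, (8, 16))
positives = torch.randint(0, 30522, (8, 64))
negatives = torch.randint(0, 30522, (8, 64))

loss = contrastive_loss(encoder(queries), encoder(positives), encoder(negatives))
loss.backward()
optimizer.step()
print(float(loss))
```

In practice the toy encoder would be replaced by the production bi-encoder, and the triples would be built from the mined corpus rather than random ids.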

The whole pipeline can be dropped into existing RAG frameworks (e.g., Haystack, LangChain) with minimal code changes; the sketch below illustrates the attribution and generation steps.
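
For steps 3 and 4, a simple gradient-based saliency computation looks roughly like the sketch below. This is a hypothetical illustration under the same toy mean-pooling assumption as the previous snippet; the paper's attribution model may differ (e.g., attention-based), and the prompt format is purely a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 1000, 64
emb = nn.Embedding(vocab_size, dim)

def encode(token_embs):
    # Mean-pool token embeddings into a normalized vector (stand-in for a dense encoder).
    return F.normalize(token_embs.mean(dim=0), dim=-1)

def token_saliency(query_ids, passage_ids):
    # Score each passage token by the gradient of the query-passage relevance score
    # with respect to that token's embedding (a simple gradient-based attribution).
    q_vec = encode(emb(query_ids)).detach()
    p_embs = emb(passage_ids).detach().requires_grad_(True)
    relevance = (encode(p_embs) * q_vec).sum()     # cosine-style relevance score
    relevance.backward()
    return p_embs.grad.norm(dim=-1)                # one saliency value per token

query_ids = torch.randint(0, vocab_size, (6,))
passage_ids = torch.randint(0, vocab_size, (30,))
saliency = token_saliency(query_ids, passage_ids)

# Attach the most salient token positions to the passage as its rationale,
# then hand the re-ranked, annotated passage to the generator prompt.
top_positions = saliency.topk(5).indices.tolist()
prompt = (
    f"Evidence passage (key tokens at positions {top_positions}): <passage text>\n"
    "Question: <query>\nAnswer using only the evidence above."
)
print(prompt)
```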

Results & Findings

| Metric | Baseline Retriever | CER‑Enhanced Retriever |
| --- | --- | --- |
| Top‑5 Retrieval Accuracy (clinical trials) | 71.2 % | 78.9 % |
| Hallucination Rate in Generated Answers | 12.4 % | 6.7 % |
| Average Attribution F1 (token‑level) | n/a | 0.81 |

  • Higher retrieval precision: By explicitly teaching the model to separate factual from subjective content, CER lifts the proportion of truly relevant passages in the top‑k results.
  • Reduced hallucinations: When the generator receives cleaner, evidence‑backed context, it is far less likely to fabricate unsupported statements.
  • Transparent evidence: The token‑level rationales allow developers (and end‑users) to inspect why a passage was considered relevant, a crucial feature for auditability in regulated domains.

Practical Implications

  • Safer AI assistants: Deployments in healthcare, finance, or legal advice can benefit from a built‑in guardrail that curtails hallucinations and provides traceable evidence.
  • Debugging & compliance: Token‑level attributions make it easier to audit retrieval pipelines, satisfy regulatory requirements, and quickly pinpoint why a model made a particular decision.
  • Improved user trust: Showing users the exact evidence supporting a generated answer can boost confidence, especially in high‑stakes settings.
  • Plug‑and‑play upgrade: Teams already using dense retrievers (e.g., FAISS, Milvus) can adopt CER with a modest fine‑tuning step and an attribution layer, without redesigning the entire RAG stack (see the indexing sketch after this list).
  • Better downstream training: The evidence‑aligned embeddings can be reused for other tasks—like fact‑checking, summarization, or citation generation—making the investment reusable across multiple products.
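
As a rough picture of the plug-and-play claim above, the sketch below indexes CER-tuned passage embeddings in a FAISS index and pulls the top-5 evidence candidates for re-ranking. The random vectors are placeholders for embeddings produced by the fine-tuned encoder, and any other dense vector store (e.g., Milvus) would slot in the same way.

```python
import faiss
import numpy as np

dim = 128
# Random vectors stand in for CER-tuned passage embeddings; in practice these
# come from the contrastively fine-tuned encoder.
passage_vecs = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(passage_vecs)              # L2-normalize so inner product = cosine

index = faiss.IndexFlatIP(dim)                # exact inner-product search
index.add(passage_vecs)

query_vec = np.random.rand(1, dim).astype("float32")   # CER-tuned query embedding
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)      # top-5 evidence candidates for re-ranking
print(ids[0], scores[0])
```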

Limitations & Future Work

  • Domain specificity: The current experiments focus on clinical‑trial texts; performance on more heterogeneous corpora (e.g., news, code) remains to be validated.
  • Subjectivity classifier reliance: The quality of hard negatives depends on the initial subjectivity detector, which may introduce bias if not carefully calibrated.
  • Scalability of attribution: Token‑level rationales add computational overhead; optimizing for real‑time inference in large‑scale systems is an open challenge.
  • Future directions suggested by the authors include extending CER to multi‑modal evidence (tables, figures), exploring self‑supervised subjectivity detection, and integrating the method with LLM‑native retrieval plugins for tighter end‑to‑end training.

Authors

  • Francielle Vargas
  • Daniel Pedronette

Paper Information

  • arXiv ID: 2512.05012v1
  • Categories: cs.CL
  • Published: December 4, 2025