Fine-tuning For Domain-Customized Retriever Noise Mitigation in RAG Pipelines

Published: December 12, 2025 at 05:58 AM EST
3 min read
Source: Dev.to

Authors (Affiliation: IBM Research, India)

  • Padmanabha V. Seshadri
  • Rudra Murthy
  • Arkadeep Acharya
  • Jaydeep Sen
  • Kushagra Bhushan
  • Yatin Nandwani
  • Praveen Jayachandran
  • Ashok Pon Kumar

RAG pipelines are the go-to framework for supporting Conversational AI with domain-specific customization. A typical system is built around a set of documents that serve as the source of domain knowledge. End-users pose a query, which triggers retrieval of chunks relevant to that query from the document set; these chunks are then infused as context along with the query.

The LLM powering the system receives the retrieved chunks and the query as input and is expected to generate a response grounded in the contextual chunks.
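For concreteness, the sketch below shows this retrieve-then-generate loop. The `vector_db` and `llm` objects, the prompt template, and the `search`/`generate` calls are illustrative placeholders, not the system described in this post.

```python
# Minimal retrieve-then-generate sketch (illustrative, not the authors' pipeline).
# `vector_db` and `llm` are assumed to expose simple search/generate interfaces.

def answer_query(query: str, vector_db, llm, top_k: int = 5) -> str:
    # Retrieve the top-k chunks most similar to the query.
    chunks = vector_db.search(query, k=top_k)

    # Infuse the retrieved chunks as context alongside the query.
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # The LLM generates a response grounded in the contextual chunks.
    return llm.generate(prompt)
```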

However, retrieval is not fool-proof: chunks that look relevant to the retriever may be irrelevant to the query. Moreover, a correct answer may be phrased differently from the ground truth, and such paraphrasing is perfectly acceptable to end-users in a conversational interaction. To make the LLM tolerant to this variation, answer paraphrases need to be infused during fine-tuning.

To address these challenges, we conducted an ablation study to identify a suitable fine‑tuning data recipe that mitigates retriever noise, using IBM’s Granite 4 hybrid models (link) and BharatGen’s sovereign models (link) for real‑world use cases in agriculture and finance.

Domain-Specific Data Recipe: An Overview

Figure 1 illustrates the steps involved in the data recipe. The input to the pipeline is a set of documents. There are two main stages in processing these documents and generating a dataset:

  • Documents‑to‑samples: Converts the documents into question‑and‑answer (QA) pairs.
  • Sample augmentation: Augments the QA pairs first by generating distractors (the RAFT [3] method) and then by answer paraphrasing (the PA‑RAG [4] method).

Figure 1: Illustration of end-to-end domain-specific data generation

The steps are elaborated in the sections below.
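Before diving into each stage, the hypothetical schema below sketches what one fully augmented sample ends up carrying: the QA pair, its gold chunk, RAFT-style distractor chunks, and PA-RAG answer paraphrases. The field names are ours, not the authors'.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingSample:
    """Hypothetical schema for one augmented QA sample (field names are illustrative)."""
    question: str
    answer: str                # gold answer generated from the source chunk
    gold_chunk: str            # chunk the QA pair was generated from
    distractor_chunks: list[str] = field(default_factory=list)    # RAFT-style retriever noise
    paraphrased_answers: list[str] = field(default_factory=list)  # PA-RAG answer variations
```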

Documents‑To‑Samples

We first chunk the documents using tools such as docling, then generate QA pairs associated with the chunked data. This forms the initial training set for domain customization. The process follows a synthetic data generation (SDG) approach, illustrated in Figure 2.
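A hedged sketch of this chunking step is shown below; the class names and arguments follow recent docling releases and may differ across versions, and the embedding/storage step is only indicated in a comment.

```python
# Hedged sketch of chunking a document with docling; treat class names and
# arguments as illustrative, since they may differ across docling versions.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
result = converter.convert("manual.pdf")   # any supported source document

chunker = HybridChunker()                  # token-aware, hierarchy-respecting chunking
chunks = [chunk.text for chunk in chunker.chunk(dl_doc=result.document)]

# `chunks` would then be embedded and stored in the VectorDB for QA generation.
```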

SDG Flow

  1. Chunking & Embedding – Break documents into chunks and store them in a VectorDB.

  2. Synthetic Q&A Generation – Create question‑answer pairs via the SDGHub [5] framework, using several LLMs whose outputs were mixed.

  3. Scoring – An LLM‑based scorer evaluates:

    • Answerability: Whether a query can be answered from the document/passage.
    • Faithfulness: Whether the answer faithfully reflects the source material.

    Samples failing either criterion are filtered out.
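A minimal sketch of such a filter follows, assuming a generic judge LLM with a `generate` method that returns a yes/no verdict; this is not the SDGHub scoring implementation.

```python
# Illustrative answerability/faithfulness filter; `judge` is an assumed LLM client
# that returns "yes" or "no" -- not the actual SDGHub scorer.

ANSWERABILITY_PROMPT = (
    "Passage:\n{passage}\n\nQuestion: {question}\n"
    "Can the question be answered from the passage alone? Answer yes or no."
)
FAITHFULNESS_PROMPT = (
    "Passage:\n{passage}\n\nQuestion: {question}\nAnswer: {answer}\n"
    "Is the answer fully supported by the passage? Answer yes or no."
)

def keep_sample(judge, passage: str, question: str, answer: str) -> bool:
    answerable = judge.generate(
        ANSWERABILITY_PROMPT.format(passage=passage, question=question)
    ).strip().lower().startswith("yes")
    faithful = judge.generate(
        FAITHFULNESS_PROMPT.format(passage=passage, question=question, answer=answer)
    ).strip().lower().startswith("yes")
    # Samples failing either criterion are filtered out.
    return answerable and faithful
```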

Figure 2: Flow of synthetic data generation

Sample Augmentation

The synthetically generated training set is further enriched with distractor and paraphrasing strategies to make the generator robust to retriever noise.

Creating distractors with RAFT [3]

We apply the Retrieval‑augmented fine‑tuning (RAFT) post‑training recipe:

  • For each query, retrieve the top‑k matching chunks from the VectorDB.
  • Chunks that do not match the gold chunk become distractors.
  • Distractor chunks are added to the sample, optionally alongside the gold chunk, based on a sampling proportion p.
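The sketch below illustrates this construction, assuming a generic `vector_db.search` interface; the function name and the way p is applied are our own choices, not RAFT's reference code.

```python
import random

def build_raft_context(query: str, gold_chunk: str, vector_db,
                       top_k: int = 5, p: float = 0.8) -> list[str]:
    """Assemble the context for one training sample: distractors plus,
    with probability p, the gold chunk (illustrative of the RAFT recipe)."""
    retrieved = vector_db.search(query, k=top_k)

    # Chunks that do not match the gold chunk become distractors.
    distractors = [c.text for c in retrieved if c.text != gold_chunk]

    # Include the gold chunk only for a proportion p of samples.
    context = list(distractors)
    if random.random() < p:
        context.append(gold_chunk)

    random.shuffle(context)  # avoid positional bias toward the gold chunk
    return context
```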

Paraphrasing with PA‑RAG [4]

For each training question, we generate multiple paraphrased answers (e.g., three paraphrases per answer). Fine‑tuning with these variations helps the LLM internalize domain knowledge and become tolerant to answer phrasing differences.
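A sketch of how such paraphrases might be produced with a generic LLM client follows; the prompt wording, the `temperature` value, and the three-paraphrase default are assumptions for illustration.

```python
def paraphrase_answers(llm, question: str, answer: str, n_paraphrases: int = 3) -> list[str]:
    """Generate paraphrased variants of a gold answer (illustrative of the PA-RAG idea)."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Rewrite the answer so that it keeps exactly the same meaning "
        "but uses different wording."
    )
    # One extra training target per paraphrase; temperature > 0 encourages variety.
    return [llm.generate(prompt, temperature=0.9) for _ in range(n_paraphrases)]
```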

Tuning Configuration

Evaluation dataset preparation

  • ~2,000 evaluation samples are drawn from the training dataset.
  • Each evaluation sample is backed by at least one training sample that uses the same gold chunk, ensuring the knowledge being evaluated is covered during training.
  • Chunk IDs are used to enforce this coverage constraint.
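A sketch of how this coverage constraint could be enforced follows, assuming each sample carries a `chunk_id` field (our naming, not the authors').

```python
from collections import Counter
import random

def split_eval(samples: list[dict], n_eval: int = 2000, seed: int = 0):
    """Hold out eval samples only when their gold chunk still backs at least
    one remaining training sample (illustrative of the coverage constraint)."""
    rng = random.Random(seed)
    rng.shuffle(samples)

    # Number of samples referencing each gold chunk, kept up to date as we hold out.
    per_chunk = Counter(s["chunk_id"] for s in samples)

    eval_set, train_set = [], []
    for s in samples:
        # Hold out only if another sample will still cover this chunk in training.
        if len(eval_set) < n_eval and per_chunk[s["chunk_id"]] > 1:
            eval_set.append(s)
            per_chunk[s["chunk_id"]] -= 1
        else:
            train_set.append(s)
    return train_set, eval_set
```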

Training configuration

Fine‑tuning was performed with the fms‑hf‑tuning [6] stack:

| Parameter | Value |
| --- | --- |
| Learning rate | 1e-5 |
| Warm-up ratio | 0.01 |
| Gradient accumulation steps | 1 |
| Number of epochs | 3 |
| Optimizer | AdamW (β₁=0.9, β₂=0.98, ε=1e-10) |
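The exact fms-hf-tuning invocation is not reproduced here; as an approximation, the stated hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows (a sketch, not the authors' configuration; `output_dir` and `bf16` are assumptions).

```python
from transformers import TrainingArguments

# Sketch of the stated hyperparameters expressed as Hugging Face TrainingArguments;
# the actual runs used the fms-hf-tuning stack, so flags and defaults may differ.
training_args = TrainingArguments(
    output_dir="./granite-rag-ft",   # illustrative output path
    learning_rate=1e-5,
    warmup_ratio=0.01,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-10,
    bf16=True,                       # typical on A100s; an assumption
)
```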

Hardware

  • 1 node with 8 × NVIDIA A100 80 GB GPUs.

Models and baselines

  • Agriculture use‑case: Fine‑tuned Granite 4.0 tiny hybrid [1] (7 B parameters, MoE + Mamba + Transformer).
  • Finance use‑case: Fine‑tuned BharatGen’s FinanceParam [2] (derived from BharatGen Param‑1‑2.9B‑Instruct).

Results

Evaluation metrics

  1. Rouge‑L [7] – Measures longest‑common‑subsequence overlap between generated and gold responses; it rewards exact lexical matches and does not credit paraphrasing.
  2. LLM‑as‑a‑Judge – Uses the meta‑llama/Llama‑3.3‑70B‑Instruct model (link) to assess answer quality, including paraphrastic correctness.
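As an illustration of why the two metrics can disagree on paraphrases, the sketch below computes Rouge-L with the `rouge_score` package on an invented example pair; the LLM-as-a-Judge prompt is not reproduced.

```python
from rouge_score import rouge_scorer

# Rouge-L rewards longest-common-subsequence overlap, so a faithful paraphrase
# can still score low -- which is why an LLM judge is used alongside it.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

gold = "Sow the wheat seeds in early November after the first irrigation."
pred = "Wheat should be sown in early November, once the field has been irrigated."

score = scorer.score(gold, pred)["rougeL"]
print(f"Rouge-L F1: {score.fmeasure:.2f}")  # modest despite the answers matching in meaning
```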