Fine-tuning For Domain-Customized Retriever Noise Mitigation in RAG Pipelines

Published: December 12, 2025 at 05:58 AM EST
3 min read
Source: Dev.to

Authors (Affiliation: IBM Research, India)

  • Padmanabha V. Seshadri
  • Rudra Murthy
  • Arkadeep Acharya
  • Jaydeep Sen
  • Kushagra Bhushan
  • Yatin Nandwani
  • Praveen Jayachandran
  • Ashok Pon Kumar

RAG pipelines are the go-to framework for supporting Conversational AI with domain-specific customization. A typical system is built around a set of documents that serve as the source of domain knowledge. End-users pose a query, which triggers retrieval of chunks relevant to that query from the document set; these chunks are then infused as context along with the query.

The LLM powering the system receives the retrieved chunks and the query as input and is expected to generate a response grounded in the contextual chunks.
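For concreteness, the sketch below shows this retrieve-then-generate loop. The `vector_db` and `llm` objects, the prompt template, and the `search`/`generate` calls are illustrative placeholders, not the system described in this post.

```python
# Minimal retrieve-then-generate sketch (illustrative, not the authors' pipeline).
# `vector_db` and `llm` are assumed to expose simple search/generate interfaces.

def answer_query(query: str, vector_db, llm, top_k: int = 5) -> str:
    # Retrieve the top-k chunks most similar to the query.
    chunks = vector_db.search(query, k=top_k)

    # Infuse the retrieved chunks as context alongside the query.
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # The LLM generates a response grounded in the contextual chunks.
    return llm.generate(prompt)
```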

However, retrieval is not fool-proof: chunks that look relevant to the retriever may be irrelevant to the query. Moreover, a correct answer may be phrased differently from the ground truth, and such paraphrasing is perfectly acceptable to end-users in a conversational interaction. To make the LLM tolerant to this variation, answer paraphrases need to be infused during fine-tuning.

To address these challenges, we conducted an ablation study to identify a suitable fine‑tuning data recipe that mitigates retriever noise, using IBM’s Granite 4 hybrid models (link) and BharatGen’s sovereign models (link) for real‑world use cases in agriculture and finance.

Domain-Specific Data Recipe: An Overview

Figure 1 illustrates the steps involved in the data recipe. The input to the pipeline is a set of documents. There are two main stages in processing these documents and generating a dataset:

  • Documents‑to‑samples: Converts the documents into question‑and‑answer (QA) pairs.
  • Sample augmentation: Augments the QA pairs first by generating distractors (the RAFT [3] method) and then by answer paraphrasing (the PA‑RAG [4] method).

Figure 1: Illustration of end-to-end domain-specific data generation

The steps are elaborated in the sections below.
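Before diving into each stage, the hypothetical schema below sketches what one fully augmented sample ends up carrying: the QA pair, its gold chunk, RAFT-style distractor chunks, and PA-RAG answer paraphrases. The field names are ours, not the authors'.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingSample:
    """Hypothetical schema for one augmented QA sample (field names are illustrative)."""
    question: str
    answer: str                # gold answer generated from the source chunk
    gold_chunk: str            # chunk the QA pair was generated from
    distractor_chunks: list[str] = field(default_factory=list)    # RAFT-style retriever noise
    paraphrased_answers: list[str] = field(default_factory=list)  # PA-RAG answer variations
```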

Documents‑To‑Samples

We first chunk the documents using tools such as docling, then generate QA pairs associated with the chunked data. This forms the initial training set for domain customization. The process follows a synthetic data generation (SDG) approach, illustrated in Figure 2.
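A hedged sketch of this chunking step is shown below; the class names and arguments follow recent docling releases and may differ across versions, and the embedding/storage step is only indicated in a comment.

```python
# Hedged sketch of chunking a document with docling; treat class names and
# arguments as illustrative, since they may differ across docling versions.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
result = converter.convert("manual.pdf")   # any supported source document

chunker = HybridChunker()                  # token-aware, hierarchy-respecting chunking
chunks = [chunk.text for chunk in chunker.chunk(dl_doc=result.document)]

# `chunks` would then be embedded and stored in the VectorDB for QA generation.
```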

SDG Flow

  1. Chunking & Embedding – Break documents into chunks and store them in a VectorDB.

  2. Synthetic Q&A Generation – Create question‑answer pairs via the SDGHub [5] framework, using several LLMs whose outputs were mixed.

  3. Scoring – An LLM‑based scorer evaluates:

    • Answerability: Whether a query can be answered from the document/passage.
    • Faithfulness: Whether the answer faithfully reflects the source material.

    Samples failing either criterion are filtered out.
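A minimal sketch of such a filter follows, assuming a generic judge LLM with a `generate` method that returns a yes/no verdict; this is not the SDGHub scoring implementation.

```python
# Illustrative answerability/faithfulness filter; `judge` is an assumed LLM client
# that returns "yes" or "no" -- not the actual SDGHub scorer.

ANSWERABILITY_PROMPT = (
    "Passage:\n{passage}\n\nQuestion: {question}\n"
    "Can the question be answered from the passage alone? Answer yes or no."
)
FAITHFULNESS_PROMPT = (
    "Passage:\n{passage}\n\nQuestion: {question}\nAnswer: {answer}\n"
    "Is the answer fully supported by the passage? Answer yes or no."
)

def keep_sample(judge, passage: str, question: str, answer: str) -> bool:
    answerable = judge.generate(
        ANSWERABILITY_PROMPT.format(passage=passage, question=question)
    ).strip().lower().startswith("yes")
    faithful = judge.generate(
        FAITHFULNESS_PROMPT.format(passage=passage, question=question, answer=answer)
    ).strip().lower().startswith("yes")
    # Samples failing either criterion are filtered out.
    return answerable and faithful
```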

Figure 2: Flow of synthetic data generation

Sample Augmentation

The synthetically generated training set is further enriched with distractor and paraphrasing strategies to make the generator robust to retriever noise.

Creating distractors with RAFT [3]

We apply the Retrieval‑augmented fine‑tuning (RAFT) post‑training recipe:

  • For each query, retrieve the top‑k matching chunks from the VectorDB.
  • Chunks that do not match the gold chunk become distractors.
  • Distractor chunks are added to the sample, optionally alongside the gold chunk, based on a sampling proportion p.
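The sketch below illustrates this construction, assuming a generic `vector_db.search` interface; the function name and the way p is applied are our own choices, not RAFT's reference code.

```python
import random

def build_raft_context(query: str, gold_chunk: str, vector_db,
                       top_k: int = 5, p: float = 0.8) -> list[str]:
    """Assemble the context for one training sample: distractors plus,
    with probability p, the gold chunk (illustrative of the RAFT recipe)."""
    retrieved = vector_db.search(query, k=top_k)

    # Chunks that do not match the gold chunk become distractors.
    distractors = [c.text for c in retrieved if c.text != gold_chunk]

    # Include the gold chunk only for a proportion p of samples.
    context = list(distractors)
    if random.random() < p:
        context.append(gold_chunk)

    random.shuffle(context)  # avoid positional bias toward the gold chunk
    return context
```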

Paraphrasing with PA‑RAG [4]

For each training question, we generate multiple paraphrased answers (e.g., three paraphrases per answer). Fine‑tuning with these variations helps the LLM internalize domain knowledge and become tolerant to answer phrasing differences.
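A sketch of how such paraphrases might be produced with a generic LLM client follows; the prompt wording, the `temperature` value, and the three-paraphrase default are assumptions for illustration.

```python
def paraphrase_answers(llm, question: str, answer: str, n_paraphrases: int = 3) -> list[str]:
    """Generate paraphrased variants of a gold answer (illustrative of the PA-RAG idea)."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Rewrite the answer so that it keeps exactly the same meaning "
        "but uses different wording."
    )
    # One extra training target per paraphrase; temperature > 0 encourages variety.
    return [llm.generate(prompt, temperature=0.9) for _ in range(n_paraphrases)]
```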

Tuning Configuration

Evaluation dataset preparation

  • ~2,000 evaluation samples are drawn from the training dataset.
  • Each evaluation sample is backed by at least one training sample that uses the same gold chunk, ensuring the knowledge being evaluated is covered during training.
  • Chunk IDs are used to enforce this coverage constraint.
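A sketch of how this coverage constraint could be enforced follows, assuming each sample carries a `chunk_id` field (our naming, not the authors').

```python
from collections import Counter
import random

def split_eval(samples: list[dict], n_eval: int = 2000, seed: int = 0):
    """Hold out eval samples only when their gold chunk still backs at least
    one remaining training sample (illustrative of the coverage constraint)."""
    rng = random.Random(seed)
    rng.shuffle(samples)

    # Number of samples referencing each gold chunk, kept up to date as we hold out.
    per_chunk = Counter(s["chunk_id"] for s in samples)

    eval_set, train_set = [], []
    for s in samples:
        # Hold out only if another sample will still cover this chunk in training.
        if len(eval_set) < n_eval and per_chunk[s["chunk_id"]] > 1:
            eval_set.append(s)
            per_chunk[s["chunk_id"]] -= 1
        else:
            train_set.append(s)
    return train_set, eval_set
```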

Training configuration

Fine‑tuning was performed with the fms‑hf‑tuning [6] stack:

| Parameter | Value |
| --- | --- |
| Learning rate | 1e-5 |
| Warm-up ratio | 0.01 |
| Gradient accumulation steps | 1 |
| Number of epochs | 3 |
| Optimizer | AdamW (β₁=0.9, β₂=0.98, ε=1e-10) |
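The exact fms-hf-tuning invocation is not reproduced here; as an approximation, the stated hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows (a sketch, not the authors' configuration; `output_dir` and `bf16` are assumptions).

```python
from transformers import TrainingArguments

# Sketch of the stated hyperparameters expressed as Hugging Face TrainingArguments;
# the actual runs used the fms-hf-tuning stack, so flags and defaults may differ.
training_args = TrainingArguments(
    output_dir="./granite-rag-ft",   # illustrative output path
    learning_rate=1e-5,
    warmup_ratio=0.01,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-10,
    bf16=True,                       # typical on A100s; an assumption
)
```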

Hardware

  • 1 node with 8 × NVIDIA A100 80 GB GPUs.

Models and baselines

  • Agriculture use‑case: Fine‑tuned Granite 4.0 tiny hybrid [1] (7 B parameters, MoE + Mamba + Transformer).
  • Finance use‑case: Fine‑tuned BharatGen’s FinanceParam [2] (derived from BharatGen Param‑1‑2.9B‑Instruct).

Results

Evaluation metrics

  1. Rouge‑L [7] – Measures longest‑common‑subsequence overlap between generated and gold responses; it rewards exact lexical matches and does not credit paraphrasing.
  2. LLM‑as‑a‑Judge – Uses the meta‑llama/Llama‑3.3‑70B‑Instruct model (link) to assess answer quality, including paraphrastic correctness.
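As an illustration of why the two metrics can disagree on paraphrases, the sketch below computes Rouge-L with the `rouge_score` package on an invented example pair; the LLM-as-a-Judge prompt is not reproduced.

```python
from rouge_score import rouge_scorer

# Rouge-L rewards longest-common-subsequence overlap, so a faithful paraphrase
# can still score low -- which is why an LLM judge is used alongside it.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

gold = "Sow the wheat seeds in early November after the first irrigation."
pred = "Wheat should be sown in early November, once the field has been irrigated."

score = scorer.score(gold, pred)["rougeL"]
print(f"Rouge-L F1: {score.fmeasure:.2f}")  # modest despite the answers matching in meaning
```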