Build a Domain-Specific Embedding Model in Under a Day

Published: 2 hours ago (March 20, 2026 at 03:38 PM EDT)

14 min read

Source: Hugging Face Blog

If you are building a RAG (Retrieval-Augmented Generation) system, you have likely hit this wall: Everything works… until it doesn’t. General-purpose embedding models are trained to understand the internet; not your contracts, manufacturing logs, proprietary chemical formulations or internal taxonomy. They capture broad semantic similarity, but they do not understand the fine-grained distinctions that matter in your domain. Fine-tuning an embedding model can improve the performance of your retrieval pipeline when off-the-shelf models fail to effectively capture domain-specific nuances. Despite how critical embeddings are to RAG performance, the process remains surprisingly fragmented, the skills required are specialized, and the time investment is daunting.

With a single GPU and less than a day of training time, you can transform a general-purpose embedding model into one that truly understands your domain, no manual labeling required. To help you hit the ground running, we are also releasing a ready-to-use synthetic training dataset generated from NVIDIA’s public documentation using this exact pipeline. Using this data and the recipe, we saw over 10% improvement in both Recall@10 and NDCG@10. Atlassian applied this recipe to fine-tune on their JIRA dataset, increasing Recall@60 from 0.751 to 0.951, a 26% improvement - on a single GPU.

	🔗Quick Links to Dataset and Codes:

Embedding Model
GitHub

Synthetic dataset on NVIDIA’s public documents

  🧑‍💻Open Source Projects Recipe Integrates:

NeMo Data Designer for synthetic data generation
NeMo Automodel for embedding model training
BEIR for Information retrieval evaluation
NeMo Export-Deploy for ONNX/TensorRT conversion
NVIDIA NIM for production inference serving
```
  📋Prerequisites:
```
A directory of domain documents (text files - .txt, .md, or similar)
A valid NVIDIA API key (free at build.nvidia.com)

NVIDIA Ampere GPU or newer with at least 80GB memory (with Compute Capability >= 8.0)

This tutorial has been tested on 1xA100 (80GB), and 1xH100 (80GB)

By the end of this post, you’ll know how to answer: 📄 Generate training data from domain documents without labeled data 🎯 Use hard negative mining for effective contrastive training 🔗 Improve embedding quality with multi-hop queries ⚙️ Fine-tune a bi-encoder embedding model 📊 Evaluate whether fine-tuning improves retrieval 🚀 Deploy the fine-tuned model in your pipeline

	⚙️Setup

In this tutorial, we will finetune the base model Llama-Nemotron-Embed-1B-v2 - a 1-billion-parameter embedding model that balances quality and inference cost. To get started, follow this setup guide.

	📚 Step 1: Generate Training Data from Documents

Fine-tuning an embedding model requires thousands of (query, relevant document) pairs. Most use cases don’t have this data readily available. Creating it manually is expensive, slow, and often biased by the annotator’s personal interpretation of what’s “relevant.” Instead of labeling data by hand, you can use an LLM (nvidia/nemotron-3-nano-30b-a3b) to read your documents and automatically generate high-quality synthetic question–answer pairs.

nemotron embed sdg -c default corpus_dir=./data/my_domain_docs

	How does it work?

Behind the scenes, this runs a four-stage synthetic data generation (SDG) pipeline powered by NeMo Data Designer:

	What does the output look like?

Source document chunk:

The thermal design power (TDP) of the H100 GPU is 700W in SXM form factor. The cooling solution must maintain junction temperature below 83°C under sustained workloads. Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.

Generated QA pairs:

{
  "question": "What cooling approach is recommended when deploying more than 4 H100 GPUs per server node?",
  "answer": "Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.",
  "query_type": "contextual",
  "reasoning_type": "factual",
  "question_complexity": 3,
 

 "segment_ids": [1],
  "quality_score": 8.5
}

{
  "question": "How does the 700W TDP of the H100 SXM constrain the choice between air and liquid cooling in multi-GPU configurations?",
  "answer": "The 700W TDP generates substantial heat that must be dissipated to keep junction temperatures below 83°C. In dense configurations exceeding 4 GPUs per node, air cooling in standard 2U chassis cannot handle this thermal load, making liquid cooling necessary.",
  "query_type": "multi_hop",
  "reasoning_type": "causal",
  "question_complexity": 4,
  "segment_ids": [1, 2],
  "hop_count": 2,
  "quality_score": 9.0
}

Notice the difference: the first question is a simple factual lookup. The second requires multi-hop, causal reasoning. The pipeline generates both types, with configurable complexity levels (2–5) and hop counts (1–3). Each QA pair then undergoes quality evaluation, receiving sub-scores for relevance, accuracy, context support, and clarity, along with an overall score. Only pairs that meet the threshold are included in training.

	⛏️ Step 2: Mine Hard Negatives (and Why They Matter)

If you train an embedding model with only positive pairs (query + correct document), it learns to distinguish obviously different documents but fails on the hard cases — passages that look relevant but are not the right answer. In a real retrieval system, these near-misses are exactly the documents that cause bad answers. Hard negative mining finds these confusing passages so the model can learn to tell them apart.

nemotron embed prep -c default

The above command runs three sub-steps automatically:

	2a. Train / Validation / Test Split

The generated QA pairs are split into training (80%) and test (20%) sets. The test set is formatted as a BEIR-compatible benchmark for standardized evaluation in Step 5.

	2b. Hard Negative Mining

Using the base embedding model, the pipeline:

Embeds every query and every passage in the corpus.
Computes similarity between each query and all passages.
Masks out each query’s labeled positive documents.
Applies a margin filter: any non-positive document scoring above 95% of the minimum positive score is eliminated. This exclusion zone guards against false negatives — unlabeled passages that are so close to the positive they may actually be relevant.
From the surviving candidates, selects the top-k highest-scoring documents as hard negatives (5 per query by default).

The result: hard negatives are the most similar non-positive passages that still fall safely below the positive-score ceiling. They are passages the current model considers highly relevant but that are not the labeled answer.

Why this works: Training on easy negatives (completely unrelated passages) teaches the model nothing new. Training on hard negatives forces it to learn the subtle distinctions that matter in your domain. For example, in a medical corpus, a question about “metformin dosage for Type 2 diabetes” might have hard negatives about “metformin side effects” or “insulin dosage for Type 1 diabetes” — close but critically different. The 95% margin ceiling prevents the miner from selecting passages that are too close to the positive, which could actually be correct answers that simply weren’t labeled during SDG.

	2c. Multi-Hop Unrolling

Multi-hop questions reference multiple positive documents. For example, a question like “How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?” has two positive passages.

Unrolling creates one training example per (query, positive document) pair, so the contrastive loss sees each positive independently. A question with 2 positive documents becomes 2 training examples, each with the same hard negatives but a different positive.

The final output is a training-ready JSON file:

{
  "question_id": "q42_0",
  "question": "How does the 700W TDP of the H100 SXM constrain cooling choices in multi-GPU nodes?",
  "pos_doc": [{"id": "d_a1b2c3"}],
  "neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}, {"id": "d_p1q2r3"}, {"id": "d_s4t5u6"}, {"id": "d_v7w8x9"}]
}

	🔍 Step 3: Understand Multi-Hop Questions and Why They Improve Retrieval

Standard embedding fine-tuning generates one question per passage and trains the model to match them. This works for simple factual lookups, but real users ask complex questions that span multiple documents or sections. If the model has only seen single-hop training data, it will struggle to retrieve all the relevant passages for these complex queries.

The SDG pipeline generates questions at 1 to 3 hops by default:

1-hop: “What is the TDP of the H100 SXM?” — answered by a single passage.
2-hop: “How does the H100’s TDP relate to cooling requirements in dense deployments?” — requires connecting information from two passages.
3-hop: “Given the TDP, cooling constraints, and rack density limits, what is the maximum number of H100 GPUs deployable in a standard data center row?” — synthesizes three passages.

Each hop is tracked with its own context summary and segment IDs, so the training data preserves the full reasoning chain. After unrolling (Step 2c), each (question, relevant passage) pair becomes an independent training signal, teaching the model that all of these passages are relevant to the multi-hop query.

The fine-tuned model learns to retrieve contextually related documents, not just lexically similar ones.

	🧠 Step 4: Fine-Tune the Embedding Model

nemotron embed finetune -c default

	How contrastive learning works

The training uses a biencoder architecture with contrastive loss.

[ contrastive_learning ](https://cdn-uploads.huggingface.co/pr

oduction/uploads/662a9c2154a9f1037104a075/ntYmgzatUK_Sfn35GCdve.png)

The temperature of 0.02 is deliberately aggressive, it produces a very sharp probability distribution. This works well because the hard negatives from Step 2 are high-quality: they are genuinely confusing passages that the model needs strong gradients to learn to distinguish.

	Key hyperparameters

Parameter Default Notes

Epochs 3 For large dataset, you may lower it to 2 or 1

Learning rate 1e-5 Tuning: try double and half of the default value

Learning rate warmup steps 5 Set to 5-10% of total steps of finetune to have better early training stability

Global batch size 128 Auto-scaled down for small datasets

Passages per query 5 1 positive + 4 hard negatives

	Auto-scaling for small datasets

If your dataset has fewer than 2,000 training examples, the pipeline automatically:

Reduces the batch size (to 16–64) so gradients are meaningful.
Adjusts checkpoint frequency to ensure at least three checkpoints per run.
Scales validation frequency proportionally.

This means you can start with a small corpus (50–100 documents) for a quick proof-of-concept and scale up later.

	📈 Step 5: Measure the Improvement

Did fine-tuning actually help? Let’s find out by running a standardized evaluation comparing the base model against the fine-tuned checkpoint on the held-out test set:

nemotron embed eval -c default

The evaluation uses the BEIR framework and computes four standard information retrieval metrics at k = 1, 5, 10, and 100:

nDCG@k: Ranking quality — are the best documents ranked highest?
Recall@k: Coverage — what fraction of relevant documents appear in the top k?
Precision@k: Accuracy — what fraction of the top k results are actually relevant?
MAP@k: Average precision across all queries

A successful fine-tune typically results in a 15% improvement in nDCG@10 and Recall@10 within Fine-tuned)

NDCG: NDCG@1: 0.55178 → 0.60796 (+0.05618, +10.2%) NDCG@5: 0.51894 → 0.57689 (+0.05795, +11.2%) NDCG@10: 0.55506 → 0.61559 (+0.06053, +10.9%) NDCG@100: 0.60617 → 0.66567 (+0.05950, +9.8%) Recall: Recall@1: 0.28478 → 0.31547 (+0.03069, +10.8%) Recall@5: 0.54486 → 0.60288 (+0.05802, +10.6%) Recall@10: 0.62979 → 0.69296 (+0.06317, +10.0%) Recall@100: 0.81421 → 0.87020 (+0.05599, +6.9%)


	
		
	
	
		What if the numbers don't improve?
	

The pipeline makes it easy to iterate:

- **Low quality scores in SDG?** Check your document quality — clean, well-formatted text produces better synthetic data. Try a larger and more powerful LLM.  

- **Not enough training data?** Add more documents to your corpus and re-run Stage 0.  

- **Overfitting?** Reduce epochs or increase the quality threshold to keep only the best training examples.  

- **Wrong learning rate?** Try 5e-6 for larger datasets or 2e-5 for very small ones.

	
		
	
	
		🏆 Real-World Results: Atlassian
	

This recipe has been validated on real enterprise data by Atlassian. They applied this pipeline to fine-tune Llama-Nemotron-Embed-1B-v2 on a [public Jira dataset](https://zenodo.org/records/15719919/files/2025-06-23%20ThePublicJiraDataset.zip?download=1) using a single NVIDIA A100 80GB GPU, following the same stages described above

**Recall@60 jumped from 0.751 to 0.951 — a 26.7% gain.**
The fine-tuned model retrieves the correct document within the top 60 results for 95.1% of queries, up from 75.1% with the base model. For a retrieval system underpinning Jira search, this directly translates into more relevant results for millions of users. Find more details in their blog post [Advancing semantic search for millions of Rovo users](https://www.atlassian.com/blog/atlassian-engineering/advancing-rovo-semantic-search).  

[![ATL result](https://cdn-uploads.huggingface.co/production/uploads/662a9c2154a9f1037104a075/hsFtTLz1WfSgaWBdHof82.png)](https://cdn-uploads.huggingface.co/production/uploads/662a9c2154a9f1037104a075/hsFtTLz1WfSgaWBdHof82.png)

	
		
	
	
		🚀 Step 6: Export and Deploy
	

A PyTorch checkpoint is great for evaluation but too slow for production. The final two stages convert the model and serve it behind an API.

	
		
	
	
		Export to ONNX / TensorRT

nemotron embed export -c default


This exports the fine-tuned checkpoint to ONNX (opset 17). Optionally, it compiles a TensorRT engine for maximum inference throughput, with configurable optimization profiles for batch size (1–64) and sequence length (3–256):

ONNX only (runs anywhere)

nemotron embed export -c default export_to_trt=false

FP8 quantization for further speedup

nemotron embed export -c default quant_cfg=fp8


	
		
	
	
		Deploy with NVIDIA NIM
	

The exported model is deployed inside an [NVIDIA NIM](https://developer.nvidia.com/nim) container — a production-ready inference microservice exposing an OpenAI-compatible /v1/embeddings endpoint:

nemotron embed deploy -c default


Once running, any client can call it:

curl -X POST http://localhost:8000/v1/embeddings
-H “Content-Type: application/json”
-d ’{“input”: [“What cooling is needed for 8 H100 GPUs in a 2U chassis?”], “model”: “custom”, “input_type”: “query”}’


Because NIM serves an OpenAI-compatible API, you can drop it into any existing RAG pipeline that uses the embeddings API format — no code changes needed.

	
		
	
	
		Verify deployment accuracy
	

The pipeline includes a NIM accuracy verification step that runs the same BEIR evaluation against the deployed endpoint:

nemotron embed eval -c default eval_nim=true eval_base=false


This catches any accuracy loss from the ONNX/TensorRT conversion. Metrics that match within tolerance (0.03 for @1, 0.01 for @5+) are marked with a ch

eck; deviations beyond conversion noise are flagged.

	
		
	
	
		Putting It All Together
	

The full embedding fine-tuning pipeline can be run in six commands, from raw documents to a deployed model.

1. Generate synthetic training data from your documents

nemotron embed sdg -c default corpus_dir=./data/my_docs

2. Prepare the training data (split data, mine hard negatives, unroll)

nemotron embed prep -c default

3. Fine-tune the embedding model

nemotron embed finetune -c default

4. Evaluate the base vs. fine-tuned model

nemotron embed eval -c default

5. Export the optimized model

nemotron embed export -c default

6. Deploy the model

nemotron embed deploy -c default


	
		
	
	
		Expected time and resources
	

	
		
Stage
GPU Required?
Estimated Time
Notes

		
SDG
No (uses API)
~1 hour
Varies by corpus size and API rate limit

Data Prep
Yes (40 GB VRAM)
~5 min
Hard negative mining on GPU

Fine-Tune
Yes (80 GB VRAM)
~1 hours
Varies by dataset size and epochs

Eval
Yes (40 GB VRAM)
~5 min

Export
Yes (40 GB VRAM)
~5 min
TensorRT requires NGC container

Deploy
Yes (40 GB VRAM)
~5 min
NIM container startup

	

**Total: under a day, with most time being hands-off training.** For a small corpus (~500 documents), the entire pipeline completes in about 2–3 hours.

The pipeline can run end-to-end, but each stage can also be executed independently depending on your starting point. For example, if you have raw documents, you can begin with synthetic data generation (SDG), while datasets that already include hard negatives can skip earlier steps and go directly to fine-tuning. Since every stage uses standard formats such as JSON, BEIR, and ONNX, it’s easy to integrate custom components or reuse intermediate outputs in other workflows. The recipe is also flexible in how it runs, supporting execution on a local machine, inside Docker containers, or on Slurm-based clusters.

	
		
	
	
		Try It Yourself
	

If you have domain documents and some time in your hand, you can generate your first batch of synthetic training data today! The full pipeline - from documents to a deployed, domain-adapted embedding model - runs in under a day on a single GPU. You can start with our ready-made [nvidia/Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) dataset to try the pipeline right away. Let us know what you build.
**Star the repos** for [Nemotron](https://github.com/NVIDIA/Nemotron), [NeMo Data Designer](https://github.com/NVIDIA/NeMo-Data-Designer) and [NeMo Automodel](https://github.com/NVIDIA/NeMo-Automodel) if you find them useful.

Build a Domain-Specific Embedding Model in Under a Day

A successful fine-tune typically results in a 15% improvement in nDCG@10 and Recall@10 within Fine-tuned)

ONNX only (runs anywhere)

FP8 quantization for further speedup

1. Generate synthetic training data from your documents

2. Prepare the training data (split data, mine hard negatives, unroll)

3. Fine-tune the embedding model

4. Evaluate the base vs. fine-tuned model

5. Export the optimized model

6. Deploy the model

Related posts

Nemotron 3 Content Safety 4B: Multimodal, Multilingual Content Moderation

What's New in Mellea 0.4.0 + Granite Libraries Release

Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Holotron-12B - High Throughput Computer Use Agent

A successful fine-tune typically results in a 15% improvement in nDCG@10 and Recall@10 within Fine-tuned)

ONNX only (runs anywhere)

FP8 quantization for further speedup

1. Generate synthetic training data from your documents

2. Prepare the training data (split data, mine hard negatives, unroll)

3. Fine-tune the embedding model

4. Evaluate the base vs. fine-tuned model

5. Export the optimized model

6. Deploy the model

Related posts

Nemotron 3 Content Safety 4B: Multimodal, Multilingual Content Moderation

What's New in Mellea 0.4.0 + Granite Libraries Release

**Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**

Holotron-12B - High Throughput Computer Use Agent

Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding