[Paper] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Published: November 26, 2025
Source: arXiv (2511.21101v1)

Overview

MortgageLLM tackles a common pain point for developers building AI products in regulated industries: how to give a large language model deep, domain‑specific expertise without sacrificing its ability to follow natural‑language instructions. By combining a clever residual‑instruction technique with a dual‑expert architecture, the authors turn a general‑purpose LLaMA‑3.1‑8B model into a specialist that excels at both conversational Q&A and structured tasks like classification and summarization in the mortgage finance space.

Key Contributions

  • Residual Instruction Transfer – a method that restores the instruction‑following ability lost during heavy domain pre‑training, without an additional round of costly supervised instruction tuning.
  • Dual‑Expert Architecture – two specialist heads built from the same base model:
    1. A conversational Q&A expert, optimized with Direct Preference Optimization (DPO; the standard objective is shown after this list).
    2. A structured‑task expert (optimized with Supervised Fine‑Tuning, SFT) for classification and summarization.
  • Intelligent Task Routing – a few‑shot self‑classification step that automatically directs each incoming request to the appropriate expert, so the system operates end to end without manual dispatch.
  • Domain‑Specific Benchmarks – new mortgage‑finance evaluation sets that expose the model to realistic loan‑approval documents, underwriting notes, and customer queries.
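
For context on the alignment step (this formula is from Rafailov et al., 2023, not reproduced from the paper itself): DPO optimizes the preference objective below, where y_w and y_l are the preferred and rejected responses and β controls how far the policy may drift from the reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
    \left[ \log \sigma \left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```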

Methodology

  1. Base Model Selection – Start with LLaMA‑3.1‑8B‑Instruct, a strong open‑source LLM that already understands instruction prompts.
  2. Domain‑Adaptive Pre‑Training – Feed the model millions of mortgage‑related tokens (loan applications, rate tables, regulatory texts) to inject sector knowledge.
  3. Residual Instruction Transfer – After domain pre‑training, the model’s instruction‑following ability degrades. The authors compute the difference (residual) between the original instruction‑tuned weights and the domain‑adapted weights, then add this residual back, yielding a model that “remembers” how to follow instructions while retaining mortgage expertise (a weight‑arithmetic sketch follows this list).
  4. Dual‑Track Specialization
    • Conversational Expert: Trained with DPO on human‑rated dialogue data to maximize helpfulness and safety.
    • Structured‑Task Expert: Trained with SFT on labeled classification and summarization datasets (e.g., “Is this loan eligible?”).
  5. Task Routing Layer – When a user query arrives, the system runs a lightweight few‑shot classifier (implemented by the structured‑task expert) to decide whether the request is conversational or structured, then forwards it to the appropriate specialist (a routing sketch follows after the next paragraph).

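The paper does not publish its merge code; the following is a minimal sketch of the weight‑space arithmetic described in step 3, assuming plain PyTorch state dicts. The scaling coefficient `alpha` is our assumption, not the paper’s: adding the full residual unscaled would exactly recover the original instruct checkpoint, so some interpolation or partial transfer is implied.

```python
import torch

def restore_instructions(
    domain_sd: dict[str, torch.Tensor],
    instruct_sd: dict[str, torch.Tensor],
    alpha: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> dict[str, torch.Tensor]:
    """Add the instruction residual back onto domain-adapted weights:
    merged = domain + alpha * (instruct - domain).

    alpha = 0 keeps the pure domain model; alpha = 1 recovers the
    original instruct checkpoint exactly.
    """
    merged = {}
    for name, w_dom in domain_sd.items():
        residual = instruct_sd[name] - w_dom   # instruction-following delta
        merged[name] = w_dom + alpha * residual
    return merged
```
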
All steps run on commodity GPU clusters (nodes of eight A100s), making the pipeline reproducible for most AI teams.
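
To make step 5 concrete, here is a hedged sketch of few‑shot self‑classification routing. The prompt wording, labels, and the `generate` helper are all illustrative; the paper’s actual prompt and dispatch code are not released.

```python
ROUTER_PROMPT = """Classify each request as CONVERSATIONAL or STRUCTURED.

Request: What does an escrow account actually cover?
Label: CONVERSATIONAL

Request: Summarize this underwriting note: "Borrower DTI is 41%..."
Label: STRUCTURED

Request: {query}
Label:"""


def route(query: str, conversational_expert, structured_expert, generate):
    """Few-shot self-classification: the structured-task expert labels
    the query, then the matching specialist produces the answer."""
    label = generate(structured_expert,
                     ROUTER_PROMPT.format(query=query),
                     max_new_tokens=3)
    expert = (structured_expert if "STRUCTURED" in label.upper()
              else conversational_expert)
    return generate(expert, query)
```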

Results & Findings

| Task | Metric (Higher = Better) | MortgageLLM v2 | LLaMA‑3.1‑8B‑Instruct |
| --- | --- | --- | --- |
| Summarization (LLM‑as‑Judge) | Score | 4.58 | 3.99 |
| Q&A (LLM‑as‑Judge) | Score | 4.09 | 4.00 |
| Classification (LLM‑as‑Judge) | Score | 2.60 | 1.20 |
| Summarization (BERTScore) | 0–1 | 0.77 | 0.74 |
| Q&A (BERTScore) | 0–1 | 0.68 | 0.58 |
| Classification (BERTScore) | 0–1 | 0.75 | 0.73 |
  • The residual instruction step recovered ~95 % of the original instruction fidelity while adding domain knowledge.
  • The dual‑expert split avoided the “one‑size‑fits‑all” degradation seen when a single model is jointly fine‑tuned for both dialogue and structured tasks.
  • Task routing added < 10 ms latency overhead, preserving near‑real‑time responsiveness.
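
The evaluation harness itself is not released. As an illustration only, the semantic‑similarity rows in the table can be reproduced in spirit with the open‑source bert-score package; the candidate and reference strings below are placeholders, not the paper’s data:

```python
from bert_score import score

candidates = ["The loan was approved at a 6.1% fixed rate."]  # model outputs
references = ["Loan approved; 30-year fixed at 6.1%."]        # gold summaries

# Returns per-example precision, recall, and F1 tensors in [0, 1].
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.2f}")
```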

Practical Implications

  • Faster Time‑to‑Market for FinTech Apps – Teams can plug MortgageLLM into existing chat‑bots or document‑processing pipelines and instantly get higher accuracy on loan‑eligibility checks, risk summarizations, and customer support without building separate models.
  • Regulatory Compliance – Because the model is trained on actual mortgage regulations and can produce traceable classification outputs, auditors can more easily verify that AI‑generated advice aligns with legal requirements.
  • Cost‑Effective Scaling – The 8‑billion‑parameter backbone keeps inference costs modest (≈ $0.0004 per 1K tokens on typical GPU instances), making it viable for SaaS platforms that need to serve thousands of daily queries (see the worked example after this list).
  • Reusable Blueprint – The residual instruction transfer and dual‑expert routing are domain‑agnostic. Developers in insurance, healthcare, or legal tech can adopt the same pipeline to create specialist LLMs without sacrificing conversational quality.
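
To make the cost claim concrete (our illustrative numbers, not the paper’s): a service handling 10,000 queries a day at roughly 1,000 tokens each processes about 10 million tokens daily. At ≈ $0.0004 per 1K tokens, that is 10,000 × $0.0004 ≈ $4 per day of inference cost.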

Limitations & Future Work

  • Data Coverage – The pre‑training corpus, while large, still misses niche mortgage products (e.g., reverse mortgages) that could affect edge‑case performance.
  • Model Size – An 8‑B backbone may hit limits on extremely long documents (e.g., full loan portfolios). Scaling to 30‑B or using retrieval‑augmented generation could address this.
  • Routing Accuracy – The few‑shot classifier occasionally misroutes ambiguous queries, leading to sub‑optimal responses; a more robust meta‑learner is a planned improvement.
  • Explainability – Current outputs lack built‑in rationale generation for regulatory audits; future work will integrate chain‑of‑thought prompting or post‑hoc attribution methods.

MortgageLLM demonstrates that with the right training tricks, you don’t have to choose between domain expertise and conversational polish. For developers building AI‑first products in regulated sectors, the paper offers a practical, reproducible recipe to get the best of both worlds.

Authors

  • Manish Jain
  • Satheesh Kumar Ponnambalam
  • Salman Faroz
  • Chandrakanth Lns
  • Vinay Sharma

Paper Information

  • arXiv ID: 2511.21101v1
  • Categories: cs.CL, cs.LG
  • Published: November 26, 2025