[Paper] G-Loss: Graph-Guided Fine-Tuning of Language Models

Published: April 28, 2026 at 12:55 PM EDT

Source: arXiv - 2604.25853v1

Overview

The paper introduces G‑Loss, a new loss function for fine‑tuning large language models (LLMs) such as BERT. By weaving a graph that reflects global document‑level similarities into the training objective, G‑Loss helps models learn embeddings that respect the broader semantic landscape—something traditional losses (cross‑entropy, contrastive, etc.) overlook.

Key Contributions

  • Graph‑guided loss formulation that integrates semi‑supervised label propagation directly into the fine‑tuning objective.
  • Document‑similarity graph construction from the embedding space, capturing global semantic relations across the whole training corpus.
  • Empirical validation on five diverse text classification benchmarks (MR, R8, R52, Ohsumed, 20NG), showing faster convergence and higher accuracy versus standard loss functions.
  • Visualization and analysis of the learned embedding spaces, demonstrating improved semantic coherence and class separability.

Methodology

  1. Base Model – Start with a pre‑trained transformer (e.g., BERT) and obtain initial token/CLS embeddings for every document in the fine‑tuning set.
  2. Graph Construction – Compute pairwise cosine similarities between document embeddings and keep the top‑k nearest neighbors for each node, forming an undirected similarity graph \(G = (V, E)\).
  3. Label Propagation – Treat the available class labels as seeds and run a semi‑supervised propagation algorithm (e.g., personalized PageRank) on \(G\) to generate soft pseudo‑labels for the unlabeled nodes.
  4. G‑Loss Definition – Combine the standard supervised loss (cross‑entropy) with a graph‑regularization term that penalizes discrepancies between a node’s predicted class distribution and the propagated distributions of its neighbors. Formally:

\[
\mathcal{L}_{\text{G-Loss}} = \mathcal{L}_{\text{sup}} + \lambda \sum_{(i,j)\in E} w_{ij}\, \text{KL}\big(p_i \,\|\, p_j\big)
\]

where \(w_{ij}\) are edge weights, \(p_i\) are the model’s predicted class distributions, and \(\lambda\) balances the two terms.

  5. Fine‑tuning Loop – Optimize the combined loss end‑to‑end; the graph is recomputed periodically (e.g., every epoch) to reflect the evolving embedding space. A sketch of the graph construction and loss term follows this list.
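
The sketch below is a minimal PyTorch rendering of the graph‑regularization term for a single batch: it builds the top‑k cosine‑similarity graph over the batch’s document embeddings and applies the edge‑weighted KL penalty from the equation above. It omits label propagation, and all names (`graph_regularizer`, `embeddings`, `logits`, `top_k`, `lam`) are illustrative, not taken from the paper.

```python
import torch.nn.functional as F

def graph_regularizer(embeddings, logits, top_k=10, lam=0.1):
    """Graph term of G-Loss over one batch: build a top-k cosine-similarity
    graph between document embeddings and penalize KL divergence between the
    predicted class distributions of neighboring documents."""
    # 1. Cosine-similarity graph (assumes batch size > top_k)
    normed = F.normalize(embeddings, dim=-1)        # (N, d)
    sim = normed @ normed.t()                       # (N, N)
    sim.fill_diagonal_(float("-inf"))               # exclude self-edges
    weights, neighbors = sim.topk(top_k, dim=-1)    # edge weights / indices, both (N, k)

    # 2. Predicted class distributions p_i
    log_p = F.log_softmax(logits, dim=-1)           # (N, C)
    p = log_p.exp()

    # 3. Edge-weighted KL(p_i || p_j) over the top-k neighbor edges
    kl = (p.unsqueeze(1) * (log_p.unsqueeze(1) - log_p[neighbors])).sum(-1)  # (N, k)
    graph_term = (weights.clamp(min=0.0) * kl).mean()
    return lam * graph_term

# Combined objective: supervised cross-entropy plus the graph term
# loss = F.cross_entropy(logits, labels) + graph_regularizer(cls_embeddings, logits)
```

In the paper the graph covers the whole training corpus and is refreshed periodically; the per‑batch version above is only the simplest runnable approximation.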

Results & Findings

| Dataset | Baseline (Cross‑Entropy) | G‑Loss | Δ Accuracy | Convergence (epochs) |
|---|---|---|---|---|
| MR (sentiment) | 88.2 % | 90.5 % | +2.3 % | 3 → 2 |
| R8 (topic) | 94.1 % | 95.6 % | +1.5 % | 4 → 2 |
| R52 (topic) | 92.8 % | 94.3 % | +1.5 % | 5 → 3 |
| Ohsumed (medical) | 78.4 % | 81.0 % | +2.6 % | 6 → 4 |
| 20NG (news) | 84.7 % | 86.9 % | +2.2 % | 5 → 3 |

  • Faster convergence: G‑Loss typically reaches its peak performance 30–50 % earlier than the baseline.
  • Richer embeddings: t‑SNE visualizations show tighter intra‑class clusters and clearer inter‑class margins.
  • Robustness to label scarcity: When only 20 % of training labels are retained, G‑Loss degrades only ~1 % versus ~3 % for the baseline, highlighting the benefit of the graph’s semi‑supervised signal.

Practical Implications

  • Improved downstream classifiers: Developers can plug G‑Loss into existing fine‑tuning pipelines (PyTorch, Hugging Face Transformers) to boost accuracy on any text classification task without architectural changes (see the Trainer sketch after this list).
  • Reduced training time: Faster convergence translates to lower GPU hours, which is attractive for production environments where model updates are frequent.
  • Better handling of noisy or sparse labels: The graph‑based regularization acts as a “semantic smoothing” layer, making models more tolerant to mislabeled data—a common pain point in real‑world corpora.
  • Potential for retrieval & clustering: Since G‑Loss yields embeddings that respect global similarity, the same fine‑tuned model can be reused for semantic search, duplicate detection, or topic clustering with minimal extra work.
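
As an example of that drop‑in integration, one common pattern is to subclass the Hugging Face `Trainer` and override `compute_loss`. This is a sketch under the assumption that the `graph_regularizer` helper from the Methodology section is in scope and that the model is a standard sequence‑classification transformer; the hyper‑parameter values are illustrative.

```python
import torch.nn.functional as F
from transformers import Trainer

class GLossTrainer(Trainer):
    """Trainer that adds the graph-regularization term to cross-entropy
    (sketch; assumes the graph_regularizer helper defined earlier)."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs, output_hidden_states=True)
        logits = outputs.logits
        # Final-layer [CLS] embedding as the document representation.
        cls_embeddings = outputs.hidden_states[-1][:, 0]

        loss = F.cross_entropy(logits, labels) + graph_regularizer(
            cls_embeddings, logits, top_k=10, lam=0.1
        )
        return (loss, outputs) if return_outputs else loss
```

The subclass is used exactly like the standard `Trainer`, so datasets, tokenizer, and `TrainingArguments` stay unchanged.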

Limitations & Future Work

  • Graph construction overhead: Building and updating the similarity graph can be costly for very large datasets; the authors suggest approximate nearest‑neighbor methods as a mitigation (a sketch follows this list).
  • Hyper‑parameter sensitivity: The balance factor \(\lambda\) and the number of neighbors \(k\) need careful tuning; default values work well on the benchmarks but may need adjustment for domain‑specific data.
  • Scope limited to classification: Experiments focus on supervised classification; extending G‑Loss to generation‑oriented tasks (e.g., QA, summarization) remains an open question.
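
To illustrate the approximate nearest‑neighbor mitigation, the top‑k graph can be built with an ANN index such as FAISS instead of an exact all‑pairs similarity matrix. A minimal sketch, assuming corpus embeddings arrive as a NumPy array; the HNSW parameters are illustrative defaults, not values from the paper.

```python
import numpy as np
import faiss  # approximate nearest-neighbor search library

def build_knn_graph(embeddings: np.ndarray, k: int = 10):
    """Build a top-k cosine-similarity graph with an HNSW index rather than
    an exact O(N^2) similarity matrix (sketch; parameters are illustrative)."""
    x = np.ascontiguousarray(embeddings.astype(np.float32))
    faiss.normalize_L2(x)  # after L2 normalization, inner product == cosine similarity
    index = faiss.IndexHNSWFlat(x.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
    index.add(x)
    sims, neighbors = index.search(x, k + 1)  # nearest hit is (almost always) the point itself
    return sims[:, 1:], neighbors[:, 1:]      # drop the self-match
```

The returned similarities and neighbor indices play the role of the edge weights \(w_{ij}\) and the neighbor sets used by the loss.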

Overall, G‑Loss offers a pragmatic way to inject global semantic awareness into language model fine‑tuning, promising tangible gains for developers building robust NLP services.

Authors

  • Aditya Sharma
  • Vinti Agarwal
  • Rajesh Kumar

Paper Information

  • arXiv ID: 2604.25853v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: April 28, 2026
  • PDF: Download PDF