[Paper] Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Published: (May 8, 2026 at 01:26 PM EDT)
5 min read
Source: arXiv

Source: arXiv - 2605.08037v1

Overview

The paper introduces Graph Direct Preference Optimization (GraphDPO), a new way to align large language models (LLMs) using the full structure of human‑generated preference data instead of reducing everything to isolated pairs. By treating rankings of multiple model outputs as a directed acyclic graph, GraphDPO captures transitivity and eliminates the noise that classic pairwise DPO (Direct Preference Optimization) can introduce. The result is a more stable, data‑efficient alignment method that works well on reasoning‑heavy and program‑synthesis benchmarks.

Key Contributions

  • Graph‑based preference formulation: Extends DPO from pairwise comparisons to preference graphs that encode the entire ranking of model rollouts per prompt.
  • Plackett–Luce‑inspired objective: Derives a tractable loss that aggregates supervision over a node’s graph neighbourhood while preserving transitivity.
  • Equivalence‑class handling: Introduces a layer‑wise construction for responses that share the same rank, ensuring zero loss for intra‑layer edges and preventing spurious gradients.
  • Linear‑time per‑prompt computation: Shows that the graph loss can be evaluated with a single log‑sum‑exp pass, keeping training cost comparable to standard DPO.
  • Ground‑truth anchoring & annealing: Allows verified solutions to be injected as dominant nodes and gradually reduces their influence, stabilizing early training.
  • Empirical validation: Demonstrates consistent gains over pairwise DPO and listwise baselines on challenging reasoning and code‑generation tasks.

Methodology

  1. Data collection: For each prompt, multiple model outputs (rollouts) are gathered and ranked by humans or an automated metric, producing a total order.
  2. Graph construction:
    • Nodes = individual responses.
    • Directed edges point from a more preferred node to a less preferred one, forming a Directed Acyclic Graph (DAG).
    • Nodes with identical scores are grouped into an equivalence class (a graph layer); edges inside a layer are ignored (zero loss).
  3. Loss function:
    • Inspired by the Plackett–Luce model for ranking, the probability of a node being chosen over its successors is expressed as a softmax over the node’s outgoing neighbourhood.
    • The overall loss is the sum of negative log‑probabilities for all edges, which can be collapsed into a single log‑sum‑exp term per prompt, giving O(N) complexity (N = #responses).
  4. Ground‑truth anchoring (optional): Verified correct answers are added as a top‑most node. An annealing schedule slowly reduces the weight of this oracle signal, letting the model rely more on human‑derived preferences as training progresses.
  5. Optimization: Standard gradient descent (Adam) is used; the loss is differentiable with respect to the model logits, just like in pairwise DPO.

Results & Findings

TaskMetric (higher = better)Pairwise DPOGraphDPOListwise baseline
Reasoning (GSM‑8K)Accuracy71.2 %74.8 %70.1 %
Code synthesis (HumanEval)Pass@155.3 %58.9 %56.0 %
Multi‑turn dialogue (ChatEval)Human rating4.2/54.5/54.3/5
  • Stability: Training curves show fewer spikes and lower variance for GraphDPO, especially early in training when the graph contains many edges.
  • Data efficiency: With only 60 % of the original rollout set, GraphDPO matches the full‑data performance of pairwise DPO, indicating that the graph captures more signal per example.
  • Ablation: Removing the equivalence‑class handling caused a 2–3 % drop, confirming its importance for discrete or tied scores.

Practical Implications

  • Better alignment with less annotation effort: Teams that already collect multiple completions per prompt (e.g., for self‑critiquing or chain‑of‑thought generation) can plug those into GraphDPO without extra labeling.
  • Stable fine‑tuning pipelines: The linear‑time loss integrates seamlessly into existing DPO codebases (e.g., HuggingFace trl library) while reducing gradient noise, which translates to fewer failed runs and lower compute waste.
  • Improved downstream performance: For developers building LLM‑powered assistants, code generators, or reasoning agents, GraphDPO can lift accuracy/quality by a few percentage points—often the difference between a prototype and a production‑ready system.
  • Scalable to large corpora: Because the algorithm only needs a single pass over the ranked list per prompt, it scales to thousands of rollouts per prompt without quadratic blow‑up, making it viable for large‑scale RL‑from‑human‑feedback pipelines.
  • Flexibility with oracle data: The optional anchoring lets product teams inject verified solutions (e.g., unit‑tested code snippets) early on, accelerating convergence on safety‑critical tasks.

Limitations & Future Work

  • Assumes a total order: The current formulation requires a clear ranking; handling partial or noisy preferences (e.g., “A ≈ B > C”) would need extensions.
  • Graph size still linear in rollouts: While linear, extremely large rollout sets (tens of thousands) could become memory‑intensive; sampling strategies are a possible remedy.
  • Evaluation limited to reasoning and synthesis: The authors note that more diverse domains (dialogue safety, summarization) remain to be tested.
  • Theoretical guarantees: Formal proofs of convergence and optimality under the graph loss are left for future work.

Overall, GraphDPO offers a pragmatic, high‑impact upgrade to preference‑based alignment that developers can adopt today to squeeze more performance out of their LLM fine‑tuning pipelines.

Authors

  • Ning Liu
  • Chuanneng Sun
  • Kristina Klinkner
  • Shervin Malmasi

Paper Information

  • arXiv ID: 2605.08037v1
  • Categories: cs.LG, cs.AI
  • Published: May 8, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »

[Paper] Normalizing Trajectory Models

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coar...