[Paper] Model Agreement via Anchoring

Published: February 26, 2026 at 01:59 PM EST
5 min read

Source: arXiv - 2602.23360v1

Overview

The paper “Model Agreement via Anchoring” tackles a surprisingly practical problem: when we train two machine learning models on independent data, how often do they disagree? By treating disagreement as the expected squared difference between their predictions, the authors show that a simple analytical trick—anchoring the two models to their average—yields provable guarantees that disagreement can be driven to zero simply by scaling natural training parameters (e.g., number of boosted rounds, depth of a tree, or size of a neural‑net search space). The results apply to a range of widely‑used algorithms, offering a fresh lens on model stability and ensemble design.

Key Contributions

  • Anchoring Technique: Introduces a general proof method that bounds independent‑model disagreement by anchoring each model to the average of the pair.
  • Unified Theory Across Algorithms: Demonstrates how the same anchoring argument yields disagreement‑vanishing guarantees for:
    1. Stacked aggregation (ensemble of arbitrary base learners) – disagreement → 0 as the number of stacked models k grows.
    2. Gradient boosting – disagreement → 0 as the number of boosting iterations k increases.
    3. Neural‑network architecture search – disagreement → 0 as the search space size n (e.g., number of hidden units or layers) expands.
    4. Regression‑tree ensembles – disagreement → 0 as tree depth d grows.
  • Broad Applicability: While the core proofs are presented for 1‑D regression with squared loss, the authors extend the results to multi‑dimensional regression and any strongly convex loss (e.g., logistic loss).
  • Parameter‑Driven Control: Provides a clean, interpretable way to tune a single hyperparameter (stack size, boost rounds, architecture size, depth) to guarantee model agreement without coordinating the two training runs.

Methodology

  1. Disagreement Metric:

    • For two models f and g trained on independent samples, disagreement is defined as
      \[ \mathbb{E}_{x}\big[(f(x)-g(x))^{2}\big]. \]
    • This metric aligns with the usual squared‑error loss, making the analysis directly relevant to regression tasks.
  2. Anchoring Argument:

    • Define the anchor as the pointwise average \(\bar{h}(x)=\frac{f(x)+g(x)}{2}\).
    • By convexity of the loss, the expected loss of each model can be related to the loss of the anchor plus a term that captures the deviation of each model from the anchor.
    • The key insight: the deviation term can be bounded using properties of the learning algorithm (e.g., bias‑variance trade‑off, smoothness of the objective).
  3. Algorithm‑Specific Instantiations:

    • Stacked Aggregation: Treat the stack as a linear combination of base learners; the averaging effect of many learners shrinks the deviation term at a rate \(O(1/k)\).
    • Gradient Boosting: Each iteration adds a weak learner that reduces the residual; the cumulative effect yields a geometric decay of disagreement with the number of rounds.
    • Neural‑Net Architecture Search: By expanding the hypothesis class (more units/layers), the empirical risk minimizer gets closer to the anchor, driving disagreement down as \(O(1/n)\).
    • Regression Trees: Deeper trees can approximate the anchor more finely; the bound scales as \(O(2^{-d})\) for fixed‑depth trees.
  4. Extension to General Losses:

    • The authors replace the squared loss with any strongly convex loss \(\ell\) and repeat the anchoring steps, leveraging strong convexity to retain the same decay rates.
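The anchoring step rests on an exact pointwise identity: with \(\bar{h}=(f+g)/2\), each model's deviation from the anchor is half the gap between the models, so the squared disagreement equals twice the summed squared deviations from the anchor. A minimal numerical sketch (illustrative only, not the paper's code) checks this identity on random inputs:

```python
# Anchoring identity: with h(x) = (f(x) + g(x)) / 2,
#   (f(x) - g(x))^2 == 2 * [(f(x) - h(x))^2 + (g(x) - h(x))^2],
# so bounding each model's deviation from the anchor bounds disagreement.
import random

def disagreement(f, g, xs):
    """Empirical estimate of E_x[(f(x) - g(x))^2]."""
    return sum((f(x) - g(x)) ** 2 for x in xs) / len(xs)

def anchored_deviation(f, g, xs):
    """Empirical 2 * E_x[(f(x) - h(x))^2 + (g(x) - h(x))^2] with h = (f + g) / 2."""
    h = lambda x: (f(x) + g(x)) / 2
    return 2 * sum((f(x) - h(x)) ** 2 + (g(x) - h(x)) ** 2 for x in xs) / len(xs)

random.seed(0)
f = lambda x: 1.5 * x + 0.3          # two slightly different trained models
g = lambda x: 1.4 * x + 0.5
xs = [random.uniform(-1, 1) for _ in range(10_000)]

# The two quantities agree up to floating-point rounding.
assert abs(disagreement(f, g, xs) - anchored_deviation(f, g, xs)) < 1e-9
```

Because the identity is exact, any algorithm‑specific bound on the deviation of a single model from the anchor translates directly into a bound on the disagreement between the two models.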

Results & Findings

| Algorithm | Controlling Parameter | Disagreement Decay |
| --- | --- | --- |
| Stacked aggregation | Number of stacked models \(k\) | \(\mathbb{E}[(f-g)^2] = O(1/k)\) |
| Gradient boosting | Boosting iterations \(k\) | \(\mathbb{E}[(f-g)^2] = O(\rho^{k})\) for some \(\rho<1\) |
| NN architecture search | Search space size \(n\) (e.g., width) | \(\mathbb{E}[(f-g)^2] = O(1/n)\) |
| Regression trees | Tree depth \(d\) | \(\mathbb{E}[(f-g)^2] = O(2^{-d})\) |
  • Interpretation: As we increase the natural hyperparameter, the two independently trained models become virtually indistinguishable in expectation.
  • Generality: The same asymptotic rates hold for multi‑dimensional regression and for other strongly convex losses such as the logistic loss (the hinge loss, being piecewise linear, is not strongly convex and falls outside these guarantees).
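A toy illustration of the stacked‑aggregation row (my own sketch, not an experiment from the paper): model each "stack" as the average of \(k\) independently trained base learners that each predict the true value plus independent noise. The expected squared disagreement between two such stacks is then exactly \(2\sigma^2/k\), matching the \(O(1/k)\) rate:

```python
# Monte Carlo check of the O(1/k) disagreement decay for averaged ensembles.
import random

def stack_prediction(k, true_value=1.0, sigma=0.5, rng=random):
    """Average of k base learners, each predicting true_value plus Gaussian noise."""
    return sum(true_value + rng.gauss(0, sigma) for _ in range(k)) / k

def mean_disagreement(k, trials=5_000):
    """Empirical E[(f - g)^2] over pairs of independently built size-k stacks."""
    rng = random.Random(0)
    return sum(
        (stack_prediction(k, rng=rng) - stack_prediction(k, rng=rng)) ** 2
        for _ in range(trials)
    ) / trials

for k in (1, 4, 16, 64):
    # Expected value is 2 * sigma^2 / k, so each 4x increase in k
    # shrinks disagreement roughly 4x.
    print(k, mean_disagreement(k))
```

The same simulation style could probe the geometric rate for boosting or the \(2^{-d}\) rate for trees, with the caveat that the constants in the paper's bounds are asymptotic.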

Practical Implications

  1. Stable Ensembles Without Coordination – Developers can safely train multiple models in parallel (e.g., on different shards of data) and be confident that, by scaling the ensemble size or boosting rounds, the resulting predictors will converge to the same function. This reduces the need for explicit model synchronization or voting schemes.

  2. Hyperparameter Guidance – The bounds give a quantitative target: if you need disagreement below a threshold \(\epsilon\), you can solve for the required k, d, or n directly from the decay formulas.
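Inverting the decay formulas is straightforward algebra. In the sketch below, the constants `C` and `rho` are hypothetical placeholders (not values from the paper) that would need empirical calibration, since the stated rates are asymptotic:

```python
# Solve each decay formula for the smallest hyperparameter meeting a target eps.
# C and rho are illustrative constants, not quantities derived in the paper.
import math

def stack_size_for(eps, C=1.0):
    """Smallest k with C/k <= eps (stacked aggregation, O(1/k))."""
    return math.ceil(C / eps)

def boost_rounds_for(eps, C=1.0, rho=0.9):
    """Smallest k with C * rho**k <= eps (gradient boosting, O(rho^k))."""
    return math.ceil(math.log(eps / C) / math.log(rho))

def tree_depth_for(eps, C=1.0):
    """Smallest d with C * 2**(-d) <= eps (regression trees, O(2^-d))."""
    return math.ceil(math.log2(C / eps))

print(stack_size_for(0.01))    # 100
print(boost_rounds_for(0.01))  # 44
print(tree_depth_for(0.01))    # 7
```

Note how the exponential rates pay off: reaching the same \(\epsilon\) takes 100 stacked models but only 44 boosting rounds or depth‑7 trees under these (assumed) constants.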

  3. Robustness to Data Drift – In production, data pipelines often evolve. Knowing that disagreement shrinks with more expressive models suggests that periodically increasing model capacity can mitigate drift‑induced variance between successive deployments.

  4. Simplified Model Auditing – When regulatory or safety constraints demand “model consistency,” the anchoring framework provides a provable way to certify that two independently trained versions of a system will not diverge beyond a pre‑specified bound.

  5. Resource Allocation – The results help balance compute vs. stability: for gradient boosting, a modest increase in iteration count yields exponential decay, often cheaper than deepening trees or enlarging neural nets.

Limitations & Future Work

  • Assumption of Strong Convexity: The guarantees rely on strongly convex losses; extending to non‑convex objectives (e.g., modern deep learning with cross‑entropy) remains open.
  • Worst‑Case Bounds: The derived rates are asymptotic and may be loose for finite datasets; empirical calibration is needed to translate them into concrete hyperparameter choices.
  • Model Class Restrictions: While the paper covers several popular algorithms, it does not address unsupervised learning, reinforcement learning, or generative models, where disagreement notions differ.
  • Data Distribution Dependence: The analysis abstracts away the underlying data distribution; future work could incorporate distributional characteristics (e.g., heavy tails) to refine the bounds.

Overall, “Model Agreement via Anchoring” equips practitioners with a theoretically grounded, yet surprisingly simple, tool to control model disagreement across a suite of everyday machine‑learning pipelines.

Authors

  • Eric Eaton
  • Surbhi Goel
  • Marcel Hussing
  • Michael Kearns
  • Aaron Roth
  • Sikata Bela Sengupta
  • Jessica Sorrell

Paper Information

  • arXiv ID: 2602.23360v1
  • Categories: cs.LG, cs.AI
  • Published: February 26, 2026
