[Paper] Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection

Published: February 25, 2026 at 11:56 AM EST
4 min read

Source: arXiv - 2602.22107v1

Overview

The paper Don’t stop me now: Rethinking Validation Criteria for Model Parameter Selection examines a surprisingly overlooked question: how the choice of validation metric influences the final test performance of neural classifiers. By systematically comparing accuracy‑based and loss‑based validation strategies—both with early stopping and with post‑hoc checkpoint selection—the authors reveal that the conventional practice of stopping training on validation accuracy can actually hurt generalisation.

Key Contributions

  • Empirical comparison of validation criteria (accuracy vs. three loss functions) across multiple datasets and model architectures.
  • Rigorous evaluation of early stopping vs. post‑hoc checkpoint selection under a $k$‑fold protocol, exposing systematic performance gaps.
  • Statistical evidence that loss‑based validation criteria are more stable and often outperform accuracy‑based early stopping.
  • Practical recommendation: avoid validation accuracy for model selection; prefer loss‑based metrics, especially when early stopping is used.

Methodology

  1. Models & Datasets – Fully‑connected neural networks were trained on several standard classification benchmarks (e.g., MNIST and CIFAR‑10).
  2. Training Objectives – Three loss functions were explored:
    • Standard cross‑entropy (CE)
    • C‑Loss (a calibrated variant)
    • PolyLoss (a polynomial‑augmented CE)
  3. Validation Strategies – For each training run the authors recorded every epoch’s checkpoint and then applied four selection rules:
    • Early stopping with patience, using validation accuracy
    • Early stopping with patience, using validation loss (each of the three loss types)
    • Post‑hoc selection (no early stopping) based on validation accuracy
    • Post‑hoc selection based on validation loss (each loss type)
  4. Evaluation Protocol – A $k$‑fold cross‑validation scheme ensured that results were not dataset‑specific. Test accuracy of the selected checkpoint was compared against the oracle checkpoint (the one with highest test accuracy across all epochs).
  5. Statistical Analysis – Paired tests and confidence intervals quantified the significance of observed differences.
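The four selection rules above can be sketched in plain Python. This is a simplified reconstruction, not the authors' code: per‑epoch metric histories stand in for real checkpoints, and `patience`‑based stopping returns the best epoch seen before stopping.

```python
def early_stop_index(history, patience, higher_is_better):
    """Index of the checkpoint an early-stopping rule keeps.

    Training "stops" once the monitored metric fails to improve for
    `patience` consecutive epochs; the best epoch seen so far is kept.
    """
    best_idx, best_val, wait = 0, history[0], 0
    for i, v in enumerate(history[1:], start=1):
        improved = v > best_val if higher_is_better else v < best_val
        if improved:
            best_idx, best_val, wait = i, v, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_idx

def post_hoc_index(history, higher_is_better):
    """Post-hoc selection: scan the full run and pick the best epoch."""
    key = max if higher_is_better else min
    return history.index(key(history))

# Toy run: validation accuracy plateaus early while validation loss
# keeps improving, so accuracy-based stopping halts prematurely.
val_acc  = [0.90, 0.92, 0.92, 0.92, 0.92, 0.93]
val_loss = [0.40, 0.31, 0.28, 0.26, 0.25, 0.24]

print(early_stop_index(val_acc,  patience=2, higher_is_better=True))    # stops at epoch 1
print(early_stop_index(val_loss, patience=2, higher_is_better=False))   # runs to epoch 5
print(post_hoc_index(val_loss, higher_is_better=False))                 # epoch 5
```

On this toy run, accuracy‑based stopping keeps epoch 1 while loss‑based stopping and post‑hoc selection reach epoch 5, mirroring the kind of gap the paper measures against the oracle checkpoint.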

Results & Findings

| Validation rule | Typical test‑accuracy gap vs. oracle | Stability across folds |
| --- | --- | --- |
| Early stopping on validation accuracy | Largest negative gap (up to ~2% below oracle) | High variance |
| Early stopping on validation loss (any of the three) | Small, often negligible | Consistently low variance |
| Post‑hoc selection on validation loss | Comparable to loss‑based early stopping, sometimes slightly better | Very stable |
| Post‑hoc selection on validation accuracy | Better than accuracy‑based early stopping, but still worse than loss‑based rules | Moderate variance |

Key take‑aways:

  1. Accuracy‑based early stopping consistently underperforms both loss‑based early stopping and any post‑hoc strategy.
  2. Loss‑based validation criteria (CE, C‑Loss, PolyLoss) produce comparable, more reliable test accuracy regardless of whether early stopping is used.
  3. No single validation rule matches the oracle across all folds; the best checkpoint is often missed, indicating that the validation set is an imperfect proxy for test performance.

Practical Implications

  • Training pipelines: Replace “stop when validation accuracy stops improving” with “stop when validation loss stops improving”. This simple change can recover up to roughly 2% of test accuracy without extra computation.
  • Model‑selection tooling: When building automated hyper‑parameter search or continuous training systems (e.g., MLOps platforms), prioritize loss‑based metrics for checkpoint selection and logging.
  • Early‑stopping hyper‑parameters: The study suggests that patience values matter less for loss‑based stopping, allowing more aggressive early stopping (shorter training) without sacrificing accuracy.
  • Ensembling & checkpoint averaging: Since the best test checkpoint is rarely captured by a single validation rule, developers might consider checkpoint ensembling (e.g., averaging weights of the top‑N loss‑based checkpoints) to hedge against the validation‑test mismatch.
  • Loss function design: The fact that three distinct loss functions behave similarly in validation suggests that the shape of the loss surface (smooth, calibrated) is more important than the exact formulation for model‑selection purposes.
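The checkpoint‑averaging idea above can be sketched as follows. This is a hypothetical illustration: `checkpoints` and the flat‑list parameters are stand‑ins for real model state dicts and weight tensors.

```python
def average_top_n(checkpoints, n):
    """Average the parameters of the n checkpoints with lowest validation loss.

    `checkpoints` is a list of (val_loss, params) pairs, where params maps
    layer names to flat lists of floats (a stand-in for weight tensors).
    """
    top = sorted(checkpoints, key=lambda c: c[0])[:n]  # lowest val loss first
    names = top[0][1].keys()
    return {
        name: [sum(vals) / n for vals in zip(*(params[name] for _, params in top))]
        for name in names
    }

checkpoints = [
    (0.30, {"w": [1.0, 2.0]}),
    (0.25, {"w": [1.2, 2.2]}),
    (0.40, {"w": [0.8, 1.8]}),
]
print(average_top_n(checkpoints, n=2))  # averages the 0.25 and 0.30 checkpoints
```

In a real pipeline the same logic would operate on saved model state dicts, averaging each weight tensor elementwise across the top‑N loss‑ranked checkpoints.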

Limitations & Future Work

  • Model scope – Experiments were limited to fully‑connected networks; convolutional or transformer architectures may exhibit different dynamics.
  • Dataset diversity – While several standard benchmarks were used, the study did not cover large‑scale vision or language corpora where validation‑set size and class imbalance could affect results.
  • Single‑metric selection – The authors evaluated each validation metric in isolation; combining accuracy and loss (e.g., multi‑objective early stopping) remains unexplored.
  • Theoretical grounding – The work is empirical; a formal analysis of why loss‑based criteria align better with test performance would strengthen the conclusions.

Bottom line: If you’re still using validation accuracy to decide when to stop training or which checkpoint to deploy, it’s time for a quick refactor. Switching to a loss‑based validation criterion can yield more stable, higher‑performing models with virtually no extra cost.

Authors

  • Andrea Apicella
  • Francesco Isgrò
  • Andrea Pollastro
  • Roberto Prevete

Paper Information

  • arXiv ID: 2602.22107v1
  • Categories: cs.LG, cs.AI
  • Published: February 25, 2026
