[Paper] Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection
Source: arXiv - 2602.22107v1
Overview
The paper Don’t stop me now: Rethinking Validation Criteria for Model Parameter Selection examines a surprisingly overlooked question: how the choice of validation metric influences the final test performance of neural classifiers. By systematically comparing accuracy‑based and loss‑based validation strategies—both with early stopping and with post‑hoc checkpoint selection—the authors reveal that the conventional practice of stopping training on validation accuracy can actually hurt generalisation.
Key Contributions
- Empirical comparison of validation criteria (accuracy vs. three loss functions) across multiple datasets and model architectures.
- Rigorous evaluation of early stopping vs. post‑hoc checkpoint selection under a $k$‑fold protocol, exposing systematic performance gaps.
- Statistical evidence that loss‑based validation criteria are more stable and often outperform accuracy‑based early stopping.
- Practical recommendation: avoid validation accuracy for model selection; prefer loss‑based metrics, especially when early stopping is used.
Methodology
- Models & Datasets – Fully‑connected neural networks were trained on several standard classification benchmarks (e.g., MNIST, CIFAR‑10, etc.).
- Training Objectives – Three loss functions were explored:
- Standard cross‑entropy (CE)
- C‑Loss (a calibrated variant)
- PolyLoss (a polynomial‑augmented CE)
- Validation Strategies – For each training run, the authors saved a checkpoint at every epoch and then applied four selection rules:
- Early stopping with patience, using validation accuracy
- Early stopping with patience, using validation loss (each of the three loss types)
- Post‑hoc selection (no early stopping) based on validation accuracy
- Post‑hoc selection based on validation loss (each loss type)
- Evaluation Protocol – A $k$‑fold cross‑validation scheme ensured that results were not an artifact of any particular train/validation split. The test accuracy of the selected checkpoint was compared against the oracle checkpoint (the one with the highest test accuracy across all epochs).
- Statistical Analysis – Paired tests and confidence intervals quantified the significance of observed differences.
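The paper does not ship code, but the two families of selection rules are easy to make concrete. The sketch below assumes only that the monitored validation metric (accuracy or loss) is logged once per epoch; the function names and the exact patience semantics are illustrative, not the authors' implementation.

```python
from typing import List

def early_stop_epoch(metric_history: List[float], patience: int,
                     minimize: bool = True) -> int:
    """Index of the checkpoint an early-stopping rule would keep.

    Training is assumed to halt once the monitored metric fails to improve
    for `patience` consecutive epochs; the best epoch seen so far is kept.
    """
    sign = 1.0 if minimize else -1.0
    best_epoch, best_value, waited = 0, sign * metric_history[0], 0
    for epoch, value in enumerate(metric_history[1:], start=1):
        if sign * value < best_value:
            best_epoch, best_value, waited = epoch, sign * value, 0
        else:
            waited += 1
            if waited >= patience:
                break  # later (possibly better) checkpoints are never seen
    return best_epoch

def post_hoc_epoch(metric_history: List[float], minimize: bool = True) -> int:
    """Post-hoc selection: scan the full history and keep the best epoch."""
    best = min if minimize else max
    return metric_history.index(best(metric_history))
```

The key difference the paper exploits: `early_stop_epoch` can terminate before the globally best checkpoint appears, whereas `post_hoc_epoch` always sees the whole run. Applying each rule to both validation accuracy (`minimize=False`) and validation loss yields the four configurations compared in the study.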
Results & Findings
| Validation rule | Typical test‑accuracy gap vs. oracle | Stability across folds |
|---|---|---|
| Early stopping on validation accuracy | Largest negative gap (up to ~2 % lower than oracle) | High variance |
| Early stopping on validation loss (any of the three) | Small, often negligible gap | Consistently low variance |
| Post‑hoc selection on validation loss | Comparable to loss‑based early stopping, sometimes slightly better | Very stable |
| Post‑hoc selection on validation accuracy | Better than accuracy‑based early stopping but still worse than loss‑based rules | Moderate variance |
Key take‑aways:
- Accuracy‑based early stopping consistently underperforms both loss‑based early stopping and any post‑hoc strategy.
- Loss‑based validation criteria (CE, C‑Loss, PolyLoss) produce comparable, more reliable test accuracy regardless of whether early stopping is used.
- No single validation rule matches the oracle across all folds; the best checkpoint is often missed, indicating that the validation set is an imperfect proxy for test performance.
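The "gap vs. oracle" quantity in the table above can be stated compactly. As a minimal sketch (the function name and signature are illustrative), assuming both the per-epoch test accuracies and validation losses of a finished run are available:

```python
from typing import List

def oracle_gap(test_acc: List[float], val_loss: List[float]) -> float:
    """Accuracy gap between the checkpoint a post-hoc, loss-based rule
    selects and the oracle checkpoint (best test accuracy over all epochs).

    The validation set is only a proxy for the test set, so this gap is
    generally non-zero even for the best-performing selection rules.
    """
    selected = val_loss.index(min(val_loss))  # post-hoc loss-based selection
    return max(test_acc) - test_acc[selected]
```

In the paper's experiments this gap is small for loss-based rules and largest for accuracy-based early stopping; computing it requires test labels, so it is a diagnostic for studies like this one, not a quantity available during real model selection.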
Practical Implications
- Training pipelines: Replace “stop when validation accuracy stops improving” with “stop when validation loss stops improving”. This simple change can close much of the gap to the oracle checkpoint (up to roughly two percentage points of test accuracy in the reported experiments) without extra computation.
- Model‑selection tooling: When building automated hyper‑parameter search or continuous training systems (e.g., MLOps platforms), prioritize loss‑based metrics for checkpoint selection and logging.
- Early‑stopping hyper‑parameters: The study suggests that patience values matter less for loss‑based stopping, allowing more aggressive early stopping (shorter training) without sacrificing accuracy.
- Ensembling & checkpoint averaging: Since the best test checkpoint is rarely captured by a single validation rule, developers might consider checkpoint ensembling (e.g., averaging weights of the top‑N loss‑based checkpoints) to hedge against the validation‑test mismatch.
- Loss function design: The fact that three distinct loss functions behave similarly in validation suggests that the shape of the loss surface (smooth, calibrated) is more important than the exact formulation for model‑selection purposes.
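The checkpoint-ensembling idea above is not part of the paper's experiments, but a uniform weight average of the top-N loss-based checkpoints is straightforward to sketch. Here each checkpoint is a dict mapping parameter names to flat lists of floats, a stand-in for real framework tensors; the function name is hypothetical:

```python
from typing import Dict, List

def average_top_k(checkpoints: List[Dict[str, List[float]]],
                  val_losses: List[float], k: int) -> Dict[str, List[float]]:
    """Uniformly average the weights of the k checkpoints with the lowest
    validation loss (a simple hedge against validation-test mismatch).
    Assumes all checkpoints share the same parameter names and shapes.
    """
    # Indices of the k best checkpoints by validation loss, best first.
    top = sorted(range(len(val_losses)), key=val_losses.__getitem__)[:k]
    return {
        name: [sum(checkpoints[i][name][j] for i in top) / k
               for j in range(len(checkpoints[top[0]][name]))]
        for name in checkpoints[top[0]]
    }
```

Weight averaging only makes sense for checkpoints from the same training trajectory (so the parameters live in a compatible region of weight space); for checkpoints from independent runs, prediction-level ensembling is the safer choice.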
Limitations & Future Work
- Model scope – Experiments were limited to fully‑connected networks; convolutional or transformer architectures may exhibit different dynamics.
- Dataset diversity – While several standard benchmarks were used, the study did not cover large‑scale vision or language corpora where validation‑set size and class imbalance could affect results.
- Single‑metric selection – The authors evaluated each validation metric in isolation; combining accuracy and loss (e.g., multi‑objective early stopping) remains unexplored.
- Theoretical grounding – The work is empirical; a formal analysis of why loss‑based criteria align better with test performance would strengthen the conclusions.
Bottom line: If you’re still using validation accuracy to decide when to stop training or which checkpoint to deploy, it’s time for a quick refactor. Switching to a loss‑based validation criterion can yield more stable, higher‑performing models with virtually no extra cost.
Authors
- Andrea Apicella
- Francesco Isgrò
- Andrea Pollastro
- Roberto Prevete
Paper Information
- arXiv ID: 2602.22107v1
- Categories: cs.LG, cs.AI
- Published: February 25, 2026