[Paper] Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection
Source: arXiv - 2602.22107v1
Overview
The paper Don’t stop me now: Rethinking Validation Criteria for Model Parameter Selection examines a surprisingly overlooked question: how the choice of validation metric influences the final test performance of neural classifiers. By systematically comparing accuracy‑based and loss‑based validation strategies—both with early stopping and with post‑hoc checkpoint selection—the authors reveal that the conventional practice of stopping training on validation accuracy can actually hurt generalisation.
Key Contributions
- Empirical comparison of validation criteria (accuracy vs. three loss functions) across multiple datasets and model architectures.
- Rigorous evaluation of early stopping vs. post‑hoc checkpoint selection under a $k$‑fold protocol, exposing systematic performance gaps.
- Statistical evidence that loss‑based validation criteria are more stable and often outperform accuracy‑based early stopping.
- Practical recommendation: avoid validation accuracy for model selection; prefer loss‑based metrics, especially when early stopping is used.
Methodology
- Models & Datasets – Fully‑connected neural networks were trained on several standard classification benchmarks (e.g., MNIST, CIFAR‑10, etc.).
- Training Objectives – Three loss functions were explored:
- Standard cross‑entropy (CE)
- C‑Loss (a calibrated variant)
- PolyLoss (a polynomial‑augmented CE)
- Validation Strategies – For each training run, the authors saved a checkpoint at every epoch and then applied four selection rules:
- Early stopping with patience, using validation accuracy
- Early stopping with patience, using validation loss (each of the three loss types)
- Post‑hoc selection (no early stopping) based on validation accuracy
- Post‑hoc selection based on validation loss (each loss type)
- Evaluation Protocol – A $k$‑fold cross‑validation scheme ensured that results were not an artifact of any particular train/validation split. The test accuracy of the selected checkpoint was compared against the oracle checkpoint (the one with the highest test accuracy across all epochs).
- Statistical Analysis – Paired tests and confidence intervals quantified the significance of observed differences.
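The paper does not ship code, but the two families of selection rules are easy to make concrete. The sketch below assumes only that the monitored validation metric (accuracy or loss) is logged once per epoch; the function names and the exact patience semantics are illustrative, not the authors' implementation.

```python
from typing import List

def early_stop_epoch(metric_history: List[float], patience: int,
                     minimize: bool = True) -> int:
    """Index of the checkpoint an early-stopping rule would keep.

    Training is assumed to halt once the monitored metric fails to improve
    for `patience` consecutive epochs; the best epoch seen so far is kept.
    """
    sign = 1.0 if minimize else -1.0
    best_epoch, best_value, waited = 0, sign * metric_history[0], 0
    for epoch, value in enumerate(metric_history[1:], start=1):
        if sign * value < best_value:
            best_epoch, best_value, waited = epoch, sign * value, 0
        else:
            waited += 1
            if waited >= patience:
                break  # later (possibly better) checkpoints are never seen
    return best_epoch

def post_hoc_epoch(metric_history: List[float], minimize: bool = True) -> int:
    """Post-hoc selection: scan the full history and keep the best epoch."""
    best = min if minimize else max
    return metric_history.index(best(metric_history))
```

The key difference the paper exploits: `early_stop_epoch` can terminate before the globally best checkpoint appears, whereas `post_hoc_epoch` always sees the whole run. Applying each rule to both validation accuracy (`minimize=False`) and validation loss yields the four configurations compared in the study.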
Results & Findings
| Validation rule | Typical test‑accuracy gap vs. oracle | Stability across folds |
|---|---|---|
| Early stopping on validation accuracy | Largest negative gap (up to ~2 % lower than oracle) | High variance |
| Early stopping on validation loss (any of the three) | Small, often negligible gap | Consistently low variance |
| Post‑hoc selection on validation loss | Comparable to loss‑based early stopping, sometimes slightly better | Very stable |
| Post‑hoc selection on validation accuracy | Better than accuracy‑based early stopping but still worse than loss‑based rules | Moderate variance |
Key take‑aways:
- Accuracy‑based early stopping consistently underperforms both loss‑based early stopping and any post‑hoc strategy.
- Loss‑based validation criteria (CE, C‑Loss, PolyLoss) produce comparable, more reliable test accuracy regardless of whether early stopping is used.
- No single validation rule matches the oracle across all folds; the best checkpoint is often missed, indicating that the validation set is an imperfect proxy for test performance.
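The "gap vs. oracle" quantity in the table above can be stated compactly. As a minimal sketch (the function name and signature are illustrative), assuming both the per-epoch test accuracies and validation losses of a finished run are available:

```python
from typing import List

def oracle_gap(test_acc: List[float], val_loss: List[float]) -> float:
    """Accuracy gap between the checkpoint a post-hoc, loss-based rule
    selects and the oracle checkpoint (best test accuracy over all epochs).

    The validation set is only a proxy for the test set, so this gap is
    generally non-zero even for the best-performing selection rules.
    """
    selected = val_loss.index(min(val_loss))  # post-hoc loss-based selection
    return max(test_acc) - test_acc[selected]
```

In the paper's experiments this gap is small for loss-based rules and largest for accuracy-based early stopping; computing it requires test labels, so it is a diagnostic for studies like this one, not a quantity available during real model selection.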
Practical Implications
- Training pipelines: Replace “stop when validation accuracy stops improving” with “stop when validation loss stops improving”. This simple change can close much of the gap to the oracle checkpoint (up to roughly two percentage points of test accuracy in the reported experiments) without extra computation.
- Model‑selection tooling: When building automated hyper‑parameter search or continuous training systems (e.g., MLOps platforms), prioritize loss‑based metrics for checkpoint selection and logging.
- Early‑stopping hyper‑parameters: The study suggests that patience values matter less for loss‑based stopping, allowing more aggressive early stopping (shorter training) without sacrificing accuracy.
- Ensembling & checkpoint averaging: Since the best test checkpoint is rarely captured by a single validation rule, developers might consider checkpoint ensembling (e.g., averaging weights of the top‑N loss‑based checkpoints) to hedge against the validation‑test mismatch.
- Loss function design: The fact that three distinct loss functions behave similarly in validation suggests that the shape of the loss surface (smooth, calibrated) is more important than the exact formulation for model‑selection purposes.
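The checkpoint-ensembling idea above is not part of the paper's experiments, but a uniform weight average of the top-N loss-based checkpoints is straightforward to sketch. Here each checkpoint is a dict mapping parameter names to flat lists of floats, a stand-in for real framework tensors; the function name is hypothetical:

```python
from typing import Dict, List

def average_top_k(checkpoints: List[Dict[str, List[float]]],
                  val_losses: List[float], k: int) -> Dict[str, List[float]]:
    """Uniformly average the weights of the k checkpoints with the lowest
    validation loss (a simple hedge against validation-test mismatch).
    Assumes all checkpoints share the same parameter names and shapes.
    """
    # Indices of the k best checkpoints by validation loss, best first.
    top = sorted(range(len(val_losses)), key=val_losses.__getitem__)[:k]
    return {
        name: [sum(checkpoints[i][name][j] for i in top) / k
               for j in range(len(checkpoints[top[0]][name]))]
        for name in checkpoints[top[0]]
    }
```

Weight averaging only makes sense for checkpoints from the same training trajectory (so the parameters live in a compatible region of weight space); for checkpoints from independent runs, prediction-level ensembling is the safer choice.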
Limitations & Future Work
- Model scope – Experiments were limited to fully‑connected networks; convolutional or transformer architectures may exhibit different dynamics.
- Dataset diversity – While several standard benchmarks were used, the study did not cover large‑scale vision or language corpora where validation‑set size and class imbalance could affect results.
- Single‑metric selection – The authors evaluated each validation metric in isolation; combining accuracy and loss (e.g., multi‑objective early stopping) remains unexplored.
- Theoretical grounding – The work is empirical; a formal analysis of why loss‑based criteria align better with test performance would strengthen the conclusions.
Bottom line: If you’re still using validation accuracy to decide when to stop training or which checkpoint to deploy, it’s time for a quick refactor. Switching to a loss‑based validation criterion can yield more stable, higher‑performing models with virtually no extra cost.
Authors
- Andrea Apicella
- Francesco Isgrò
- Andrea Pollastro
- Roberto Prevete
Paper Information
- arXiv ID: 2602.22107v1
- Categories: cs.LG, cs.AI
- Published: February 25, 2026