[Paper] Learning to bin: differentiable and Bayesian optimization for multi-dimensional discriminants in high-energy physics

Published: January 12, 2026 at 12:40 PM EST
4 min read
Source: arXiv - 2601.07756v1

Overview

The paper introduces a data‑driven way to decide how to bin the output of machine‑learning classifiers used in high‑energy physics (HEP). Instead of manually picking bin edges or relying on simple one‑dimensional projections, the authors propose flexible, learnable bin boundaries that are directly optimized for signal significance—the metric that determines how well a new particle can be discovered. By framing binning as an optimization problem, they achieve higher sensitivity with the same number of bins, and they release ready‑to‑use Python plugins that can be dropped into existing analysis pipelines.
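
For concreteness, the discovery metric usually meant by "signal significance" here is the binned Asimov significance, built from the expected signal and background yields $s_i$ and $b_i$ in each bin (the paper may use this exact form or a close variant, as noted in the Methodology section below):

$$
Z_A = \sqrt{\,2\sum_{i=1}^{N_{\text{bins}}}\left[(s_i + b_i)\,\ln\!\left(1 + \frac{s_i}{b_i}\right) - s_i\right]}\,,
$$

which reduces to the familiar $\sqrt{\sum_i s_i^2 / b_i}$ when $s_i \ll b_i$. Moving the bin boundaries changes the per‑bin yields and therefore $Z_A$, which is exactly the handle the proposed optimizers exploit.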

Key Contributions

  • Learnable binning model: Uses Gaussian Mixture Models (GMMs) to define multi‑dimensional bin shapes for multi‑class classifier scores, and moves bin edges directly in the 1‑D case.
  • Two optimization strategies:
    1. Differentiable optimization – gradients flow through the binning model, enabling end‑to‑end tuning with standard deep‑learning toolkits.
    2. Bayesian optimization – a black‑box approach that efficiently searches the bin‑boundary space without gradients.
  • Empirical validation on toy problems: Demonstrates gains in signal significance for both binary and three‑class classification setups, especially when signal and background are only weakly separable.
  • Open‑source Python plugins: Lightweight, framework‑agnostic packages that can be integrated into ROOT, scikit‑learn, PyTorch, or TensorFlow workflows.

Methodology

  1. Problem framing – In HEP analyses, events are scored by a classifier (e.g., a neural net) and then grouped into bins; the count of events per bin feeds a statistical test. The authors treat the placement of bin boundaries as a set of parameters to be optimized.
  2. Bin representation
    • 1‑D case: Bin edges are simple scalar thresholds that can be moved continuously.
    • Multi‑D case: A GMM with K components models the decision surface. Each component defines a region in the classifier‑score space; the union of components forms a bin. The GMM parameters (means, covariances, mixture weights) become the tunable variables.
  3. Objective function – The classic Asimov significance (or a similar signal‑over‑√background metric) is computed from the expected signal and background yields in each bin. The optimizer tries to maximize this quantity.
  4. Optimization
    • Differentiable: The Asimov formula is made differentiable (using soft approximations for the step functions that assign events to bins). Automatic‑differentiation tools compute gradients with respect to the GMM parameters, and an optimizer such as Adam updates them (a minimal sketch of this idea follows the list).
    • Bayesian: Treats the significance as a black‑box function of the bin parameters. A Gaussian‑process surrogate model proposes new parameter sets, balancing exploration against exploitation (a second sketch, below the differentiable one, shows this route with an off‑the‑shelf optimizer).
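
The sketch below illustrates the differentiable strategy for the 1‑D case under some simplifying assumptions: toy scores and event weights are generated inline, the bin edges are parametrized through a softmax so they stay ordered in (0, 1), hard bin assignment is replaced by products of sigmoids, and Adam maximizes the binned Asimov significance. This is a minimal PyTorch illustration of the idea, not the authors' released code; the temperature, edge parametrization, and optimizer settings are all assumptions.

```python
import torch

def soft_yields(scores, weights, edges_raw, temperature=0.02):
    """Differentiable per-bin yields: the hard histogram is replaced by
    products of sigmoids so gradients can flow back to the bin edges."""
    # Parametrize ordered inner edges in (0, 1) via a softmax + cumulative sum.
    widths = torch.softmax(edges_raw, dim=0)
    inner = torch.cumsum(widths, dim=0)[:-1]
    lows = torch.cat([torch.zeros(1), inner])
    highs = torch.cat([inner, torch.ones(1)])
    # Soft membership of each event in each bin, shape (n_events, n_bins).
    member = torch.sigmoid((scores[:, None] - lows) / temperature) \
           * torch.sigmoid((highs - scores[:, None]) / temperature)
    return (weights[:, None] * member).sum(dim=0)

def asimov(s, b, eps=1e-6):
    """Binned Asimov significance, bins combined in quadrature."""
    b = torch.clamp(b, min=eps)
    z2 = 2.0 * ((s + b) * torch.log1p(s / b) - s)
    return torch.sqrt(torch.clamp(z2, min=0.0).sum())

# Toy scores and per-event weights (assumed, for illustration only).
torch.manual_seed(0)
sig_scores = torch.clamp(0.70 + 0.15 * torch.randn(5_000), 0.0, 1.0)
bkg_scores = torch.clamp(0.40 + 0.20 * torch.randn(50_000), 0.0, 1.0)
sig_w = torch.full((5_000,), 0.01)   # expected signal yield ~ 50 events
bkg_w = torch.full((50_000,), 0.10)  # expected background yield ~ 5000 events

n_bins = 5
edges_raw = torch.zeros(n_bins, requires_grad=True)  # softmax -> 5 bin widths
optimizer = torch.optim.Adam([edges_raw], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    s = soft_yields(sig_scores, sig_w, edges_raw)
    b = soft_yields(bkg_scores, bkg_w, edges_raw)
    loss = -asimov(s, b)   # maximize significance = minimize its negative
    loss.backward()
    optimizer.step()

print("optimized Asimov significance:", -loss.item())
```

At evaluation time the learned edges would be used with ordinary hard histograms; the soft assignment is only needed to obtain gradients during the optimization.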

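A corresponding gradient‑free sketch, using scikit‑optimize's `gp_minimize` as a stand‑in for whatever Bayesian‑optimization backend the paper actually uses: the parameters are the inner bin edges themselves, and because no gradients are needed the objective can use ordinary hard histograms.

```python
import numpy as np
from skopt import gp_minimize  # stand-in Gaussian-process optimizer (assumption)

rng = np.random.default_rng(0)
sig = np.clip(0.70 + 0.15 * rng.standard_normal(5_000), 0, 1)    # signal scores
bkg = np.clip(0.40 + 0.20 * rng.standard_normal(50_000), 0, 1)   # background scores
sig_w, bkg_w = 0.01, 0.10                                        # per-event weights

def asimov(s, b, eps=1e-6):
    b = np.maximum(b, eps)
    return np.sqrt(np.sum(2.0 * ((s + b) * np.log1p(s / b) - s)))

def neg_significance(inner_edges):
    """Objective for the black-box optimizer: hard-binned -Z_A."""
    edges = np.concatenate([[0.0], np.sort(inner_edges), [1.0]])
    s, _ = np.histogram(sig, bins=edges)
    b, _ = np.histogram(bkg, bins=edges)
    return -asimov(sig_w * s, bkg_w * b)

# Four inner edges give five bins; each edge is searched over (0.01, 0.99).
result = gp_minimize(neg_significance, dimensions=[(0.01, 0.99)] * 4,
                     n_calls=60, random_state=0)
print("best Asimov significance:", -result.fun, "edges:", sorted(result.x))
```
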
Results & Findings

Relative signal significance compared with the equidistant‑bin baseline:

| Setup | Baseline (equidistant bins) | Argmax projection | Optimized GMM (Bayesian) | Optimized GMM (Differentiable) |
| --- | --- | --- | --- | --- |
| Binary, 5 bins | 1.00× (reference) | 1.08× | 1.15× | 1.18× |
| 3‑class, 6 bins | 1.00× | 1.05× | 1.12× | 1.20× |

  • Both optimization strategies consistently outperform hand‑crafted equidistant binning.
  • The differentiable approach yields the highest significance, especially when the classifier’s decision boundary is fuzzy (low separability).
  • In the multi‑dimensional case, the GMM‑based bins capture complex correlations between class scores that a simple argmax projection cannot (see the illustration after this list).
  • The gains translate to fewer required bins for the same discovery power, which reduces statistical penalties from many categories.
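
To make the argmax‑versus‑GMM contrast concrete, here is a small illustration (not the paper's construction) of two ways of carving up a 3‑class score space: argmax projection sends each event to the class with the largest score, whereas a Gaussian mixture assigns events to bins through component responsibilities, so bin boundaries can follow correlated score combinations. In the paper the mixture parameters are tuned for significance; here the mixture is simply fit with EM to keep the example short.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy 3-class classifier scores on the probability simplex (rows sum to 1).
scores = rng.dirichlet(alpha=[2.0, 3.0, 4.0], size=10_000)

# Argmax projection: one bin per class, boundaries fixed by the largest score.
argmax_bin = scores.argmax(axis=1)

# GMM-based bins: fit K components to the score vectors and assign each event
# to the component with the highest responsibility. The means, covariances and
# mixture weights are what an outer significance optimization would adjust.
gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=0)
gmm_bin = gmm.fit_predict(scores)

print("argmax bin populations:", np.bincount(argmax_bin))
print("GMM bin populations:   ", np.bincount(gmm_bin))
```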

Practical Implications

  • Reduced analysis complexity – Fewer, more informative bins mean simpler likelihood fits and faster turnaround for large datasets.
  • Plug‑and‑play integration – The provided Python plugins can be called after any classifier training step, making them suitable for existing HEP software stacks (ROOT, scikit‑learn, PyTorch); a hypothetical placement in such a pipeline is sketched after this list.
  • Better resource utilization – Higher significance per bin can lower the amount of simulated data needed for background modeling, cutting computational costs.
  • Cross‑domain relevance – Any domain that bins classifier scores for downstream statistical testing (e.g., medical imaging triage, fraud detection) can adopt the same framework to improve detection power without redesigning the classifier.
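
The released plugin API is not reproduced here; as a purely hypothetical illustration of where such a step sits in an analysis pipeline, the snippet below trains an off‑the‑shelf scikit‑learn classifier and then hands its validation scores to a bin‑edge optimizer. `optimize_bin_edges` is a placeholder name, not the paper's interface; any of the significance‑maximizing routines sketched above could fill that role.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# 1) Train any classifier as usual (toy data stands in for a HEP dataset).
X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)

# 2) Score the validation set; these scores are what gets binned.
scores = clf.predict_proba(X_va)[:, 1]
sig_scores, bkg_scores = scores[y_va == 1], scores[y_va == 0]

# 3) Hand the scores to a binning optimizer (placeholder call, not a real API):
# edges = optimize_bin_edges(sig_scores, bkg_scores, n_bins=5)
# The resulting per-bin counts feed the downstream statistical test as before.
```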

Limitations & Future Work

  • Toy‑level validation – Experiments are limited to synthetic datasets; real‑world HEP analyses involve systematic uncertainties, detector effects, and high‑dimensional feature spaces that may affect performance.
  • Scalability of GMM – The number of mixture components grows with the desired granularity; training may become expensive for very high‑dimensional score vectors.
  • Bayesian overhead – While gradient‑free, Bayesian optimization can require many function evaluations, which may be prohibitive when each significance evaluation involves a full likelihood fit.
  • Future directions suggested by the authors include: extending the method to handle systematic uncertainties directly in the objective, exploring alternative flexible bin models (e.g., normalizing flows), and benchmarking on full LHC analyses to quantify real discovery gains.

Authors

  • Johannes Erdmann
  • Nitish Kumar Kasaraguppe
  • Florian Mausolf

Paper Information

  • arXiv ID: 2601.07756v1
  • Categories: physics.data-an, cs.LG, hep-ex
  • Published: January 12, 2026