[Paper] BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

Published: June 3, 2026 at 01:48 PM EDT

Source: arXiv - 2606.05139v1

Overview

The paper introduces BBOmix, the first open‑source tabular benchmark designed to evaluate hyperparameter optimization (HPO) strategies for unsupervised representation learning on real‑world biological data. By pooling roughly 105 k autoencoder (AE) training runs across multiple omics modalities, the authors show how reconstruction loss can be a misleading indicator of downstream performance and provide a test bed for next‑generation HPO algorithms.

Key Contributions

  • BBOmix benchmark: 105 k recorded AE training runs covering four architectures (vanilla AE, VAE, β‑VAE, and Denoising AE) and seven multi‑omics datasets (TCGA & SCHC). All data are released in a tidy, tabular format.
  • Correlation analysis: Systematic quantification of the relationship between unsupervised reconstruction loss and downstream task metrics (e.g., cancer subtype classification).
  • Comprehensive HPO evaluation: Baseline results for a wide spectrum of HPO methods—single‑fidelity (e.g., Random Search, Bayesian Optimization), multi‑fidelity (e.g., Hyperband, BOHB), and transfer‑learning approaches (e.g., Meta‑BO, warm‑started SMAC).
  • Open‑source tooling: Scripts for reproducing experiments, visualizing results, and extending the benchmark with new datasets or models.
  • Guidelines for practitioners: Empirical evidence on when reconstruction loss is a reliable proxy and when it fails, helping engineers choose appropriate validation strategies.

Methodology

  1. Data collection – The authors gathered seven high‑dimensional omics modalities (RNA‑seq, DNA‑methylation, copy‑number variation, etc.) from TCGA and the Swiss Cancer Cohort (SCHC). Each modality was pre‑processed (log‑transform, missing‑value imputation, standardization) to produce a consistent input space.
  2. Model space – Four AE families were instantiated with a configurable depth (1–4 hidden layers), latent dimensionality (8–256), activation functions, regularization strengths, and optimizer settings.
  3. Hyperparameter sampling – A quasi‑random Sobol sequence generated 105 k unique hyperparameter configurations, ensuring broad coverage of the search space.
  4. Training & evaluation – Each configuration was trained for a fixed budget (up to 200 epochs). Two metrics were recorded: (a) reconstruction loss (MSE or binary cross‑entropy, depending on data type) and (b) downstream performance measured by training a simple linear classifier on the learned latent vectors for tasks such as tumor type prediction.
  5. Benchmark construction – All runs were stored in a single CSV‑like table with columns for dataset, architecture, hyperparameters, training budget, reconstruction loss, and downstream scores.
  6. HPO experiments – The benchmark was then used as a surrogate: HPO algorithms queried the table instead of launching costly GPU jobs, allowing rapid comparison of search strategies under identical conditions.
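
Steps 5 and 6 can be illustrated with a minimal sketch. It assumes a hypothetical table layout: the column names (`dataset`, `arch`, `depth`, `latent_dim`, `epochs`, `recon_loss`, `downstream_acc`) and the toy rows below are illustrative only and are not taken from the released benchmark. The point is that an HPO method "evaluates" a configuration by looking it up in the table rather than by training a model.

```python
import csv
import io
import random

# Toy stand-in for the benchmark table. Column names and values are
# hypothetical; the released BBOmix files may be organized differently.
TOY_TABLE = """dataset,arch,depth,latent_dim,epochs,recon_loss,downstream_acc
rna,vae,1,8,200,0.41,0.71
rna,vae,2,32,200,0.35,0.78
rna,vae,3,64,200,0.30,0.74
rna,vae,4,256,200,0.28,0.69
"""

def load_table(text):
    """Index rows by (dataset, arch, depth, latent_dim, epochs)."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["dataset"], row["arch"], int(row["depth"]),
               int(row["latent_dim"]), int(row["epochs"]))
        table[key] = {"recon_loss": float(row["recon_loss"]),
                      "downstream_acc": float(row["downstream_acc"])}
    return table

def random_search(table, n_trials, seed=0):
    """Random search that queries the table instead of training a model."""
    rng = random.Random(seed)
    keys = sorted(table)  # the finite set of recorded configurations
    best_key, best_score = None, float("-inf")
    for _ in range(n_trials):
        key = rng.choice(keys)                # propose a configuration
        score = table[key]["downstream_acc"]  # "evaluate" = table lookup
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score

table = load_table(TOY_TABLE)
best_key, best_score = random_search(table, n_trials=20)
print(best_key, best_score)
```

Because every lookup is effectively free, the same loop can be repeated over many seeds to compare search strategies under identical conditions, which is the use case the authors describe.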

Results & Findings

| Aspect | What the authors observed |
| --- | --- |
| Reconstruction vs. downstream | Pearson correlation ranged from 0.2 to 0.55 depending on modality and architecture, so reconstruction loss is far from a reliable proxy. In some cases, the best downstream model had a higher reconstruction loss than many poorer configurations. |
| Single‑fidelity HPO | Bayesian Optimization (GP‑based) outperformed Random Search by ~15 % on average, but struggled on high‑dimensional hyperparameter spaces (e.g., when latent size and depth varied together). |
| Multi‑fidelity HPO | Methods that exploit early‑stopping information (Hyperband, BOHB) achieved up to 30 % improvement in downstream performance for the same computational budget, confirming that cheap early evaluations are informative. |
| Transfer‑learning HPO | Warm‑starting BO with meta‑features (e.g., dataset size, sparsity) reduced the number of trials needed to reach the top 5 % of configurations by ~40 % compared to cold‑start BO. |
| Architecture sensitivity | Denoising AEs were the most robust to hyperparameter misspecification, while β‑VAEs showed the highest variance, which makes them natural candidates for HPO research. |

Overall, the benchmark establishes that naïve reliance on reconstruction loss can misguide model selection, and that multi‑fidelity and transfer‑learning HPO methods provide tangible gains for unsupervised biological representation learning.
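
The reconstruction-versus-downstream finding can be made concrete with a small sketch. The `(recon_loss, downstream_acc)` pairs below are made up for illustration and are not values from the paper; the `pearson` helper is a plain textbook implementation, not the authors' analysis code. The sketch shows how a moderate correlation can coexist with the loss-selected model differing from the best downstream model.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up (recon_loss, downstream_acc) pairs for one hypothetical
# dataset/architecture slice. These are NOT values from the paper.
runs = [
    (0.28, 0.69),
    (0.30, 0.74),
    (0.33, 0.70),
    (0.35, 0.78),
    (0.41, 0.71),
    (0.47, 0.66),
]

recon = [loss for loss, _ in runs]
acc = [a for _, a in runs]

# If loss were a good proxy, lower loss would mean higher accuracy, so we
# correlate accuracy with the negated loss.
r = pearson([-x for x in recon], acc)

best_by_recon = min(runs, key=lambda run: run[0])  # lowest reconstruction loss
best_by_acc = max(runs, key=lambda run: run[1])    # highest downstream accuracy

print(f"Pearson r (negated loss vs. accuracy): {r:.2f}")
print("selected by reconstruction loss:", best_by_recon)
print("selected by downstream accuracy:", best_by_acc)
```

In this toy slice the two selection rules pick different runs, mirroring the case the authors report where the best downstream model has a higher reconstruction loss than poorer configurations.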

Practical Implications

  • For ML engineers in biotech: When deploying unsupervised AEs for downstream analyses (e.g., patient stratification), reserve part of the budget for a cheap proxy (early‑stopping loss) and for evaluation on a downstream validation set, rather than trusting reconstruction loss alone.
  • Tooling integration: BBOmix’s tabular format can be plugged into existing HPO platforms (Optuna, Ray Tune, Nevergrad) as a “mock” objective, enabling rapid prototyping of new search algorithms without GPU clusters.
  • AutoML pipelines: The findings encourage the inclusion of multi‑fidelity loops (e.g., Hyperband) and meta‑learning components when building AutoML services for omics data, potentially cutting training time by half.
  • Benchmark‑driven research: Start‑ups working on novel unsupervised architectures (e.g., contrastive or diffusion‑based models) can benchmark against BBOmix to demonstrate superior downstream utility, rather than just lower reconstruction error.
  • Regulatory & reproducibility: Because every run is logged with full hyperparameter provenance, BBOmix aids compliance with FAIR data principles and makes it easier to audit model choices in clinical settings.
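
To illustrate the multi‑fidelity loops mentioned above, here is a minimal sketch of successive halving, the building block of Hyperband, run against a tabular benchmark. Everything in it is an assumption for illustration: the configuration names, the epoch budgets, the accuracy values, and the choice to count cost as if training resumes from the previous budget. It is not the protocol used in the paper.

```python
# Hypothetical downstream accuracy per configuration at several epoch
# budgets. Names and numbers are made up for illustration.
CURVES = {
    "cfg_a": {50: 0.62, 100: 0.66, 200: 0.69},
    "cfg_b": {50: 0.70, 100: 0.75, 200: 0.78},
    "cfg_c": {50: 0.66, 100: 0.72, 200: 0.74},
    "cfg_d": {50: 0.57, 100: 0.60, 200: 0.61},
}

def successive_halving(curves, budgets, eta=2):
    """Keep the top 1/eta of configurations at each budget level.

    Returns the surviving configurations and the total number of epochs
    "spent", where table lookups stand in for actual training.
    """
    survivors = list(curves)
    epochs_spent = 0
    prev_budget = 0
    for budget in budgets:
        # Assumption: runs resume from the previous budget, so each
        # survivor only pays for the additional epochs.
        epochs_spent += len(survivors) * (budget - prev_budget)
        ranked = sorted(survivors, key=lambda c: curves[c][budget], reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        prev_budget = budget
    return survivors, epochs_spent

budgets = [50, 100, 200]
winners, spent = successive_halving(CURVES, budgets)
full_cost = len(CURVES) * max(budgets)  # cost of training everything fully
print(winners, spent, full_cost)  # ['cfg_b'] 400 800
```

In this toy setting the early rankings happen to agree with the final ones, so halving finds the best configuration at half the cost of full training. Whether that holds on real data depends on how informative early evaluations are, which is what the benchmark lets researchers measure.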

Limitations & Future Work

  • Scope of downstream tasks – The benchmark focuses on linear classifiers for a few cancer‑type prediction tasks; more complex downstream pipelines (e.g., survival analysis, multi‑task learning) remain untested.
  • Static dataset splits – All evaluations use a single train/validation/test split per modality, which may underestimate variability across cohorts.
  • Model diversity – Only four AE families are covered; emerging unsupervised paradigms such as contrastive learning, normalizing flows, or graph‑based encoders are absent.
  • Hardware realism – Since the benchmark is a surrogate table, it does not capture real‑world GPU memory constraints or runtime variations that can affect HPO decisions.

Future directions suggested by the authors include expanding BBOmix with contrastive and transformer‑based encoders, adding multi‑task downstream evaluations, and integrating runtime and memory metrics to enable cost‑aware HPO research.

Authors

  • Luca Thale-Bombien
  • Jan Ewald
  • Ralf König
  • Aaron Klein

Paper Information

  • arXiv ID: 2606.05139v1
  • Categories: cs.LG
  • Published: June 3, 2026