[Paper] Topic Modelling Black Box Optimization

Published: December 18, 2025 at 07:00 AM EST
4 min read
Source: arXiv - 2512.16445v1

Overview

Choosing the right number of topics (T) for a Latent Dirichlet Allocation (LDA) model is a classic “knob‑twiddling” problem that directly impacts model quality and interpretability. This paper reframes the task as a discrete black‑box optimization (BBO) problem and pits classic evolutionary algorithms against two newly‑proposed, learned (amortized) optimizers. The authors show that the learned methods can locate near‑optimal topic counts with dramatically fewer LDA training runs—an attractive win for anyone who has ever waited hours for a single LDA experiment.

Key Contributions

  • Problem formulation: Cast the selection of the LDA topic count as a discrete BBO task where each evaluation = “train LDA + measure validation perplexity”.
  • Algorithmic comparison: Benchmarked four optimizers under a strict evaluation budget:
    1. Genetic Algorithm (GA) – classic evolutionary search.
    2. Evolution Strategy (ES) – another hand‑crafted evolutionary method.
    3. Preferential Amortized BBO (PABBO) – learns a preference model from past runs.
    4. Sharpness‑Aware BBO (SABBO) – learns a surrogate that accounts for loss landscape sharpness.
  • Empirical finding: While all methods converge to a similar perplexity band, the amortized optimizers (PABBO, SABBO) reach that region with far fewer LDA trainings—SABBO often after a single evaluation.
  • Sample‑efficiency analysis: Quantified the reduction in required evaluations (up to ~90 % fewer) and wall‑clock time compared with GA/ES.
  • Open‑source baseline: Provided code and reproducible scripts so practitioners can plug the optimizers into their own LDA pipelines.

Methodology

  1. Black‑Box Definition – The objective function f(T) returns the validation perplexity of an LDA model trained with T topics, where T is an integer in a pre‑specified range (e.g., 5–200).
  2. Evaluation Budget – Each experiment is limited to a fixed number of function calls (e.g., 30 LDA trainings). This mimics real‑world constraints where each training can take minutes to hours.
  3. Optimizers
    • GA evolves a population of candidate T values via crossover and mutation, keeping the candidates with the best perplexity each generation.
    • ES samples candidate T values from a Gaussian search distribution and updates its mean/variance based on the elite scores.
    • PABBO trains a lightweight neural network to predict a preference ordering over candidate T values from past evaluations, then samples the most promising candidates.
    • SABBO builds a surrogate model of f(T) that also estimates the sharpness (sensitivity) of the loss surface, guiding the search toward flat minima that generalize better.
  4. Metrics – Primary metric is validation perplexity; secondary metrics include number of evaluations to reach a given perplexity threshold and total runtime.
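The budgeted black‑box setup above can be sketched as follows. Note that the expensive "train LDA + measure validation perplexity" step is replaced here by a cheap synthetic curve so the sketch runs instantly; the class name, the budget handling, and the synthetic objective are illustrative assumptions, not the paper's code:

```python
import random

class BudgetedObjective:
    """Wraps an expensive evaluation f(T) with a fixed call budget.

    The real step would be: train an LDA model with T topics and return
    its validation perplexity. Here a synthetic U-shaped curve stands in
    for that (hypothetical, for illustration only).
    """
    def __init__(self, budget=30, t_min=5, t_max=200, seed=0):
        self.budget = budget
        self.t_min, self.t_max = t_min, t_max
        self.calls = 0
        self.rng = random.Random(seed)
        self.history = []  # (T, perplexity) pairs seen so far

    def __call__(self, T):
        if self.calls >= self.budget:
            raise RuntimeError("evaluation budget exhausted")
        assert self.t_min <= T <= self.t_max
        self.calls += 1
        # Synthetic noisy perplexity with a minimum near T = 60.
        perplexity = 1100 + 0.05 * (T - 60) ** 2 + self.rng.gauss(0, 5)
        self.history.append((T, perplexity))
        return perplexity

obj = BudgetedObjective()
print(obj(60) < obj(200))  # → True: a far-off T scores much worse
```

Any of the four optimizers can then be handed `obj` as an opaque function; the `history` list is what an amortized method like PABBO would condition on.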

The whole pipeline is implemented in Python, using Gensim for LDA training and PyTorch for the learned optimizers.
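For a concrete feel of the GA baseline, here is a minimal, self‑contained sketch of an elitist genetic search over the integer topic count T. The objective is again a synthetic stand‑in for real LDA training, and the population size, mutation range, and function names are assumptions for illustration, not the authors' implementation:

```python
import random

def ga_search(f, t_min=5, t_max=200, pop_size=6, budget=30, seed=0):
    """Tiny genetic algorithm over the integer topic count T.

    f(T) -> perplexity (lower is better). Stops once `budget`
    evaluations have been spent.
    """
    rng = random.Random(seed)
    evals = 0

    def score(T):
        nonlocal evals
        evals += 1
        return f(T)

    # Initial random population of distinct topic counts.
    pop = [(score(t), t) for t in rng.sample(range(t_min, t_max + 1), pop_size)]
    while evals + pop_size // 2 <= budget:
        pop.sort()                       # best (lowest perplexity) first
        parents = [t for _, t in pop[: pop_size // 2]]
        children = []
        for _ in range(pop_size // 2):
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2         # crossover: midpoint of parents
            child += rng.randint(-5, 5)  # mutation: small integer jitter
            child = max(t_min, min(t_max, child))
            children.append((score(child), child))
        pop = pop[: pop_size // 2] + children  # elitist replacement
    pop.sort()
    return pop[0][1]  # best T found within the budget

# Synthetic stand-in for 'train LDA + measure perplexity' (min near T=60).
best = ga_search(lambda T: (T - 60) ** 2, budget=30)
print(best)
```

Swapping `ga_search` for a learned optimizer keeps the same interface: any method that consumes `f` and respects the budget is a drop‑in replacement.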

Results & Findings

Optimizer   Evaluations to reach “near‑optimal” perplexity*   Final perplexity (avg.)   Runtime reduction vs. GA
GA          ~28 / 30                                          1120 ± 45                 (baseline)
ES          ~26 / 30                                          1115 ± 38                 not reported
PABBO       ~4–5                                              1118 ± 40                 ~80 % faster
SABBO       1–2                                               1122 ± 42                 ~90 % faster

* “Near‑optimal” defined as within 2 % of the best perplexity observed across the full budget.
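The "evaluations to reach near‑optimal" column can be recovered from an optimizer's evaluation trace with a small helper like the one below (a hypothetical utility, not from the paper's code):

```python
def evals_to_near_optimal(perplexities, tol=0.02):
    """Return the 1-based index of the first evaluation whose perplexity
    is within `tol` (default 2 %) of the best value seen over the whole
    budget, mirroring the table's definition of "near-optimal"."""
    best = min(perplexities)
    threshold = best * (1 + tol)
    for i, p in enumerate(perplexities, start=1):
        if p <= threshold:
            return i

# Example trace of per-evaluation validation perplexities.
trace = [1400, 1250, 1190, 1130, 1125, 1122, 1120]
print(evals_to_near_optimal(trace))  # → 4
```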

Key takeaways

  • All four methods eventually converge to the same quality region, confirming that the search space is well‑behaved.
  • The amortized approaches dramatically cut the number of expensive LDA trainings, turning a multi‑hour hyper‑parameter sweep into a matter of minutes.
  • SABBO’s sharpness‑aware surrogate is especially effective when the perplexity curve is noisy, allowing it to “guess” the right T from almost no data.

Practical Implications

  • Faster model prototyping: Data scientists can now tune the number of topics on large corpora (e.g., news archives, code bases) without committing days to a grid search.
  • Automated pipelines: The learned optimizers can be embedded into CI/CD workflows for NLP services, automatically selecting T whenever the underlying corpus drifts.
  • Resource savings: Cloud‑based LDA training can be costly; cutting evaluations by 80–90 % translates directly into lower compute bills and lower carbon footprint.
  • Generalizable recipe: The same amortized BBO framework can be applied to other discrete hyper‑parameters (e.g., number of clusters in k‑means, depth of decision trees) where each evaluation is expensive.

Limitations & Future Work

  • Dataset scope: Experiments were limited to a handful of benchmark corpora; performance on extremely high‑dimensional or streaming text remains untested.
  • Discrete surrogate fidelity: The learned models operate on a relatively small integer domain; scaling to larger ranges (e.g., thousands of topics) may require more sophisticated embeddings.
  • Cold‑start cost: PABBO and SABBO need an initial set of evaluations to train their surrogates; in truly “one‑shot” scenarios the benefit diminishes.
  • Future directions: Extending the approach to jointly optimize multiple LDA hyper‑parameters (α, β, inference steps), investigating meta‑learning across corpora, and integrating Bayesian uncertainty estimates for more robust decision‑making.

Authors

  • Roman Akramov
  • Artem Khamatullin
  • Svetlana Glazyrina
  • Maksim Kryzhanovskiy
  • Roman Ischenko

Paper Information

  • arXiv ID: 2512.16445v1
  • Categories: cs.LG, cs.AI, cs.CL, cs.NE
  • Published: December 18, 2025