[Paper] Exploring Definitions of Quality and Diversity in Sonic Measurement Spaces
Source: arXiv - 2512.02783v1
Overview
The paper investigates how to let evolutionary algorithms automatically discover a wide variety of high‑quality sounds without relying on hand‑crafted audio descriptors or supervised classifiers. By using unsupervised dimensionality‑reduction (PCA and autoencoders) to build and continuously reshape the “behaviour space” that guides Quality‑Diversity (QD) search, the authors show that a system can explore far richer sonic territories while staying unbiased toward any pre‑selected sound families.
Key Contributions
- Unsupervised behaviour‑space construction: Demonstrates that PCA and deep autoencoders can turn raw audio feature vectors into compact, structured maps suitable for MAP‑Elites without any human‑defined descriptors.
- Dynamic reconfiguration: Introduces a simple schedule that periodically retrains the dimensionality‑reduction model, keeping the behaviour space aligned with the evolving population and preventing premature convergence.
- Empirical comparison: Benchmarks handcrafted, static behaviour spaces against the proposed automatic approaches across two distinct synthesis scenarios, showing a statistically significant boost in diversity.
- Practical recommendation: Finds that linear PCA, despite its simplicity, outperforms the tested autoencoders in this context, offering a low‑cost, high‑impact tool for sound‑design pipelines.
Methodology
- Synthesis environment: A digital sound synthesizer with millions of parameter combinations serves as the search domain.
- Feature extraction: For each generated sound, a high‑dimensional vector of standard audio descriptors (spectral, temporal, etc.) is computed.
- Dimensionality reduction:
  - PCA – computes the top‑k orthogonal axes that capture the most variance.
  - Autoencoder – a shallow neural network learns a non‑linear bottleneck representation.
- Behaviour space creation: The reduced vectors are discretised into a fixed‑size grid (the MAP‑Elites archive). Each cell stores the highest‑quality sound that falls into that region.
- Dynamic update: Every N generations the reduction model is retrained on the current elite set, redefining the grid boundaries and thus “re‑shaping” the exploration landscape.
- Evaluation: Two experimental setups (different synth architectures) are run with three behaviour‑space strategies: handcrafted descriptors, static PCA, and dynamic PCA/autoencoder. Diversity (coverage of the grid) and quality (objective fitness) are logged.
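The pipeline above (descriptor vectors → PCA projection → discretised MAP‑Elites archive) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the descriptor values, fitness scores, and grid size are random stand‑ins.

```python
# Sketch of the behaviour-space pipeline: descriptor vectors ->
# PCA projection -> discretised MAP-Elites archive. Feature values,
# fitness, and grid size are illustrative stand-ins, not the paper's
# actual settings.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for per-sound audio descriptors (e.g. spectral centroid,
# flatness, zero-crossing rate, ...): 500 sounds x 40 features.
features = rng.normal(size=(500, 40))
fitness = rng.random(500)          # stand-in objective quality scores

# Reduce to a 2-D behaviour space spanned by the top two principal axes.
pca = PCA(n_components=2).fit(features)
behaviour = pca.transform(features)

# Discretise each axis into a fixed number of bins (the archive grid).
BINS = 10
lo, hi = behaviour.min(axis=0), behaviour.max(axis=0)
cells = np.clip(((behaviour - lo) / (hi - lo) * BINS).astype(int), 0, BINS - 1)

# MAP-Elites archive: each cell keeps only its highest-fitness sound.
archive = {}
for idx, (cell, fit) in enumerate(zip(map(tuple, cells), fitness)):
    if cell not in archive or fit > archive[cell][1]:
        archive[cell] = (idx, fit)

coverage = len(archive) / BINS**2
print(f"cells filled: {len(archive)}/{BINS**2} (coverage {coverage:.0%})")
```

The "grid coverage" numbers reported in the results table correspond to the final `coverage` ratio: the fraction of cells that ever receive an elite.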
Results & Findings
| Strategy | Grid Coverage (Diversity) | Avg. Quality | Notes |
|---|---|---|---|
| Handcrafted descriptors | ~45 % | High | Limited to designer‑chosen dimensions; many cells never visited. |
| Static PCA (k=10) | ~68 % | Comparable | Linear reduction captures most variance, enabling broader exploration. |
| Dynamic PCA (re‑train every 200 gen) | ~78 % | Slightly higher | Continual reshaping sustains evolutionary pressure, avoids stagnation. |
| Static Autoencoder | ~62 % | Slightly lower | Non‑linear mapping adds complexity but does not beat PCA here. |
| Dynamic Autoencoder | ~70 % | Similar to static PCA | Over‑fitting risk; benefits offset by extra training cost. |
Takeaway: Automatic, unsupervised behaviour spaces dramatically increase the number of distinct sonic niches discovered, and a simple periodic retraining (dynamic PCA) yields the best trade‑off between diversity, quality, and computational overhead.
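The dynamic‑reconfiguration schedule behind that takeaway can be sketched as a loop that refits the PCA model every N generations on the current population. This is a schematic sketch under stated assumptions: the variation step is faked with random perturbations, and `RETRAIN_EVERY`, the population size, and the grid size are illustrative, not the paper's values.

```python
# Minimal sketch of the dynamic-reconfiguration loop: every
# RETRAIN_EVERY generations the PCA model is refit on the current
# elites, reshaping the behaviour space around where the population
# now lies. Population dynamics are faked with random mutations;
# only the reshaping logic is the point.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
BINS, RETRAIN_EVERY = 8, 50

def to_cells(points, bins):
    """Map 2-D behaviour points onto integer grid coordinates."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    scaled = (points - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.clip((scaled * bins).astype(int), 0, bins - 1)

elites = rng.normal(size=(100, 40))   # stand-in descriptor vectors
pca = PCA(n_components=2).fit(elites)

for gen in range(1, 201):
    # Fake variation step: perturb a random elite's descriptors.
    child = elites[rng.integers(len(elites))] + rng.normal(scale=0.1, size=40)
    elites = np.vstack([elites, child])

    if gen % RETRAIN_EVERY == 0:
        # Periodic retraining redefines the grid axes and boundaries.
        pca = PCA(n_components=2).fit(elites)

cells = to_cells(pca.transform(elites), BINS)
print("distinct cells after reshaping:", len({tuple(c) for c in cells}))
```

In a full system, each retraining would be followed by re‑binning the archive in the new coordinates, so elites compete under the updated cell boundaries.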
Practical Implications
- Plug‑and‑play sound‑design tools: Developers can embed a PCA‑based MAP‑Elites module into DAWs, game audio engines, or procedural music generators without needing domain experts to define feature sets.
- Scalable exploration: Because PCA is computationally cheap, the approach scales to millions of synth configurations, making it viable for cloud‑based sound‑banks or on‑device synthesis on modern GPUs/NPUs.
- Reduced designer bias: Removing handcrafted descriptors lessens the hidden aesthetic biases they encode, allowing AI‑driven composers to surface novel timbres that human designers might otherwise overlook.
- Rapid prototyping: The dynamic reconfiguration loop can be exposed as a UI knob (“exploration refresh”) for artists, giving them control over how aggressively the system seeks new sonic territories.
Limitations & Future Work
- Feature dependence: The method still relies on an initial set of low‑level audio descriptors; if these miss perceptually relevant cues, the reduced space may be sub‑optimal.
- Retraining schedule: The paper uses a fixed interval for model updates; adaptive schedules (e.g., triggered by stagnation metrics) could improve efficiency.
- Autoencoder depth: Only shallow autoencoders were tested; deeper or variational models might capture richer non‑linear relationships but require careful regularisation.
- Real‑time constraints: While PCA is fast, autoencoder retraining can be costly for on‑the‑fly applications; future work could explore incremental learning or lightweight neural architectures.
By automating the definition and evolution of sonic behaviour spaces, this research opens the door to more autonomous, diverse, and unbiased sound‑generation systems—an exciting prospect for developers building the next generation of interactive audio experiences.
Authors
- Björn Þór Jónsson
- Çağrı Erdem
- Stefano Fasciani
- Kyrre Glette
Paper Information
- arXiv ID: 2512.02783v1
- Categories: cs.SD, cs.NE
- Published: December 2, 2025