[Paper] Evaluating Disentangled Representations for Controllable Music Generation
Source: arXiv:2602.10058v1
Overview
The paper “Evaluating Disentangled Representations for Controllable Music Generation” investigates how well current unsupervised models actually separate musical concepts such as structure vs. timbre (or local vs. global features). By probing the learned embeddings with a systematic framework, the authors expose a gap between the intended semantics of these representations and what they truly capture—an insight that matters for anyone building AI‑driven music tools that need fine‑grained control.
Key Contributions
Comprehensive probing framework that evaluates disentangled music embeddings along four axes:
- Informativeness – how much task‑relevant information is retained.
- Equivariance – whether controlled transformations (e.g., pitch shift) are reflected linearly in the latent space.
- Invariance – whether unrelated factors (e.g., timbre when probing structure) stay constant.
- Disentanglement – the degree to which each latent dimension isolates a single musical factor.
Benchmark of six state‑of‑the‑art unsupervised models employing diverse disentanglement strategies (inductive biases, data augmentations, adversarial losses, staged training).
Ablation studies that isolate the impact of individual design choices (e.g., adding a timbre‑adversarial head vs. using pitch‑preserving augmentations).
Cross‑dataset analysis on both synthetic (MIDI‑derived) and real‑world audio collections, demonstrating that results are not dataset‑specific.
Critical finding: many “disentangled” embeddings still entangle structure and timbre, limiting reliable controllability in downstream generation pipelines.
Methodology
Model selection – Choose a representative set of music‑audio encoders that claim to learn disentangled latent spaces.
- Variational autoencoders (VAEs) with structured priors
- Contrastive models using pitch‑preserving augmentations
- Adversarially regularized encoders
Probing pipeline – For each model:
- Freeze the encoder.
- Train lightweight linear probes (logistic regression or ridge regression) to predict a suite of musical attributes, e.g., beat position, chord progression, instrument class, pitch contour, etc.
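The probing step can be sketched with scikit-learn. The embeddings and labels below are synthetic stand-ins for actual frozen-encoder outputs and attribute annotations (this is an illustrative sketch, not the paper's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in for frozen-encoder embeddings: 500 clips, 64-dim latents.
# In practice these would come from the encoder with gradients disabled.
latents = rng.normal(size=(500, 64))
# Stand-in labels for one probed attribute (e.g., a 4-way instrument class),
# made weakly recoverable from the first four latent dimensions.
labels = latents[:, :4].argmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    latents, labels, test_size=0.2, random_state=0)

# Lightweight linear probe: logistic regression on the frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
probe_acc = accuracy_score(y_test, probe.predict(X_test))
print(f"probe accuracy: {probe_acc:.2f}")
```

For continuous attributes (e.g., pitch contour), the same pattern applies with ridge regression and R² in place of logistic regression and accuracy.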
Controlled transformations – Apply known audio manipulations to test:
- Equivariance – Does the latent representation move predictably with the transformation?
- Invariance – Does the unrelated sub‑space remain unchanged?
Examples of transformations: pitch shifting, time‑stretching, timbre substitution.
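A minimal sketch of the equivariance/invariance check, under two explicit assumptions: the latent space is hypothetically split into a "pitch" sub-space (dims 0–15) and a "timbre" sub-space (dims 16–31), and the transformation's effect is simulated directly in latent space rather than by re-encoding shifted audio:

```python
import numpy as np

rng = np.random.default_rng(1)
n_clips, d = 200, 32
z_before = rng.normal(size=(n_clips, d))

# Simulate a pitch shift of k semitones moving the pitch sub-space linearly
# along a fixed direction (plus noise), leaving the timbre sub-space untouched.
semitones = rng.integers(-6, 7, size=n_clips)
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
z_after = z_before.copy()
z_after[:, :16] += (semitones[:, None] * 0.3 * direction
                    + 0.05 * rng.normal(size=(n_clips, 16)))

# Equivariance: correlate transformation magnitude with the signed latent
# displacement projected onto the shift direction.
disp = (z_after - z_before)[:, :16] @ direction
equivariance_r = np.corrcoef(semitones, disp)[0, 1]

# Invariance: how much the unrelated (timbre) sub-space moved at all.
timbre_drift = np.var((z_after - z_before)[:, 16:])
print(f"equivariance r = {equivariance_r:.2f}, timbre drift = {timbre_drift:.4f}")
```

With a real encoder, `z_before` and `z_after` would be embeddings of the clip before and after the audio manipulation; a high correlation and a near-zero drift are what the paper's equivariance and invariance axes reward.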
Metrics
| Axis | Metric | Description |
|---|---|---|
| Informativeness | Probe accuracy / R² | Compared to a fully supervised baseline |
| Equivariance | Correlation (transformation magnitude ↔ latent shift) | Measures the linear relationship |
| Invariance | Variance of the "other" sub-space under transformation | Lower variance = better invariance |
| Disentanglement | Mutual Information Gap (MIG) & SAP scores (audio-adapted) | Quantifies factor separation |

Ablations – Retrain selected models with/without specific components (e.g., removing the adversarial loss) to assess how each trick influences the four axes.
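As a concrete illustration, MIG can be estimated by discretizing each latent dimension and, for each ground-truth factor, taking the gap between the two most informative latent dimensions, normalized by the factor's entropy. The toy data and histogram binning below are assumptions for illustration, not the paper's audio-adapted variant:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(latents, factors, n_bins=20):
    """Mutual Information Gap: mean over factors of the gap between the
    top-two latent/factor mutual informations, normalized by factor entropy."""
    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        # Entropy of the (discrete) ground-truth factor, in nats.
        _, counts = np.unique(v, return_counts=True)
        p = counts / counts.sum()
        h = -(p * np.log(p)).sum()
        mis = []
        for j in range(latents.shape[1]):
            # Discretize the continuous latent dimension before measuring MI.
            edges = np.histogram_bin_edges(latents[:, j], n_bins)
            mis.append(mutual_info_score(v, np.digitize(latents[:, j], edges)))
        top2 = sorted(mis, reverse=True)[:2]
        gaps.append((top2[0] - top2[1]) / h)
    return float(np.mean(gaps))

rng = np.random.default_rng(2)
# Toy data: factor 0 drives latent dim 0, factor 1 drives latent dim 1,
# and the remaining latent dims are pure noise (a well-disentangled case).
factors = rng.integers(0, 4, size=(1000, 2))
latents = rng.normal(scale=0.1, size=(1000, 4))
latents[:, 0] += factors[:, 0]
latents[:, 1] += factors[:, 1]
print(f"MIG: {mig(latents, factors):.2f}")
```

An entangled embedding, where several latent dimensions carry similar information about a factor, would shrink the top-two gap and push the score toward 0.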
Results & Findings
| Model / Strategy | Informativeness | Equivariance (pitch) | Invariance (timbre) | Disentanglement (MIG) |
|---|---|---|---|---|
| VAE w/ structured prior | ★★★★ | ★★ | ★★ | ★★ |
| Contrastive + pitch‑aug | ★★★ | ★★★★ | ★★ | ★★ |
| Adversarial timbre‑scrub | ★★★ | ★★ | ★★★★ | ★ |
| Staged (pre‑train → fine‑tune) | ★★★★ | ★★★ | ★★★ | ★★ |
| Baseline (no disentanglement) | ★★★★ | ★★ | ★★ | ★ |
Key takeaways
- Informativeness is generally high across all models—latent spaces retain enough musical signal for downstream tasks.
- Equivariance is inconsistent; only contrastive models with explicit pitch‑preserving augmentations reliably map pitch shifts to linear latent changes.
- Invariance to timbre is rarely achieved; even adversarial approaches leave residual timbre leakage.
- Disentanglement scores are modest, indicating that most embeddings still mix structure and timbre.
- Ablation insights:
- Data augmentations contribute more to equivariance than adversarial losses.
- Inductive biases (e.g., hierarchical encoders) improve informativeness but do not substantially boost disentanglement.
Practical Implications
Tool builders (e.g., DAW plugins, generative‑music APIs) cannot assume that a “structure” latent will stay constant when swapping instruments, nor that a “timbre” latent will preserve rhythmic content. Extra conditioning or post‑hoc correction may be required.
Interactive music generation—where a user drags a slider to change “style” or “groove”—needs explicit verification that the underlying latent behaves as intended; otherwise, the UI may feel unpredictable.
Dataset curation matters: models trained on synthetic MIDI‑derived audio show slightly cleaner equivariance, suggesting that pre‑training on clean, well‑annotated data could be a practical shortcut before fine‑tuning on raw recordings.
Design guidelines: Incorporating pitch‑preserving augmentations and hierarchical encoder structures appears more beneficial than adversarial timbre scrubbing for achieving controllable latent spaces.
Evaluation pipelines: The probing framework itself can be integrated into CI pipelines for music‑generation models, giving developers a quantitative sanity check before releasing new versions.
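One way such a CI gate could look in practice — the metric names and threshold values below are hypothetical placeholders, not numbers from the paper:

```python
# Hypothetical CI gate: fail the build if probing metrics regress below
# agreed thresholds. Metric values would come from the probing runs.
THRESHOLDS = {
    "informativeness_acc": 0.70,   # minimum probe accuracy
    "equivariance_r": 0.60,        # minimum pitch-shift correlation
    "timbre_drift_var": 0.10,      # maximum variance of the unrelated sub-space
}

def check_latent_quality(metrics: dict) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, threshold in THRESHOLDS.items():
        value = metrics[name]
        # Drift is an upper bound; the other two metrics are lower bounds.
        ok = value <= threshold if name.endswith("_var") else value >= threshold
        if not ok:
            failures.append(f"{name}={value:.2f} violates threshold {threshold}")
    return failures

report = check_latent_quality(
    {"informativeness_acc": 0.82, "equivariance_r": 0.45, "timbre_drift_var": 0.03})
print(report)
```

Here the sample run flags only the equivariance regression, which is exactly the kind of pre-release sanity check the framework enables.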
Limitations & Future Work
Scope of transformations
- The study examines only pitch, time‑stretch, and timbre swaps.
- Other musical dimensions (e.g., articulation, dynamics) remain untested.
Audio quality not assessed
- Latent properties are measured, but downstream generation quality (e.g., perceptual realism) is inferred only indirectly.
Model diversity
- Only six models were evaluated.
- Newer diffusion‑based or transformer encoders could behave differently.
Human perception
- Probing metrics are statistical; a user study is needed to confirm whether the observed entanglements affect perceived controllability.
Future directions
- Explore joint disentanglement objectives that combine contrastive augmentations with adversarial regularization.
- Extend the framework to multimodal settings (e.g., audio‑score alignment) to capture richer musical semantics.
Authors
- Laura Ibáñez‑Martínez
- Chukwuemeka Nkama
- Andrea Poltronieri
- Xavier Serra
- Martín Rocamora
Paper Information
- arXiv ID: 2602.10058v1
- Categories: cs.SD, cs.LG, eess.AS
- Published: February 10, 2026