[Paper] Evaluating Disentangled Representations for Controllable Music Generation
Source: arXiv:2602.10058v1
Overview
The paper “Evaluating Disentangled Representations for Controllable Music Generation” investigates how well current unsupervised models actually separate musical concepts such as structure vs. timbre (or local vs. global features). By probing the learned embeddings with a systematic framework, the authors expose a gap between the intended semantics of these representations and what they truly capture—an insight that matters for anyone building AI‑driven music tools that need fine‑grained control.
Key Contributions
Comprehensive probing framework that evaluates disentangled music embeddings along four axes:
- Informativeness – how much task‑relevant information is retained.
- Equivariance – whether controlled transformations (e.g., pitch shift) are reflected linearly in the latent space.
- Invariance – whether unrelated factors (e.g., timbre when probing structure) stay constant.
- Disentanglement – the degree to which each latent dimension isolates a single musical factor.
Benchmark of six state‑of‑the‑art unsupervised models employing diverse disentanglement strategies (inductive biases, data augmentations, adversarial losses, staged training).
Ablation studies that isolate the impact of individual design choices (e.g., adding a timbre‑adversarial head vs. using pitch‑preserving augmentations).
Cross‑dataset analysis on both synthetic (MIDI‑derived) and real‑world audio collections, demonstrating that results are not dataset‑specific.
Critical finding: many “disentangled” embeddings still entangle structure and timbre, limiting reliable controllability in downstream generation pipelines.
Methodology
Model selection – Choose a representative set of music‑audio encoders that claim to learn disentangled latent spaces.
- Variational autoencoders (VAEs) with structured priors
- Contrastive models using pitch‑preserving augmentations
- Adversarially regularized encoders
Probing pipeline – For each model:
- Freeze the encoder.
- Train lightweight linear probes (logistic regression or ridge regression) to predict a suite of musical attributes, e.g., beat position, chord progression, instrument class, pitch contour, etc.
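The probing step can be sketched with scikit-learn. The embeddings and labels below are synthetic stand-ins for actual frozen-encoder outputs and attribute annotations (this is an illustrative sketch, not the paper's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in for frozen-encoder embeddings: 500 clips, 64-dim latents.
# In practice these would come from the encoder with gradients disabled.
latents = rng.normal(size=(500, 64))
# Stand-in labels for one probed attribute (e.g., a 4-way instrument class),
# made weakly recoverable from the first four latent dimensions.
labels = latents[:, :4].argmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    latents, labels, test_size=0.2, random_state=0)

# Lightweight linear probe: logistic regression on the frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
probe_acc = accuracy_score(y_test, probe.predict(X_test))
print(f"probe accuracy: {probe_acc:.2f}")
```

For continuous attributes (e.g., pitch contour), the same pattern applies with ridge regression and R² in place of logistic regression and accuracy.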
Controlled transformations – Apply known audio manipulations to test:
- Equivariance – Does the latent representation move predictably with the transformation?
- Invariance – Does the unrelated sub‑space remain unchanged?
Examples of transformations: pitch shifting, time‑stretching, timbre substitution.
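A minimal sketch of the equivariance/invariance check, under two explicit assumptions: the latent space is hypothetically split into a "pitch" sub-space (dims 0–15) and a "timbre" sub-space (dims 16–31), and the transformation's effect is simulated directly in latent space rather than by re-encoding shifted audio:

```python
import numpy as np

rng = np.random.default_rng(1)
n_clips, d = 200, 32
z_before = rng.normal(size=(n_clips, d))

# Simulate a pitch shift of k semitones moving the pitch sub-space linearly
# along a fixed direction (plus noise), leaving the timbre sub-space untouched.
semitones = rng.integers(-6, 7, size=n_clips)
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
z_after = z_before.copy()
z_after[:, :16] += (semitones[:, None] * 0.3 * direction
                    + 0.05 * rng.normal(size=(n_clips, 16)))

# Equivariance: correlate transformation magnitude with the signed latent
# displacement projected onto the shift direction.
disp = (z_after - z_before)[:, :16] @ direction
equivariance_r = np.corrcoef(semitones, disp)[0, 1]

# Invariance: how much the unrelated (timbre) sub-space moved at all.
timbre_drift = np.var((z_after - z_before)[:, 16:])
print(f"equivariance r = {equivariance_r:.2f}, timbre drift = {timbre_drift:.4f}")
```

With a real encoder, `z_before` and `z_after` would be embeddings of the clip before and after the audio manipulation; a high correlation and a near-zero drift are what the paper's equivariance and invariance axes reward.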
Metrics
| Axis | Metric | Description |
|---|---|---|
| Informativeness | Probe accuracy / R² | Compared to a fully supervised baseline |
| Equivariance | Correlation (transformation magnitude ↔ latent shift) | Measures the linear relationship |
| Invariance | Variance of the "other" sub-space under transformation | Lower variance = better invariance |
| Disentanglement | Mutual Information Gap (MIG) & SAP scores (audio-adapted) | Quantifies factor separation |

Ablations – Retrain selected models with/without specific components (e.g., removing the adversarial loss) to assess how each trick influences the four axes.
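As a concrete illustration, MIG can be estimated by discretizing each latent dimension and, for each ground-truth factor, taking the gap between the two most informative latent dimensions, normalized by the factor's entropy. The toy data and histogram binning below are assumptions for illustration, not the paper's audio-adapted variant:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(latents, factors, n_bins=20):
    """Mutual Information Gap: mean over factors of the gap between the
    top-two latent/factor mutual informations, normalized by factor entropy."""
    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        # Entropy of the (discrete) ground-truth factor, in nats.
        _, counts = np.unique(v, return_counts=True)
        p = counts / counts.sum()
        h = -(p * np.log(p)).sum()
        mis = []
        for j in range(latents.shape[1]):
            # Discretize the continuous latent dimension before measuring MI.
            edges = np.histogram_bin_edges(latents[:, j], n_bins)
            mis.append(mutual_info_score(v, np.digitize(latents[:, j], edges)))
        top2 = sorted(mis, reverse=True)[:2]
        gaps.append((top2[0] - top2[1]) / h)
    return float(np.mean(gaps))

rng = np.random.default_rng(2)
# Toy data: factor 0 drives latent dim 0, factor 1 drives latent dim 1,
# and the remaining latent dims are pure noise (a well-disentangled case).
factors = rng.integers(0, 4, size=(1000, 2))
latents = rng.normal(scale=0.1, size=(1000, 4))
latents[:, 0] += factors[:, 0]
latents[:, 1] += factors[:, 1]
print(f"MIG: {mig(latents, factors):.2f}")
```

An entangled embedding, where several latent dimensions carry similar information about a factor, would shrink the top-two gap and push the score toward 0.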
Results & Findings
| Model / Strategy | Informativeness | Equivariance (pitch) | Invariance (timbre) | Disentanglement (MIG) |
|---|---|---|---|---|
| VAE w/ structured prior | ★★★★ | ★★ | ★★ | ★★ |
| Contrastive + pitch‑aug | ★★★ | ★★★★ | ★★ | ★★ |
| Adversarial timbre‑scrub | ★★★ | ★★ | ★★★★ | ★ |
| Staged (pre‑train → fine‑tune) | ★★★★ | ★★★ | ★★★ | ★★ |
| Baseline (no disentanglement) | ★★★★ | ★★ | ★★ | ★ |
Key takeaways
- Informativeness is generally high across all models—latent spaces retain enough musical signal for downstream tasks.
- Equivariance is inconsistent; only contrastive models with explicit pitch‑preserving augmentations reliably map pitch shifts to linear latent changes.
- Invariance to timbre is rarely achieved; even adversarial approaches leave residual timbre leakage.
- Disentanglement scores are modest, indicating that most embeddings still mix structure and timbre.
- Ablation insights:
- Data augmentations contribute more to equivariance than adversarial losses.
- Inductive biases (e.g., hierarchical encoders) improve informativeness but do not substantially boost disentanglement.
Practical Implications
Tool builders (e.g., DAW plugins, generative‑music APIs) cannot assume that a “structure” latent will stay constant when swapping instruments, nor that a “timbre” latent will preserve rhythmic content. Extra conditioning or post‑hoc correction may be required.
Interactive music generation—where a user drags a slider to change “style” or “groove”—needs explicit verification that the underlying latent behaves as intended; otherwise, the UI may feel unpredictable.
Dataset curation matters: models trained on synthetic MIDI‑derived audio show slightly cleaner equivariance, suggesting that pre‑training on clean, well‑annotated data could be a practical shortcut before fine‑tuning on raw recordings.
Design guidelines: Incorporating pitch‑preserving augmentations and hierarchical encoder structures appears more beneficial than adversarial timbre scrubbing for achieving controllable latent spaces.
Evaluation pipelines: The probing framework itself can be integrated into CI pipelines for music‑generation models, giving developers a quantitative sanity check before releasing new versions.
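One way such a CI gate could look in practice — the metric names and threshold values below are hypothetical placeholders, not numbers from the paper:

```python
# Hypothetical CI gate: fail the build if probing metrics regress below
# agreed thresholds. Metric values would come from the probing runs.
THRESHOLDS = {
    "informativeness_acc": 0.70,   # minimum probe accuracy
    "equivariance_r": 0.60,        # minimum pitch-shift correlation
    "timbre_drift_var": 0.10,      # maximum variance of the unrelated sub-space
}

def check_latent_quality(metrics: dict) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, threshold in THRESHOLDS.items():
        value = metrics[name]
        # Drift is an upper bound; the other two metrics are lower bounds.
        ok = value <= threshold if name.endswith("_var") else value >= threshold
        if not ok:
            failures.append(f"{name}={value:.2f} violates threshold {threshold}")
    return failures

report = check_latent_quality(
    {"informativeness_acc": 0.82, "equivariance_r": 0.45, "timbre_drift_var": 0.03})
print(report)
```

Here the sample run flags only the equivariance regression, which is exactly the kind of pre-release sanity check the framework enables.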
Limitations & Future Work
Scope of transformations
- The study examines only pitch, time‑stretch, and timbre swaps.
- Other musical dimensions (e.g., articulation, dynamics) remain untested.
Audio quality not assessed
- Latent properties are measured, but downstream generation quality (e.g., perceptual realism) is inferred only indirectly.
Model diversity
- Only six models were evaluated.
- Newer diffusion‑based or transformer encoders could behave differently.
Human perception
- Probing metrics are statistical; a user study is needed to confirm whether the observed entanglements affect perceived controllability.
Future directions
- Explore joint disentanglement objectives that combine contrastive augmentations with adversarial regularization.
- Extend the framework to multimodal settings (e.g., audio‑score alignment) to capture richer musical semantics.
Authors
- Laura Ibáñez‑Martínez
- Chukwuemeka Nkama
- Andrea Poltronieri
- Xavier Serra
- Martín Rocamora
Paper Information
- arXiv ID: 2602.10058v1
- Categories: cs.SD, cs.LG, eess.AS
- Published: February 10, 2026