[Paper] Evaluating Disentangled Representations for Controllable Music Generation

Published: February 10, 2026
Source: arXiv:2602.10058v1

Overview

The paper “Evaluating Disentangled Representations for Controllable Music Generation” investigates how well current unsupervised models actually separate musical concepts such as structure vs. timbre (or local vs. global features). By probing the learned embeddings with a systematic framework, the authors expose a gap between the intended semantics of these representations and what they truly capture—an insight that matters for anyone building AI‑driven music tools that need fine‑grained control.

Key Contributions

  • Comprehensive probing framework that evaluates disentangled music embeddings along four axes:

    1. Informativeness – how much task‑relevant information is retained.
    2. Equivariance – whether controlled transformations (e.g., pitch shift) are reflected linearly in the latent space.
    3. Invariance – whether unrelated factors (e.g., timbre when probing structure) stay constant.
    4. Disentanglement – the degree to which each latent dimension isolates a single musical factor.
  • Benchmark of six state‑of‑the‑art unsupervised models employing diverse disentanglement strategies (inductive biases, data augmentations, adversarial losses, staged training).

  • Ablation studies that isolate the impact of individual design choices (e.g., adding a timbre‑adversarial head vs. using pitch‑preserving augmentations).

  • Cross‑dataset analysis on both synthetic (MIDI‑derived) and real‑world audio collections, demonstrating that results are not dataset‑specific.

  • Critical finding: many “disentangled” embeddings still entangle structure and timbre, limiting reliable controllability in downstream generation pipelines.

Methodology

  1. Model selection – Choose a representative set of music‑audio encoders that claim to learn disentangled latent spaces.

    • Variational autoencoders (VAEs) with structured priors
    • Contrastive models using pitch‑preserving augmentations
    • Adversarially regularized encoders
  2. Probing pipeline – For each model:

    • Freeze the encoder.
    • Train lightweight linear probes (logistic regression or ridge regression) to predict musical attributes such as beat position, chord progression, instrument class, and pitch contour (a minimal probing sketch follows this list).
  3. Controlled transformations – Apply known audio manipulations to test:

    • Equivariance – Does the latent representation move predictably with the transformation?
    • Invariance – Does the unrelated sub‑space remain unchanged?

    Examples of transformations: pitch shifting, time‑stretching, timbre substitution (a sketch of these checks also follows the list).

  4. Metrics – One quantitative metric per axis (the MIG sketch after this list shows one way to compute the disentanglement score):

    | Axis | Metric | Description |
    | --- | --- | --- |
    | Informativeness | Probe accuracy / R² | Compared against a fully supervised baseline |
    | Equivariance | Correlation between transformation magnitude and latent shift | Measures how linear the latent response is |
    | Invariance | Variance of the “other” sub‑space under transformation | Lower variance = better invariance |
    | Disentanglement | Mutual Information Gap (MIG) & SAP scores (audio‑adapted) | Quantify factor separation |
  5. Ablations – Retrain selected models with/without specific components (e.g., removing the adversarial loss) to assess how each trick influences the four axes.
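
To make step 2 concrete, here is a minimal freeze‑and‑probe sketch. The encoder and data are hypothetical stand‑ins (the paper does not publish this code); the point is that only a lightweight linear probe is trained while the encoder itself stays frozen.

```python
# Minimal freeze-and-probe sketch. `encode` stands in for any frozen
# pretrained encoder; X_audio / y_attr are synthetic placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((128, 64))

def encode(batch: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen encoder: audio features -> 64-d latents."""
    return batch @ W_enc

# Placeholder "audio" features and a discrete attribute (e.g., instrument class).
X_audio = rng.standard_normal((500, 128))
y_attr = rng.integers(0, 10, size=500)

Z = encode(X_audio)  # embeddings come from the frozen encoder; no gradients flow
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y_attr, random_state=0)

# Only this lightweight linear probe is trained.
probe = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print(f"probe accuracy (informativeness): {probe.score(Z_te, y_te):.3f}")
```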
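
The equivariance and invariance checks of steps 3–4 can be sketched the same way. Purely for illustration, this version assumes the first half of the latent vector is the nominal pitch/structure sub‑space and the second half the timbre sub‑space (a labeling convention assumed here, not prescribed by the paper); the encoder is again a stand‑in, and the pitch shift uses librosa.

```python
# Equivariance / invariance sketch under pitch shifting.
import numpy as np
import librosa

SR = 22050
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 64))

def encode(y: np.ndarray) -> np.ndarray:
    """Stand-in encoder: mean magnitude spectrum -> 64-d latent."""
    spec = np.abs(librosa.stft(y, n_fft=4096))[:2048].mean(axis=1)
    return spec @ W

y = librosa.tone(440.0, sr=SR, duration=1.0)  # test signal: an A4 tone
z0 = encode(y)

shifts = np.arange(-6, 7)  # pitch shifts in semitones
latents = np.stack([
    encode(librosa.effects.pitch_shift(y, sr=SR, n_steps=int(k)))
    for k in shifts
])
pitch_part, timbre_part = latents[:, :32], latents[:, 32:]

# Equivariance: does displacement in the pitch sub-space track shift size?
disp = np.linalg.norm(pitch_part - z0[:32], axis=1)
corr = np.corrcoef(np.abs(shifts), disp)[0, 1]
print(f"equivariance corr(|shift|, latent displacement): {corr:.3f}")

# Invariance: the timbre sub-space should barely move under pitch shifts.
print(f"timbre sub-space variance under pitch shift: "
      f"{timbre_part.var(axis=0).mean():.4f}")
```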
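
For the disentanglement metric, a common MIG formulation (Chen et al.'s Mutual Information Gap) takes, per ground‑truth factor, the gap between the two latent dimensions with the highest mutual information, normalized by the factor's entropy. The sketch below uses synthetic factors and latents as placeholders; the paper's audio‑adapted variant may differ in detail.

```python
# MIG sketch for discrete ground-truth factors and continuous latents.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mig(Z: np.ndarray, V: np.ndarray, seed: int = 0) -> float:
    """Mean over factors of the gap between the two most informative
    latent dimensions, normalized by the factor's entropy."""
    gaps = []
    for k in range(V.shape[1]):
        v = V[:, k]
        mi = mutual_info_classif(Z, v, random_state=seed)  # MI per latent dim (nats)
        second, top = np.sort(mi)[-2:]
        p = np.bincount(v) / len(v)
        h = -(p[p > 0] * np.log(p[p > 0])).sum()  # entropy of the factor
        gaps.append((top - second) / h)
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
V = rng.integers(0, 5, size=(1000, 2))  # two discrete factors (e.g., pitch class, instrument)
Z = np.concatenate([
    V + 0.1 * rng.standard_normal(V.shape),  # dims 0-1 track the factors
    rng.standard_normal((1000, 6)),          # remaining dims are noise
], axis=1)

print(f"MIG: {mig(Z, V):.3f}")
```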

Results & Findings

The paper reports a star‑rating comparison of five strategies along the four axes (informativeness, pitch equivariance, timbre invariance, and disentanglement via MIG): a VAE with structured prior, a contrastive model with pitch augmentations, adversarial timbre scrubbing, staged training (pre‑train → fine‑tune), and a baseline with no disentanglement objective. The main patterns are summarized below.

Key takeaways

  • Informativeness is generally high across all models—latent spaces retain enough musical signal for downstream tasks.
  • Equivariance is inconsistent; only contrastive models with explicit pitch‑preserving augmentations reliably map pitch shifts to linear latent changes.
  • Invariance to timbre is rarely achieved; even adversarial approaches leave residual timbre leakage.
  • Disentanglement scores are modest, indicating that most embeddings still mix structure and timbre.
  • Ablation insights:
    • Data augmentations contribute more to equivariance than adversarial losses.
    • Inductive biases (e.g., hierarchical encoders) improve informativeness but do not substantially boost disentanglement.

Practical Implications

  • Tool builders (e.g., DAW plugins, generative‑music APIs) cannot assume that a “structure” latent will stay constant when swapping instruments, nor that a “timbre” latent will preserve rhythmic content. Extra conditioning or post‑hoc correction may be required.

  • Interactive music generation—where a user drags a slider to change “style” or “groove”—needs explicit verification that the underlying latent behaves as intended; otherwise, the UI may feel unpredictable.

  • Dataset curation matters: models trained on synthetic MIDI‑derived audio show slightly cleaner equivariance, suggesting that pre‑training on clean, well‑annotated data could be a practical shortcut before fine‑tuning on raw recordings.

  • Design guidelines: Incorporating pitch‑preserving augmentations and hierarchical encoder structures appears more beneficial than adversarial timbre scrubbing for achieving controllable latent spaces.

  • Evaluation pipelines: The probing framework itself can be integrated into CI pipelines for music‑generation models, giving developers a quantitative sanity check before releasing new versions; a minimal gate is sketched below.
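
As one illustration, such a gate could be a pytest module. The thresholds and metric helpers here are invented for the example and would, in a real pipeline, wrap probing code like the sketches above.

```python
# Hypothetical CI gate (pytest style): fail a release if latent-quality
# metrics for a candidate encoder regress below agreed thresholds.
# Thresholds and placeholder helpers are illustrative, not from the paper.

EQUIVARIANCE_MIN = 0.8   # min corr(|pitch shift|, latent displacement)
TIMBRE_LEAK_MAX = 0.05   # max mean variance of the timbre sub-space

def measure_pitch_equivariance() -> float:
    """Placeholder: would run the equivariance sketch on held-out audio."""
    return 0.92

def measure_timbre_leakage() -> float:
    """Placeholder: would run the invariance sketch on held-out audio."""
    return 0.03

def test_pitch_equivariance():
    corr = measure_pitch_equivariance()
    assert corr >= EQUIVARIANCE_MIN, f"pitch equivariance regressed: {corr:.3f}"

def test_timbre_invariance():
    leak = measure_timbre_leakage()
    assert leak <= TIMBRE_LEAK_MAX, f"timbre leakage too high: {leak:.4f}"
```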

Limitations & Future Work

  • Scope of transformations

    • The study examines only pitch, time‑stretch, and timbre swaps.
    • Other musical dimensions (e.g., articulation, dynamics) remain untested.
  • Audio quality not assessed

    • Latent properties are measured, but downstream generation quality (e.g., perceptual realism) is inferred only indirectly.
  • Model diversity

    • Only six models were evaluated.
    • Newer diffusion‑based or transformer encoders could behave differently.
  • Human perception

    • Probing metrics are statistical; a user study is needed to confirm whether the observed entanglements affect perceived controllability.
  • Future directions

    • Explore joint disentanglement objectives that combine contrastive augmentations with adversarial regularization.
    • Extend the framework to multimodal settings (e.g., audio‑score alignment) to capture richer musical semantics.

Authors

  • Laura Ibáñez‑Martínez
  • Chukwuemeka Nkama
  • Andrea Poltronieri
  • Xavier Serra
  • Martín Rocamora

Paper Information

  • arXiv ID: 2602.10058v1
  • Categories: cs.SD, cs.LG, eess.AS
  • Published: February 10, 2026
  • PDF: https://arxiv.org/pdf/2602.10058v1