[Paper] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Published: February 19, 2026 at 01:23 PM EST
4 min read
Source: arXiv


Overview

The paper introduces Art2Mus, a pioneering system that composes music directly from visual artwork, bypassing the usual “image‑to‑text‑to‑audio” pipeline. By training on a newly assembled dataset of over 100,000 artwork‑music pairs, the authors demonstrate that a diffusion‑based model can learn to translate visual style, mood, and cultural cues into coherent musical pieces.

Key Contributions

  • ArtSound dataset – 105,884 paired artworks and music tracks (plus dual‑modality captions), the largest publicly available resource for visual‑to‑audio research.
  • Direct visual conditioning – First framework (Art2Mus) that feeds visual embeddings straight into a latent diffusion model, eliminating the intermediate textual representation.
  • Cross‑modal alignment without language – Shows that meaningful artwork‑music correspondence can be learned even when the model never sees any textual description of the content.
  • Comprehensive evaluation – Quantitative alignment scores, perceptual listening tests, and ablation studies that compare visual‑only conditioning against strong text‑conditioned baselines.
  • Open‑source release – Code, pretrained checkpoints, and the dataset will be released, enabling reproducibility and downstream applications.

Methodology

  1. Data collection – The authors merged the ArtGraph visual collection with tracks from the Free Music Archive, then used automated metadata matching and human verification to create high‑quality artwork‑music pairs. Each pair also received two captions (one describing the image, one describing the music) for auxiliary analysis.
  2. Visual embedding extraction – A pretrained Vision Transformer (ViT) processes the artwork and outputs a dense vector that captures color palette, composition, and stylistic attributes.
  3. Latent diffusion model (LDM) – A music‑generation LDM (trained on a large corpus of symbolic/audio music) is adapted so that its conditioning space can accept the visual embedding directly. The visual vector is projected via a learned linear mapper into the LDM’s latent space.
  4. Training objective – The model minimizes the standard diffusion loss (predicting noise added to the latent music representation) while being guided only by the visual conditioning. No text tokens are involved during forward or backward passes.
  5. Inference – Given a new artwork, its embedding is projected, fed to the LDM, and the model iteratively denoises to produce a 30‑second audio clip that reflects the visual input.
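The conditioning pathway described in steps 2–4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, the noising schedule, and the `denoiser` network are all hypothetical placeholders, and only the core idea is shown — a learned linear projection maps the ViT embedding into the LDM's conditioning space, and the model is trained with the standard noise‑prediction loss guided solely by that visual vector.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- the paper does not specify these exactly.
VIT_DIM = 768    # output size of the pretrained ViT image encoder
COND_DIM = 1024  # conditioning dimension expected by the music LDM


class VisualProjector(nn.Module):
    """Learned linear mapper from ViT embeddings to the LDM's conditioning space."""

    def __init__(self, in_dim: int = VIT_DIM, out_dim: int = COND_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, vit_embedding: torch.Tensor) -> torch.Tensor:
        return self.proj(vit_embedding)


def diffusion_training_step(latent_music, vit_embedding, projector, denoiser):
    """One step of the standard noise-prediction diffusion objective,
    conditioned only on the projected visual embedding (no text tokens).
    `denoiser` stands in for the pretrained music LDM's noise predictor."""
    noise = torch.randn_like(latent_music)
    t = torch.randint(0, 1000, (latent_music.size(0),))  # random timesteps
    # Simplified linear noising schedule, for illustration only.
    alpha = 1.0 - t.float().view(-1, 1) / 1000.0
    noisy = alpha.sqrt() * latent_music + (1.0 - alpha).sqrt() * noise
    cond = projector(vit_embedding)  # visual conditioning vector
    predicted_noise = denoiser(noisy, t, cond)
    return nn.functional.mse_loss(predicted_noise, noise)
```

At inference time the same projection feeds the conditioning input while the LDM iteratively denoises from random latent noise, so no text enters the pipeline at any stage.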

Results & Findings

  • Alignment metrics (CLAP similarity, cross‑modal retrieval) are lower than those of state‑of‑the‑art text‑conditioned models, confirming the added difficulty of removing linguistic supervision.
  • Perceptual quality – In blind listening tests, participants rated Art2Mus outputs on par with text‑conditioned baselines for musical coherence, harmony, and overall pleasantness.
  • Stylistic correspondence – Listeners consistently identified visual cues (e.g., warm color schemes → brighter timbres, abstract brushstrokes → more experimental textures) in the generated music.
  • Ablation – Removing the visual‑to‑latent projection or training the diffusion model from scratch dramatically reduces both quality and visual relevance, underscoring the importance of the projection layer and pre‑training on large music corpora.
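The alignment metrics above both reduce to similarity comparisons in a shared embedding space. The sketch below illustrates the general idea with plain cosine similarity; the embedding vectors are placeholders, and the actual CLAP model and retrieval protocol used in the paper are not reproduced here.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def clap_style_score(audio_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """CLAP-style alignment score: cosine similarity between an audio
    embedding and a reference embedding in a shared multimodal space."""
    return cosine_similarity(audio_emb, reference_emb)


def retrieval_rank(query_emb, candidate_embs):
    """Cross-modal retrieval rank of the true match (candidate 0) among all
    candidates; 0 means the true pair was retrieved first."""
    sims = [cosine_similarity(query_emb, c) for c in candidate_embs]
    order = np.argsort(sims)[::-1]           # indices by descending similarity
    return int(np.where(order == 0)[0][0])   # position of the true match
```

In this framing, a lower retrieval rank and a higher similarity score both indicate tighter artwork‑music alignment, which is the quantity the paper reports as lagging behind text‑conditioned baselines.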

Practical Implications

  • Multimedia installations – Curators can automatically generate ambient soundtracks that adapt to exhibited paintings, enhancing visitor immersion without manual scoring.
  • Creative tools for artists – Graphic designers and visual artists can prototype musical accompaniments for their work, fostering interdisciplinary collaborations.
  • Cultural heritage – Museums could enrich digitized collections with historically informed music, using visual style as a proxy for era‑specific sound palettes.
  • Game & VR content – Procedurally generated environments can be paired with on‑the‑fly music that reflects the visual theme, reducing the need for hand‑crafted audio assets.
  • Research platform – The dataset and code provide a testbed for exploring other cross‑modal generation tasks (e.g., video‑to‑music, sculpture‑to‑sound).

Limitations & Future Work

  • Alignment gap – Visual‑only conditioning still lags behind text‑based systems in precise semantic matching; hybrid approaches that incorporate lightweight textual cues might close this gap.
  • Genre diversity – The music corpus leans toward Western popular and classical styles; expanding to non‑Western or experimental genres could improve cultural fidelity.
  • Resolution of visual features – Fine‑grained details (e.g., tiny motifs) are often lost in the embedding, limiting the granularity of musical translation.
  • User control – The current system offers no direct knobs for tempo, instrumentation, or mood beyond what the artwork implicitly conveys; future interfaces could expose these parameters.
  • Evaluation metrics – Objective cross‑modal similarity scores remain imperfect proxies for human perception; richer human‑in‑the‑loop studies are needed.

Art2Mus opens a fresh avenue where visual art can directly inspire sound, turning static images into dynamic auditory experiences. As the tools mature, we can expect a new wave of AI‑assisted creativity that blurs the line between sight and hearing.

Authors

  • Ivan Rinaldi
  • Matteo Mendula
  • Nicola Fanelli
  • Florence Levé
  • Matteo Testi
  • Giovanna Castellano
  • Gennaro Vessio

Paper Information

  • arXiv ID: 2602.17599v1
  • Categories: cs.CV, cs.MM, cs.SD
  • Published: February 19, 2026