[Paper] DARC: Drum accompaniment generation with fine-grained rhythm control

Published: January 5, 2026, 01:55 PM EST
4 min read
Source: arXiv - 2601.02357v1

Overview

The paper presents DARC, a new AI model that generates drum tracks that not only fit the harmonic and melodic context of a song but also follow a user‑provided rhythm cue (e.g., a beat‑boxing line or a simple tap pattern). By extending the state‑of‑the‑art drum generator STAGE with lightweight, adapter‑based fine‑tuning, DARC gives musicians and developers fine‑grained rhythmic control without sacrificing stylistic coherence.

Key Contributions

  • Dual‑conditioning architecture: combines musical context (other stems such as bass, piano, vocals) with explicit rhythm prompts.
  • Parameter‑efficient fine‑tuning: adds a small adapter module to the pre‑trained STAGE model, keeping training costs low while enabling new control dimensions.
  • Fine‑grained rhythm prompt interface: accepts low‑fidelity rhythmic inputs (beat‑boxing, tapping, MIDI clicks) and translates them into expressive drum accompaniment; a minimal sketch of this kind of onset‑map conversion follows this list.
  • Comprehensive evaluation: objective metrics (groove similarity, onset alignment) and subjective listening tests show DARC matches or exceeds baseline drum generators in both musicality and controllability.
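
The rhythm prompt interface is described only at a high level, so here is a minimal sketch of how a low‑fidelity cue (e.g., tap timestamps) might be converted into a model‑ready onset map. The grid resolution, cue length, and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def taps_to_onset_map(tap_times_s, tempo_bpm=120.0, bars=2,
                      steps_per_beat=4, beats_per_bar=4):
    """Quantize raw tap timestamps (seconds) onto a binary onset grid.

    Illustrative pre-processing only: DARC's actual prompt representation
    is not specified here, so grid size and quantization are assumptions.
    """
    step_dur = 60.0 / tempo_bpm / steps_per_beat      # seconds per grid step
    n_steps = bars * beats_per_bar * steps_per_beat   # total grid length
    grid = np.zeros(n_steps, dtype=np.float32)
    for t in tap_times_s:
        idx = int(round(t / step_dur))                # snap each tap to the nearest step
        if 0 <= idx < n_steps:
            grid[idx] = 1.0
    return grid

# Example: a one-bar "four on the floor" tap pattern at 120 BPM
print(taps_to_onset_map([0.0, 0.5, 1.0, 1.5], bars=1))
```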

Methodology

  1. Base Model (STAGE) – a transformer‑based drum stem generator trained on large multi‑track datasets. It already learns to produce drums that match chord progressions, tempo, and overall style.
  2. Rhythm Prompt Encoder – a lightweight convolutional/RNN encoder that turns a short rhythm cue (audio waveform or MIDI clicks) into a dense embedding.
  3. Adapter Fusion Layer – a set of trainable “adapter” modules inserted into STAGE’s transformer blocks. During fine‑tuning, only these adapters and the rhythm encoder are updated, leaving the bulk of STAGE untouched (steps 2–3 are sketched in code after this list).
  4. Training Procedure – the model is trained on paired data: (a) full‑mix stems, (b) corresponding drum stems, and (c) a synthetic rhythm prompt derived from the ground‑truth drum track (e.g., a down‑sampled onset map). The loss combines a standard reconstruction term (cross‑entropy on drum token sequences) with a rhythm‑alignment term that penalizes mismatches between generated onsets and the prompt; a sketch of such a combined objective also follows this list.
  5. Inference – users feed the mix (or any subset of stems) plus a rhythm cue. The model generates a drum token sequence that is then rendered to audio via a high‑quality drum sampler.
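
Steps 2 and 3 are described only architecturally. The sketch below shows one plausible PyTorch implementation of a small rhythm encoder and a bottleneck adapter conditioned on its embedding; all layer types, dimensions, and class names are assumptions for illustration, not the STAGE/DARC code.

```python
import torch
import torch.nn as nn

class RhythmPromptEncoder(nn.Module):
    """Toy conv + GRU encoder: onset grid -> dense rhythm embedding.
    Layer sizes are illustrative assumptions, not the paper's configuration."""
    def __init__(self, d_rhythm=256):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=3, padding=1)
        self.gru = nn.GRU(64, d_rhythm, batch_first=True)

    def forward(self, onset_grid):                           # (batch, steps)
        x = torch.relu(self.conv(onset_grid.unsqueeze(1)))   # (batch, 64, steps)
        _, h = self.gru(x.transpose(1, 2))                   # h: (1, batch, d_rhythm)
        return h.squeeze(0)                                  # (batch, d_rhythm)

class RhythmAdapter(nn.Module):
    """Bottleneck adapter conditioned on the rhythm embedding; in this sketch
    the adapters (and the encoder above) are the only trainable parameters."""
    def __init__(self, d_model=512, d_rhythm=256, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.cond = nn.Linear(d_rhythm, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, hidden, rhythm_emb):                   # hidden: (batch, seq, d_model)
        z = torch.relu(self.down(hidden) + self.cond(rhythm_emb).unsqueeze(1))
        return hidden + self.up(z)                           # residual path keeps the frozen block intact
```

Under this scheme the pre‑trained transformer weights would stay frozen (requires_grad = False) and only the adapters and rhythm encoder would be updated, which is consistent with the ~2 % trainable‑parameter figure reported below.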
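
Step 4 combines a token‑level reconstruction loss with a rhythm‑alignment penalty. A minimal sketch of one way such a combined objective could look is given below; the alignment term here simply scores a predicted per‑step onset activity against the prompt's onset map, and the weighting and exact formulation are assumptions rather than the paper's definition.

```python
import torch.nn.functional as F

def darc_style_loss(drum_logits, target_tokens, onset_logits, prompt_onsets,
                    rhythm_weight=0.5):
    """Illustrative combined objective (not the paper's exact formulation).

    drum_logits:   (batch, seq, vocab)  predicted drum-token distribution
    target_tokens: (batch, seq)         ground-truth drum tokens
    onset_logits:  (batch, steps)       predicted onset activity per grid step
    prompt_onsets: (batch, steps)       binary onset map derived from the rhythm cue
    """
    # Standard reconstruction term: cross-entropy over drum token sequences.
    recon = F.cross_entropy(drum_logits.transpose(1, 2), target_tokens)
    # Rhythm-alignment term: penalize mismatch between generated onsets and the prompt.
    align = F.binary_cross_entropy_with_logits(onset_logits, prompt_onsets)
    return recon + rhythm_weight * align
```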

Results & Findings

| Metric | Baseline STAGE | DARC (with prompt) |
| --- | --- | --- |
| Groove Similarity (higher is better) | 0.71 | 0.84 |
| Onset Alignment Error (lower is better) | 0.12 s | 0.04 s |
| Human Preference (pairwise listening test) | 38 % | 62 % |

  • Rhythmic fidelity: DARC’s drum onsets align tightly with the user’s cue, reducing timing drift by ~66 % relative to the baseline (one way to compute this alignment metric is sketched after this list).
  • Stylistic consistency: Despite the added constraint, listeners rated DARC’s output as equally “in‑style” as the baseline.
  • Efficiency: Fine‑tuning required only ~2 % of the original model’s parameters and converged in under 4 hours on a single GPU.
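
The onset alignment error reported above is not defined in this summary; a common formulation is the mean absolute time difference between each cue onset and its nearest generated onset, sketched below as an assumption about what the metric measures.

```python
import numpy as np

def onset_alignment_error(cue_onsets_s, generated_onsets_s):
    """Mean absolute gap (seconds) from each cue onset to the nearest generated onset.
    Assumed definition for illustration; the paper may define the metric differently."""
    cue = np.asarray(cue_onsets_s, dtype=float)
    gen = np.asarray(generated_onsets_s, dtype=float)
    if cue.size == 0 or gen.size == 0:
        return float("nan")
    # For each cue onset, take the distance to the closest generated onset.
    diffs = np.abs(cue[:, None] - gen[None, :]).min(axis=1)
    return float(diffs.mean())

# Example: generated hits drift slightly around a steady quarter-note cue.
print(onset_alignment_error([0.0, 0.5, 1.0, 1.5], [0.02, 0.51, 0.97, 1.55]))  # ~0.028 s
```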

Practical Implications

  • Rapid prototyping for music producers – developers can embed DARC in DAWs or web‑based jam tools, letting users sketch a drum groove with a simple tap and instantly hear a full‑featured accompaniment that matches the rest of the arrangement (a hypothetical integration sketch follows this list).
  • Interactive composition assistants – game audio pipelines or adaptive soundtracks can drive drum generation from real‑time player inputs (e.g., tapping a controller) while staying musically coherent with the underlying score.
  • Low‑resource deployment – because only adapters are fine‑tuned, the model can be shipped as a plug‑in with a small additional weight, making it feasible for mobile or browser‑based applications.
  • Educational tools – drum‑learning apps can let students input a rhythm they’re practicing and instantly generate a backing track that respects the harmonic context, reinforcing timing and feel.
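
As an illustration of the DAW/jam‑tool integration mentioned in the first bullet, the sketch below shows how a host application might wire tap input into a drum‑generation call, reusing the taps_to_onset_map helper sketched earlier. The generate_drums entry point and its parameters are hypothetical placeholders; the paper does not describe a public API.

```python
from typing import Callable, Sequence

def make_tap_handler(generate_drums: Callable, stem_paths: Sequence[str],
                     tempo_bpm: float = 120.0):
    """Return a callback a DAW or jam tool could invoke when the user finishes tapping.

    `generate_drums` is a placeholder for whatever inference entry point a DARC
    deployment would expose (context stems + rhythm cue -> drum audio); this
    wiring is illustrative, not a released API.
    """
    def on_taps(tap_times_s):
        cue = taps_to_onset_map(tap_times_s, tempo_bpm=tempo_bpm, bars=2)
        return generate_drums(context_stems=stem_paths, rhythm_prompt=cue)
    return on_taps
```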

Limitations & Future Work

  • Prompt granularity – the current encoder works best with relatively clean rhythmic cues; noisy beat‑boxing or heavily quantized taps can degrade alignment.
  • Genre coverage – training data is skewed toward Western popular music; underrepresented or highly polyrhythmic styles may need additional fine‑tuning.
  • Real‑time latency – while inference is fast, the end‑to‑end pipeline (audio capture → encoding → generation → rendering) still adds ~150 ms, which may be noticeable in live‑performance settings.

Future research directions include improving robustness to noisy rhythm inputs, extending the adapter‑based approach to other percussion instruments (e.g., congas, shakers), and integrating a latency‑optimized inference engine for truly interactive use.

Authors

  • Trey Brosnan

Paper Information

  • arXiv ID: 2601.02357v1
  • Categories: cs.SD, cs.AI, eess.AS
  • Published: January 5, 2026