[Paper] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Published: December 11, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.10943v1

Overview

AlcheMinT is a new framework that lets you tell a diffusion‑based video generator exactly when each subject should appear or disappear. By conditioning generation on explicit per‑subject timestamps, the model produces multi‑subject videos that stay faithful to each character’s look while following a user‑defined temporal script, opening the door to compositional video synthesis, storyboarding, and controllable animation.

Key Contributions

  • Timestamp‑conditioned generation – Introduces a positional‑encoding scheme that binds subject identities to specific time intervals inside the video (see the temporal‑script sketch after this list).
  • Lightweight integration – Implements the temporal control via token‑wise concatenation, avoiding extra cross‑attention layers and adding only a negligible number of parameters.
  • Subject‑descriptive tokens – Adds dedicated text tokens that reinforce the link between a subject’s visual identity and its caption, reducing ambiguity.
  • Comprehensive benchmark – Proposes evaluation metrics for multi‑subject identity preservation, overall video fidelity, and adherence to the temporal script.
  • State‑of‑the‑art quality – Matches or exceeds existing subject‑driven video personalization methods while delivering fine‑grained temporal control for the first time.
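
To make the timestamp conditioning concrete, here is a minimal sketch of what a user‑facing temporal script could look like. The class and field names (SubjectSpec, TemporalScript, start_frame, end_frame) are illustrative assumptions for this summary, not an interface described in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SubjectSpec:
    token: str            # subject-descriptive token, e.g. "<person_A>"
    reference_image: str  # path to the reference image that defines the subject's look
    start_frame: int      # first frame in which the subject should be visible
    end_frame: int        # last frame in which the subject should be visible

@dataclass
class TemporalScript:
    prompt: str
    num_frames: int
    subjects: list[SubjectSpec] = field(default_factory=list)

# Two subjects sharing a 49-frame clip, with person_B entering halfway through.
script = TemporalScript(
    prompt="<person_A> sits at a cafe table, then <person_B> walks in and joins them",
    num_frames=49,
    subjects=[
        SubjectSpec("<person_A>", "refs/person_a.png", start_frame=0, end_frame=48),
        SubjectSpec("<person_B>", "refs/person_b.png", start_frame=24, end_frame=48),
    ],
)
```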

Methodology

  1. Base model – Starts from a pretrained text‑to‑video diffusion model that already supports subject‑driven generation via learned subject embeddings.
  2. Temporal positional encoding – Extends the model’s existing positional embeddings with a timestamp encoding that maps each subject token to a start‑ and end‑frame interval. This encoding is attached to the subject token embeddings (see step 4) before they enter the diffusion UNet.
  3. Subject‑descriptive tokens – For every subject, a short textual token (e.g., <person_A>) is inserted into the prompt. These tokens are learned jointly with the subject embeddings, ensuring the model knows which visual appearance belongs to which timestamp.
  4. Token‑wise concatenation – The timestamp encoding and the subject token are concatenated at the token level, so the diffusion backbone sees a single enriched token stream. No extra attention modules are required, keeping the computational overhead minimal (a code sketch of this step follows the list).
  5. Training & fine‑tuning – The system is fine‑tuned on a curated dataset of short clips containing multiple subjects with known appearance intervals, using a standard diffusion loss plus a temporal consistency regularizer (one plausible form of the regularizer is sketched below).
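
Steps 2–4 can be pictured with the short PyTorch sketch below: a sinusoidal encoding of each subject's start and end frames is concatenated token‑wise with the subject embedding and projected back to the backbone width, so no new attention layers are introduced. The tensor shapes, the sinusoidal choice, and the small projection layer are assumptions made for illustration; the summary above only specifies that a timestamp positional encoding is combined with the subject tokens via token‑wise concatenation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(frames: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding of integer frame indices, shape (B, S) -> (B, S, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = frames.float().unsqueeze(-1) * freqs            # (B, S, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, S, dim)

class TimestampConditioner(nn.Module):
    """Illustrative module: binds each subject token to a [start, end] frame interval
    by token-wise concatenation, then projects back to the backbone's token width."""

    def __init__(self, token_dim: int = 1024, time_dim: int = 128):
        super().__init__()
        self.time_dim = time_dim
        # The only added parameters: one linear projection over the enriched tokens.
        self.proj = nn.Linear(token_dim + 2 * time_dim, token_dim)

    def forward(self, subject_tokens, start_frames, end_frames):
        # subject_tokens: (B, S, token_dim); start_frames / end_frames: (B, S) frame indices
        start_enc = sinusoidal_encoding(start_frames, self.time_dim)
        end_enc = sinusoidal_encoding(end_frames, self.time_dim)
        enriched = torch.cat([subject_tokens, start_enc, end_enc], dim=-1)
        # Output keeps the input token shape, so the enriched stream drops straight into
        # the existing diffusion backbone without extra attention modules.
        return self.proj(enriched)

# Example: two subjects with different appearance intervals in a 49-frame clip.
conditioner = TimestampConditioner()
tokens = torch.randn(1, 2, 1024)               # learned subject embeddings (placeholder)
start = torch.tensor([[0, 24]])
end = torch.tensor([[48, 48]])
conditioned = conditioner(tokens, start, end)  # (1, 2, 1024)
```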

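The training objective in step 5 is only described at a high level, so the regularizer below is a guess at one plausible form: given a per‑frame presence signal for each subject (for example, normalized cross‑attention mass on that subject's token, which is an assumption here), it penalizes presence outside the prescribed interval.

```python
import torch

def temporal_consistency_loss(presence: torch.Tensor,
                              start: torch.Tensor,
                              end: torch.Tensor) -> torch.Tensor:
    """One plausible regularizer (not the paper's exact formulation).

    presence:   (B, F, S) per-frame presence score for each subject, e.g. normalized
                cross-attention mass on the subject token (an assumed signal).
    start, end: (B, S) prescribed first/last frames for each subject.
    """
    B, F, S = presence.shape
    frames = torch.arange(F, device=presence.device).view(1, F, 1)
    inside = (frames >= start.view(B, 1, S)) & (frames <= end.view(B, 1, S))
    # Penalize any presence mass that leaks into frames where the subject should be absent.
    return (presence * (~inside)).mean()

# Total objective: standard denoising loss plus the temporal term (weight is a hyperparameter).
# loss = diffusion_mse + lambda_temporal * temporal_consistency_loss(presence, start, end)
```
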
Results & Findings

  • Visual fidelity – Measured by FVD (Fréchet Video Distance) and CLIP‑based image quality scores, AlcheMinT’s outputs are on par with the best subject‑personalized video generators.
  • Identity preservation – Across 5‑subject test videos, the average identity similarity (using a face/object encoder) improved by ~12 % compared to baselines that lack temporal control.
  • Temporal adherence – A newly introduced “timestamp accuracy” metric shows that > 90 % of frames respect the prescribed appearance intervals, whereas prior methods often bleed subjects across unintended frames (see the metric sketch after this list).
  • Parameter efficiency – The added timestamp and descriptive token embeddings increase the model size by < 0.5 %, and inference speed remains within 5 % of the original diffusion pipeline.
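
The “timestamp accuracy” figure can be read as the fraction of (frame, subject) pairs whose detected presence matches the script. The sketch below shows one way such a metric could be computed; the detections input and its format are hypothetical, and the paper's exact definition may differ.

```python
def timestamp_accuracy(detections: dict[int, set[str]], script) -> float:
    """Fraction of (frame, subject) pairs where detected presence matches the script.

    detections: frame index -> set of subject tokens detected in that frame
                (produced by an external face/object detector; hypothetical format).
    script:     a TemporalScript as in the earlier interface sketch.
    """
    correct, total = 0, 0
    for frame in range(script.num_frames):
        visible = detections.get(frame, set())
        for subj in script.subjects:
            expected = subj.start_frame <= frame <= subj.end_frame
            detected = subj.token in visible
            correct += int(expected == detected)
            total += 1
    return correct / total if total else 0.0
```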

Practical Implications

  • Storyboarding & Pre‑visualization – Filmmakers can script when characters enter a scene and instantly generate a rough video mock‑up, cutting down on manual layout work.
  • Dynamic Advertising – Brands can create personalized ads where a product appears exactly at the desired moment in a user‑generated clip.
  • Game Asset Animation – Developers can generate short cut‑scenes or UI animations that synchronize character appearances with narrative beats without hand‑animating each frame.
  • Educational Content – Instructors can produce tutorial videos where visual aids (e.g., diagrams, objects) pop in and out at precise timestamps, improving clarity.
  • Composable pipelines – Because AlcheMinT plugs into existing diffusion video generators with minimal overhead, it can be added to current production pipelines (e.g., Runway, Stability AI) without a full model rebuild.

Limitations & Future Work

  • Short clip focus – The current training data consists of clips ≤ 8 seconds; longer narratives may require hierarchical temporal modeling.
  • Subject count scaling – While 3‑5 subjects work well, handling dozens of concurrent identities still degrades identity fidelity.
  • Complex motion – Rapid, non‑linear motions (e.g., fast cuts, camera shakes) sometimes confuse the timestamp encoder, leading to slight temporal drift.
  • Future directions – The authors suggest extending the positional encoding to hierarchical time scales (scenes → shots), integrating audio cues for multimodal control, and exploring zero‑shot adaptation to new subjects without fine‑tuning.

Authors

  • Sharath Girish
  • Viacheslav Ivanov
  • Tsai‑Shien Chen
  • Hao Chen
  • Aliaksandr Siarohin
  • Sergey Tulyakov

Paper Information

  • arXiv ID: 2512.10943v1
  • Categories: cs.CV, cs.AI
  • Published: December 11, 2025
  • PDF: https://arxiv.org/pdf/2512.10943v1