[Paper] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Published: February 10, 2026 at 01:59 PM EST
4 min read
Source: arXiv


Overview

The paper introduces ConsID-Gen, a new image‑to‑video (I2V) generation system that can animate a single static picture into a coherent video while keeping the object’s identity intact across changing viewpoints. To tackle the notorious “appearance drift” problem, the authors also release a large‑scale, object‑centric dataset (ConsIDVid) and a dedicated benchmark (ConsIDVid‑Bench) for measuring multi‑view consistency.

Key Contributions

  • ConsIDVid dataset – a curated collection of high‑quality, temporally aligned videos with rich viewpoint variations, built via an automated pipeline.
  • ConsIDVid‑Bench – a benchmarking suite with novel metrics that are sensitive to subtle geometric and appearance changes, enabling fair evaluation of view‑consistency.
  • ConsID‑Gen architecture – a diffusion‑based transformer that fuses semantic (text) and geometric (auxiliary views) cues through a dual‑stream visual‑geometric encoder and a text‑visual connector.
  • State‑of‑the‑art results – ConsID‑Gen outperforms leading video generation models (e.g., Wan2.1, HunyuanVideo) on identity fidelity, temporal coherence, and multi‑view consistency.
  • Open‑source release – code, pretrained weights, and the dataset are made publicly available for reproducibility and downstream research.

Methodology

  1. Data augmentation with auxiliary views – besides the input image, the model receives a few unposed extra views of the same object (generated or retrieved). These provide implicit 3D cues without requiring explicit depth maps.
  2. Dual‑stream encoder
    • Semantic stream: encodes the textual instruction (e.g., “rotate the car 360°”).
    • Geometric stream: processes the stack of auxiliary images to capture viewpoint‑dependent structure.
      The two streams are merged by a text‑visual connector, producing a unified conditioning vector.
  3. Diffusion Transformer backbone – a latent diffusion model, guided by the conditioning vector, generates the video frames, with transformer attention capturing long‑range temporal dependencies across them.
  4. Training regime – the model is trained on ConsIDVid with a combination of reconstruction loss, identity‑preserving contrastive loss, and a view‑consistency regularizer that penalizes geometric drift across frames.
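As a rough illustration of steps 1–3, the sketch below fuses a pooled text embedding with pooled auxiliary‑view features through a single linear "connector" projection. The dimensions, mean pooling, and random weights are placeholder assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper).
d_text, d_geo, d_cond = 512, 768, 1024

def encode_text() -> np.ndarray:
    """Stand-in for the semantic stream: one pooled text embedding."""
    return rng.standard_normal(d_text)

def encode_views(n_views: int) -> np.ndarray:
    """Stand-in for the geometric stream: pool features over auxiliary views."""
    per_view = rng.standard_normal((n_views, d_geo))
    return per_view.mean(axis=0)  # simple mean pooling across unposed views

# Text-visual connector: here, a single linear projection applied to the
# concatenated streams (weights are random placeholders, not learned).
W = rng.standard_normal((d_text + d_geo, d_cond)) / np.sqrt(d_text + d_geo)

def connect(text_emb: np.ndarray, geo_emb: np.ndarray) -> np.ndarray:
    fused = np.concatenate([text_emb, geo_emb])
    return fused @ W  # unified conditioning vector for the diffusion backbone

cond = connect(encode_text(), encode_views(n_views=3))
print(cond.shape)  # (1024,)
```

The key design point this sketch mirrors is that semantic and geometric cues remain separate until a single fusion step produces one conditioning vector for the backbone.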

The pipeline stays fully differentiable, allowing end‑to‑end training and easy integration with existing diffusion‑based video generators.
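The three‑part training objective described above can be sketched as follows. The loss weights (0.5, 0.1), the embedding sizes, and the InfoNCE‑style form of the contrastive term are illustrative assumptions; the summary does not give the exact formulations:

```python
import numpy as np

def reconstruction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Per-pixel MSE between generated and ground-truth frames."""
    return float(np.mean((pred - target) ** 2))

def identity_contrastive_loss(frame_emb, ref_emb, neg_emb, tau=0.1) -> float:
    """InfoNCE-style term: a frame embedding should match the reference
    object embedding more closely than a different-object (negative) one."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(frame_emb, ref_emb) / tau)
    neg = np.exp(cos(frame_emb, neg_emb) / tau)
    return float(-np.log(pos / (pos + neg)))

def view_consistency_reg(frame_embs: np.ndarray) -> float:
    """Penalize drift: mean squared change between consecutive frame embeddings."""
    return float(np.mean((frame_embs[1:] - frame_embs[:-1]) ** 2))

rng = np.random.default_rng(0)
frames = rng.random((8, 32, 32, 3))   # 8 generated 32x32 RGB frames (toy data)
target = rng.random((8, 32, 32, 3))
embs = rng.standard_normal((8, 128))  # toy per-frame object embeddings

loss = (reconstruction_loss(frames, target)
        + 0.5 * identity_contrastive_loss(embs[0], embs[0], embs[1])
        + 0.1 * view_consistency_reg(embs))
print(round(loss, 3))
```

All three terms operate on differentiable quantities, which is what lets the pipeline train end to end.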

Results & Findings

| Metric | ConsID‑Gen | Wan2.1 | HunyuanVideo |
| --- | --- | --- | --- |
| Identity Preservation (ID‑F1) ↑ | 0.87 | 0.71 | 0.68 |
| View Consistency (VC‑Score) ↑ | 0.82 | 0.60 | 0.58 |
| Temporal Coherence (TC‑LPIPS) ↓ | 0.12 | 0.21 | 0.19 |
| FVD ↓ | 210 | 380 | 345 |

  • ConsID‑Gen consistently reduces appearance drift, even when the video requires large viewpoint changes (e.g., 360° rotations).
  • Qualitative examples show sharper edges, stable textures, and faithful preservation of object colors and shapes across frames.
  • Ablation studies confirm that both the auxiliary‑view input and the dual‑stream encoder contribute significantly to the gains.
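To make the identity‑fidelity numbers concrete, the sketch below computes a minimal stand‑in for an identity‑preservation score: the mean cosine similarity between each frame's object embedding and the reference image's embedding. The paper's actual ID‑F1 and VC‑Score are defined in ConsIDVid‑Bench; this is only an illustration of the shape such a metric takes:

```python
import numpy as np

def identity_score(frame_embs: np.ndarray, ref_emb: np.ndarray) -> float:
    """Mean cosine similarity of per-frame object embeddings to the
    reference image embedding; higher means less appearance drift."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb)
    return float(np.mean(f @ r))

rng = np.random.default_rng(1)
ref = rng.standard_normal(64)
stable = ref + 0.05 * rng.standard_normal((16, 64))   # low drift: near reference
drifting = rng.standard_normal((16, 64))              # high drift: unrelated
print(identity_score(stable, ref) > identity_score(drifting, ref))  # True
```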

Practical Implications

  • Product demos & AR/VR – developers can turn a single product photo into a rotating 3D‑like video for catalogs, virtual showrooms, or immersive experiences without costly multi‑camera rigs.
  • Content creation tools – integration into video editing suites (e.g., Adobe Premiere, After Effects) could let creators animate characters or objects from a single sketch while keeping the artist’s style intact.
  • Game asset pipelines – generate consistent animation loops for NPCs or items from concept art, reducing manual keyframe work.
  • Robotics & simulation – synthetic video data with accurate viewpoint changes can improve training of perception models that need to recognize objects from many angles.

Because the model runs on standard GPU‑accelerated diffusion frameworks, it can be deployed as a cloud service or an on‑premise plugin with modest hardware (e.g., a single RTX 3090 can generate a 5‑second 256×256 clip in under a minute).
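The quoted runtime implies a modest per‑frame budget. Assuming a 16 fps output rate (the summary does not state the frame rate), the arithmetic works out as:

```python
# Back-of-envelope throughput for the quoted 5-second, 256x256 clip.
# The 16 fps frame rate is an assumption, not stated in the summary.
fps, seconds, budget_s = 16, 5, 60
frames = fps * seconds              # 80 frames per clip
per_frame_budget = budget_s / frames
print(frames, per_frame_budget)     # 80 frames, 0.75 s/frame
```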

Limitations & Future Work

  • Auxiliary view requirement – the current pipeline assumes access to a few extra views; generating or retrieving these automatically for arbitrary objects remains an open challenge.
  • Resolution ceiling – experiments focus on 256×256 videos; scaling to high‑definition (1080p+) may need architectural tweaks and more compute.
  • Domain bias – ConsIDVid primarily contains everyday objects; performance on highly stylized or abstract imagery (e.g., cartoons) is not fully explored.
  • Future directions suggested by the authors include learning to synthesize auxiliary views on the fly, extending the model to handle dynamic backgrounds, and integrating explicit 3D priors (e.g., neural radiance fields) for even tighter geometry control.

Authors

  • Mingyang Wu
  • Ashirbad Mishra
  • Soumik Dey
  • Shuo Xing
  • Naveen Ravipati
  • Hansi Wu
  • Binbin Li
  • Zhengzhong Tu

Paper Information

  • arXiv ID: 2602.10113v1
  • Categories: cs.CV
  • Published: February 10, 2026