[Paper] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control appro...
Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bott...
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscri...
Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identi...
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection ...
Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under...
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding sched...
Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding...
We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is ch...
Microsoft Releases Phi‑4‑reasoning‑vision‑15B
Microsoft announced on Tuesday the launch of Phi‑4‑reasoning‑vision‑15B, a compact open‑weight multimodal AI mode...
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been develope...
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and π^3 have a computational cost that scales...