[Paper] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coh...
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coh...
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this e...
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain co...
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive sy...
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversaria...
Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent...
Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, express...
Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative fram...
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions of...
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack metho...
Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because...
In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are ofte...
We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer's universal segmentation architecture to geomet...
The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for a...
Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3D...
The corner shop that predicts your shopping habits better than Amazon. The local restaurant that automates its supply chain with the precision of McDonald's. Th...
Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring...
!Cover image for How to Fix Croanged Documents Before OCR Runshttps://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https...
Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI ta...
Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological scien...
Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consume...
Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their represen...
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially unde...
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that bu...