[Paper] 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layou...
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layou...
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties ...
While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis...
Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution mat...
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual qu...
Human Visual Processing vs. Machine Vision As humans, we can instantly recognize a cat, a dog, and a lady in an image. Our brains combine attention, memory, an...
Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable...
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched...
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over ...
Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally ...
Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully s...
Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multipl...
Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-...
In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks ...
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled ...
Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frame...
It is currently difficult to distill discrete diffusion models. In contrast, continuous diffusion literature has many distillation approaches methods that can r...
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geom...
The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Spla...
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising...
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual ...
Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are...
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and O...
Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While exist...
We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- refl...
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless ...
Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribut...
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the in...
Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods o...
Large vision--language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight conne...
Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high...
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as 'g...
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redund...
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process st...
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and...
In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and...
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior app...
Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models,...
Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remain...
Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect...
Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationall...
Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. ...
Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where capti...
Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended ho...
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work ...
Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achiev...
We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D...
Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black bo...