[Paper] Canvas-to-Image: Compositional Image Generation with Multimodal Controls
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, ...
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - human...
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We att...
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to syste...
Gliomas are a type of brain tumor with a high mortality rate, which makes early and accurate diagnosis important for therapeutic intervention....
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics probl...
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency wit...
Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation....
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-...
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benc...
Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of lab...
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally...
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or ...
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applicatio...
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the cons...
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces sign...
Bangla Sign Language Translation (BdSLT) has so far been severely constrained because the language itself is very low-resource. Standard sentence-level dataset creat...
Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development ...
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now em...
Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its im...
The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentat...
We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention...
We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the para...
Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning toke...
Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the subs...
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically u...
3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimiz...
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed b...
'Thinking with images' has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evi...
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions....
Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns wi...
Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is...
Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical da...
This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport i...
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operat...
Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforc...