[Paper] Uncertainty Quantification for Visual Object Pose Estimation
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics probl...
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics probl...
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency wit...
Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation....
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-...
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benc...
Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of lab...
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally...
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or ...
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applicatio...
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the cons...
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces sign...
Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creat...
Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development ...
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now em...
Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its im...
The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentat...
We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention...
We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the para...
Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning toke...
Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the subs...
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically u...
3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimiz...
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed b...
'Thinking with images' has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evi...