computer-vision — Page 26

5 months ago · ai · - · -

[Paper] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, ...

#image generation #diffusion models #multimodal control #computer vision #research
5 months ago · ai · - · -

[Paper] TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - human...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We att...

#research #paper #ai #machine-learning #nlp #computer-vision
5 months ago · ai · - · -

[Paper] Seeing without Pixels: Perception from Camera Trajectories

Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to syste...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Gliomas are brain tumors with a high mortality rate, which means early and accurate diagnosis is important for therapeutic intervention for the tumors....

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Uncertainty Quantification for Visual Object Pose Estimation

Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics probl...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency wit...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation....

#action-quality-assessment #causal-inference #video-analysis #computer-vision #long-term-temporal-modeling
5 months ago · ai · - · -

[Paper] Mechanisms of Non-Monotonic Scaling in Vision Transformers

Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] Qwen3-VL Technical Report

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benc...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] Active Learning for GCN-based Action Recognition

Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of lab...

#active learning #graph convolutional networks #action recognition #skeleton-based vision #computer vision
5 months ago · ai · - · -

[Paper] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or ...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applicatio...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the cons...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Video Generation Models Are Good Latent Reward Models

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces sign...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creat...

#sign-language #dataset #translation #computer-vision #benchmark
5 months ago · ai · - · -

[Paper] The Age-specific Alzheimer's Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development ...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now em...

#ensemble learning #remote sensing #foundation models #computer vision #sustainability
5 months ago · ai · - · -

[Paper] Self-Paced Learning for Images of Antinuclear Antibodies

Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its im...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Generalized Design Choices for Deepfake Detectors

The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentat...

#deepfake detection #computer vision #benchmarking #model optimization
5 months ago · ai · - · -

[Paper] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention...

#knowledge distillation #cross-attention #computer vision #model compression #deep learning
5 months ago · ai · - · -

[Paper] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the para...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] Frequency-Aware Token Reduction for Efficient Vision Transformer

Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning toke...

#vision transformers #token reduction #frequency-aware pruning #computer vision #model efficiency
5 months ago · ai · - · -

[Paper] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the subs...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically u...

#event cameras #hypergraph neural network #multimodal fusion #computer vision #deep learning
5 months ago · ai · - · -

[Paper] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimiz...

#equivariant neural networks #multimodal 3D reconstruction #point cloud processing #computer vision
5 months ago · ai · - · -

[Paper] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed b...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] Monet: Reasoning in Latent Visual Space Beyond Images and Language

'Thinking with images' has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evi...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions....

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns wi...

#4D Gaussian Splatting #endoscopic reconstruction #computer vision #depth estimation #real-time rendering
5 months ago · ai · - · -

[Paper] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is...

#research #paper #ai #computer-vision
5 months ago · ai · - · -

[Paper] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical da...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport i...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operat...

#research #paper #ai #machine-learning #computer-vision
5 months ago · ai · - · -

[Paper] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforc...

#research #paper #ai #nlp #computer-vision
