EUNO.NEWS
  • All (21181) +146
  • AI (3169) +10
  • DevOps (940) +5
  • Software (11185) +102
  • IT (5838) +28
  • Education (48)
  • Notice
  • 1 month ago · ai

    [Paper] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

    Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that bu...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] Improved Mean Flows: On the Challenges of Fastforward Generative Models

    MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in bo...

    #research #paper #ai #machine-learning #computer-vision
  • 1 month ago · ai

    [Paper] AirSim360: A Panoramic Simulation Platform within Drone View

    The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-sca...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] MV-TAP: Tracking Any Point in Multi-View Videos

    Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to ...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] Learning Visual Affordance from Audio

    We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that ...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies

    Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift whe...

    #research #paper #ai #machine-learning #computer-vision
  • 1 month ago · ai

    [Paper] Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

    GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong...

    #research #paper #ai #machine-learning #nlp #computer-vision
  • 1 month ago · ai

    [Paper] Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks

    Handling static images that lack inherent temporal dynamics remains a fundamental challenge for spiking neural networks (SNNs). In directly trained SNNs, static...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

    Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning trace...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] Video-CoM: Interactive Video Reasoning via Chain of Manipulations

    Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still 'think about videos', i.e., once a video is encoded, reasoning unf...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

    Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video gene...

    #research #paper #ai #computer-vision
  • 1 month ago · ai

    [Paper] Visual Generation Tuning

    Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned wi...

    #research #paper #ai #computer-vision

RSS GitHub © 2026