computer-vision

1 month ago · ai

[Paper] PPTArena: A Benchmark for Agentic PowerPoint Editing

We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast t...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai

[Paper] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coh...

#research #paper #ai #computer-vision
1 month ago · ai

[Paper] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this e...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai

[Paper] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain co...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai

[Paper] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive sy...

#research #paper #ai #computer-vision
1 month ago · ai

[Paper] SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversaria...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai

[Paper] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent...

#research #paper #ai #computer-vision
1 month ago · ai

[Paper] In-Context Sync-LoRA for Portrait Video Editing

Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, express...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai

[Paper] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative fram...

#research #paper #ai #computer-vision
1 month ago · ai

[Paper] InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions of...

#research #paper #ai #computer-vision
1 month ago · ai

[Paper] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack metho...

#research #paper #ai #nlp #computer-vision
1 month ago · ai

[Paper] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because...

#research #paper #ai #computer-vision
