computer-vision — Page 34

3 months ago · ai · - · -

[Paper] Improved Mean Flows: On the Challenges of Fastforward Generative Models

MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in bo...

#research #paper #ai #machine-learning #computer-vision
3 months ago · ai · - · -

[Paper] AirSim360: A Panoramic Simulation Platform within Drone View

The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-sca...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] MV-TAP: Tracking Any Point in Multi-View Videos

Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to ...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Learning Visual Affordance from Audio

We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that ...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies

Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift whe...

#research #paper #ai #machine-learning #computer-vision
3 months ago · ai · - · -

[Paper] Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong...

#research #paper #ai #machine-learning #nlp #computer-vision
3 months ago · ai · - · -

[Paper] Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks

Handling static images that lack inherent temporal dynamics remains a fundamental challenge for spiking neural networks (SNNs). In directly trained SNNs, static...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning trace...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Video-CoM: Interactive Video Reasoning via Chain of Manipulations

Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still 'think about videos', i.e., once a video is encoded, reasoning unf...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video gene...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Visual Generation Tuning

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned wi...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Object-Centric Data Synthesis for Category-level Object Detection

Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capab...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Physics-Informed Neural Networks for Thermophysical Property Retrieval

Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have...

#research #paper #ai #machine-learning #computer-vision
3 months ago · ai · - · -

[Paper] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis towar...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual ...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] MANTA: Physics-Informed Generalized Underwater Object Tracking

Underwater object tracking is challenging due to wavelength-dependent attenuation and scattering, which severely distort appearance across depths and water cond...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previo...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Optimizing Multimodal Language Models through Attention-based Interpretability

Modern large language models are becoming multimodal, analyzing various data formats such as text and images. While fine-tuning is effective for adapting these multimoda...

#research #paper #ai #nlp #computer-vision
3 months ago · ai · - · -

[Paper] Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilitie...

#research #paper #ai #machine-learning #nlp #computer-vision
3 months ago · ai · - · -

[Paper] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, ...

#image generation #diffusion models #multimodal control #computer vision #research
3 months ago · ai · - · -

[Paper] TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - human...

#research #paper #ai #machine-learning #computer-vision
3 months ago · ai · - · -

[Paper] G²VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We att...

#research #paper #ai #machine-learning #nlp #computer-vision
3 months ago · ai · - · -

[Paper] Seeing without Pixels: Perception from Camera Trajectories

Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to syste...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Gliomas are brain tumors with a high mortality rate, which makes early and accurate diagnosis important for therapeutic intervention....

#research #paper #ai #computer-vision
