computer-vision — Page 25

4 months ago · ai · - · -

[Paper] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the ...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] On the Temporality for Sketch Representation Learning

Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced si...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the sema...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consiste...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] OneThinker: All-in-one Reasoning Model for Image and Video

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, exi...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] PPTArena: A Benchmark for Agentic PowerPoint Editing

We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast t...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coh...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this e...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain co...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive sy...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversaria...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] In-Context Sync-LoRA for Portrait Video Editing

Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, express...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative fram...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions of...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack metho...

#research #paper #ai #nlp #computer-vision
4 months ago · ai · - · -

[Paper] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems

In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are ofte...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Layout Anything: One Transformer for Universal Room Layout Estimation

We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer's universal segmentation architecture to geomet...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for a...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis

Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3D...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

The Great Equaliser

The corner shop that predicts your shopping habits better than Amazon. The local restaurant that automates its supply chain with the precision of McDonald's. Th...

#AI democratization #small business AI #machine learning #natural language processing #computer vision #automation #enterprise AI tools
4 months ago · ai · - · -

[Paper] Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education

Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring...

#research #paper #ai #computer-vision
4 months ago · software · - · -

How to Fix Crooked Documents Before OCR Runs

#OCR #image preprocessing #document scanning #text extraction #computer vision #image correction #devtools
4 months ago · ai · - · -

[Paper] EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI

Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI ta...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Data-Centric Visual Development for Self-Driving Labs

Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological scien...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion

Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consume...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle...for now

Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their represen...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Generative Video Motion Editing with 3D Point Tracks

Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially unde...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that bu...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Improved Mean Flows: On the Challenges of Fastforward Generative Models

MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in bo...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] AirSim360: A Panoramic Simulation Platform within Drone View

The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-sca...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] MV-TAP: Tracking Any Point in Multi-View Videos

Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to ...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Learning Visual Affordance from Audio

We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that ...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies

Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift whe...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong...

#research #paper #ai #machine-learning #nlp #computer-vision
4 months ago · ai · - · -

[Paper] Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks

Handling static images that lack inherent temporal dynamics remains a fundamental challenge for spiking neural networks (SNNs). In directly trained SNNs, static...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning trace...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Video-CoM: Interactive Video Reasoning via Chain of Manipulations

Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still 'think about videos', i.e., once a video is encoded, reasoning unf...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video gene...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Visual Generation Tuning

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned wi...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Object-Centric Data Synthesis for Category-level Object Detection

Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capab...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Physics-Informed Neural Networks for Thermophysical Property Retrieval

Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis towar...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual ...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] MANTA: Physics-Informed Generalized Underwater Object Tracking

Underwater object tracking is challenging due to wavelength-dependent attenuation and scattering, which severely distort appearance across depths and water cond...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previo...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Optimizing Multimodal Language Models through Attention-based Interpretability

Modern large language models are becoming multimodal, analyzing various data formats such as text and images. While fine-tuning is effective for adapting these multimoda...

#research #paper #ai #nlp #computer-vision
