computer-vision — Page 7

1 month ago · ai · - · -

[Paper] 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layou...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties ...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Repurposing Geometric Foundation Models for Multi-view Diffusion

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution mat...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual qu...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

Teaching Machines to See (Part 1): Why Vision Is Hard

Human Visual Processing vs. Machine Vision: As humans, we can instantly recognize a cat, a dog, and a lady in an image. Our brains combine attention, memory, an...

#computer vision #OpenCV #image processing #machine learning #neural networks
1 month ago · ai · - · -

[Paper] MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over ...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation

Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally ...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] CoVR-R: Reason-Aware Composed Video Retrieval

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully s...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods

Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multipl...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering

Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Improving Image-to-Image Translation via a Rectified Flow Reformulation

In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks ...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled ...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] Adaptive Greedy Frame Selection for Long Video Understanding

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frame...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can r...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geom...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Matryoshka Gaussian Splatting

The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Spla...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual ...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] NavTrust: Benchmarking Trustworthiness for Embodied Navigation

There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and O...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While exist...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Under One Sun: Multi-Object Generative Perception of Materials and Illumination

We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- refl...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless ...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Spectrally-Guided Diffusion Noise Schedules

Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribut...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the in...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods o...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight conne...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as 'g...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redund...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Universal Skeleton Understanding via Differentiable Rendering and MLLMs

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process st...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior app...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] LoST: Level of Semantics Tokenization for 3D Shapes

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models,...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remain...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Versatile Editing of Video Content, Actions, and Dynamics without Training

Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationall...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception

Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. ...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where capti...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended ho...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] Demystifying Video Reasoning

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work ...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] MessyKitchens: Contact-rich object-level 3D scene reconstruction

Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achiev...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] SegviGen: Repurposing 3D Generative Model for Part Segmentation

We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black bo...

#research #paper #ai #machine-learning #computer-vision
