computer-vision — Page 4

1 week ago · ai · - · -

[Paper] SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters ...

#research #paper #ai #computer-vision
1 week ago · ai · - · -

[Paper] Neuron Populations Exhibit Divergent Selectivity with Scale

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as lo...

#research #paper #ai #machine-learning #nlp #computer-vision
1 week ago · ai · - · -

[Paper] PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Tran...

#research #paper #ai #computer-vision
1 week ago · ai · - · -

[Paper] NewtPhys: Do Foundation Models Understand Newtonian Physics?

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these ...

#research #paper #ai #computer-vision
1 week ago · ai · - · -

[Paper] Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and ...

#research #paper #ai #computer-vision
1 week ago · ai · - · -

[Paper] Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical ...

#research #paper #ai #machine-learning #computer-vision
1 week ago · ai · - · -

[Paper] RoboDream: Compositional World Models for Scalable Robot Data Synthesis

Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-c...

#research #paper #ai #computer-vision
1 week ago · ai · - · -

[Paper] ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire n...

#research #paper #ai #machine-learning #computer-vision
1 week ago · ai · - · -

[Paper] From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments ...

#research #paper #ai #computer-vision
1 week ago · ai · - · -

[Paper] AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video ML...

#research #paper #ai #machine-learning #nlp #computer-vision
1 week ago · ai · - · -

[Paper] Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points ...

#research #paper #ai #machine-learning #computer-vision
1 week ago · ai · - · -

[Paper] Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classif...

#research #paper #ai #machine-learning #computer-vision
1 week ago · ai · - · -

[Paper] What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generato...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) ...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of t...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Vision-Language Models Suppress Female Representations Under Ambiguous Input

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is kno...

#research #paper #ai #machine-learning #nlp #computer-vision
2 weeks ago · ai · - · -

[Paper] Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography

Postoperative pancreatic fistula (POPF) is a serious complication after pancreatic resection, increasing morbidity, hospital stay, and healthcare costs. We pres...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on real...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

Apple to showcase computer vision studies at annual conference in June

Apple has shared details of its participation...

#Apple #CVPR2025 #computer vision #machine learning research #AI conference #AMUSE #AToken #audio-visual benchmark #pattern recognition
2 weeks ago · ai · - · -

[Paper] GMOS: Grounding Moving Object Segmentation in 3D Space and Time

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two f...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which ...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] AdaState: Self-Evolving Anchors for Streaming Video Generation

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These ...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] NeuROK: Generative 4D Neural Object Kinematics

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simul...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] YoCausal: How Far is Video Generation from World Model? A Causality Perspective

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical tempor...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] GPIC: A Giant Permissive Image Corpus for Visual Generation

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of ...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Benchmarking Single-Factor Physical Video-to-Audio Generation

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Exis...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

Reconstructing physically stable 3D scenes from a single RGB image enables casual images to be converted into simulation-ready digital assets for applications s...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Colored Noise Diffusion Sampling

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency ...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Ciphera: A Decentralised Biometric Identity Framework

Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentral...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] From Pixels to Words -- Towards Native One-Vision Models at Scale

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework tha...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signa...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] HarmoVid: Relightful Video Portrait Harmonization

We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination inten...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similar...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Personal Visual Memory from Explicit and Implicit Evidence

Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are inc...

#research #paper #ai #nlp #computer-vision
2 weeks ago · ai · - · -

[Paper] OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist fou...

#research #paper #ai #machine-learning #nlp #computer-vision
2 weeks ago · ai · - · -

[Paper] Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-b...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias ...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] The Abstraction Gap in Vision-Language Causal Reasoning

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reaso...

#research #paper #ai #nlp #computer-vision
2 weeks ago · ai · - · -

[Paper] Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thin...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] GUI Agents for Continual Game Generation

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot t...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate ...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players cap...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple ...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Feedforward 3D Editing Learns from Semantic-Part Transformation

3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative parad...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in perso...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a ch...

#research #paper #ai #machine-learning #computer-vision
