computer-vision — Page 5

2 weeks ago · ai · - · -

[Paper] PARE: Pruning and Adaptive Routing for Efficient Video Generation

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. ...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in the conversio...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing sp...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through s...

#research #paper #ai #nlp #computer-vision
2 weeks ago · ai · - · -

[Paper] TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most exi...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of r...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing app...

#research #paper #ai #machine-learning #computer-vision
2 weeks ago · ai · - · -

[Paper] Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tunin...

#research #paper #ai #machine-learning #nlp #computer-vision
2 weeks ago · ai · - · -

[Paper] Helix4D: Complex 4D Mesh Generation

Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic me...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal s...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that th...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Global Structure-from-Motion Meets Feedforward Reconstruction

Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] InstructSAM: Segment Any Instance with Any Instructions

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate...

#research #paper #ai #computer-vision
2 weeks ago · ai · - · -

[Paper] Channel-wise Vector Quantization

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventi...

#research #paper #ai #machine-learning #computer-vision
3 weeks ago · ai · - · -

[Paper] Geo-Align: Video Generation Alignment via Metric Geometry Reward

Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Sup...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] ETCHR: Editing To Clarify and Harness Reasoning

Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grai...

#research #paper #ai #machine-learning #nlp #computer-vision
3 weeks ago · ai · - · -

[Paper] From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain

Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-...

#research #paper #ai #machine-learning #computer-vision
3 weeks ago · ai · - · -

[Paper] Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework

Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer fr...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we pro...

#research #paper #ai #machine-learning #computer-vision
3 weeks ago · ai · - · -

[Paper] LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulato...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Leveraging Foundation Models for Causal Generative Modeling

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus ...

#research #paper #ai #machine-learning #computer-vision
3 weeks ago · ai · - · -

[Paper] Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and findin...

#research #paper #ai #nlp #computer-vision
3 weeks ago · ai · - · -

[Paper] Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-p...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Cambrian-P: Pose-Grounded Video Understanding

Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art m...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) flee...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facil...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging ...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Spectral Tail Auxiliary Learning for AI-Generated Image Detection

As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increas...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] WorldKV: Efficient World Memory with World Retrieval and Compression

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a pr...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inerti...

#research #paper #ai #machine-learning #nlp #computer-vision
3 weeks ago · ai · - · -

[Paper] The Neglected Baseline in Model Interpretation

We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretatio...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Variance Reduction for Expectations with Diffusion Teachers

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teache...

#research #paper #ai #machine-learning #computer-vision
3 weeks ago · ai · - · -

[Paper] Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training....

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottl...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-w...

#research #paper #ai #machine-learning #computer-vision
3 weeks ago · ai · - · -

[Paper] Latent Dynamics for Full Body Avatar Animation

Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements de...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual obser...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfy...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics t...

#research #paper #ai #computer-vision
3 weeks ago · ai · - · -

[Paper] TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations,...

#research #paper #ai #machine-learning #computer-vision
