computer-vision — Page 6

0 month ago · ai · - · -

[Paper] Vega: Learning to Drive with Natural Language Instructions

Vision-language-action models have reshaped autonomous driving to incorporate language into the decision-making process. However, most existing pipelines only ...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accel...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] MegaFlow: Zero-Shot Large Displacement Optical Flow

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-spe...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automat...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] How good was my shot? Quantifying Player Skill Level in Table Tennis

Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the o...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repeti...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] PixelSmile: Toward Fine-Grained Facial Expression Editing

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) datas...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limit...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictor...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

BiRefNet vs rembg vs U2Net: Which Background Removal Model Actually Works in Production?

Background removal at scale: I've spent the last few months running background removal on tens of thousands of images through different models, and the differen...

#background removal #image segmentation #BiRefNet #rembg #U2Net #computer vision #production deployment #deep learning models
0 month ago · ai · - · -

Apple trained an AI that captions images better than models ten times its size

Apple resear...

#Apple #image captioning #dense captioning #RubiCap #reinforcement learning #multimodal AI #model efficiency #computer vision
0 month ago · ai · - · -

Augmenting citizen science with computer vision for fish monitoring

Background: Each spring, river herring populations migrate from Massachusetts coastal waters to begin their annual journey up rivers and streams to freshwater s...

#computer vision #citizen science #fish monitoring #environmental AI #marine conservation #underwater video analysis #MIT CSAIL #population dynamics
0 month ago · ai · - · -

[Paper] TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliab...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-info...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] Vision-Language Models vs Human: Perceptual Image Quality Assessment

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage aut...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and inst...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Ma...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible sema...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] Towards Training-Free Scene Text Editing

Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require t...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos ...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing...

#research #paper #ai #computer-vision
0 month ago · ai · - · -

[Paper] LensWalk: Agentic Video Understanding by Planning How You See in Videos

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods...

#research #paper #ai #machine-learning #computer-vision
0 month ago · ai · - · -

[Paper] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven ...

#research #paper #ai #machine-learning #nlp #computer-vision
0 month ago · ai · - · -

[Paper] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing metho...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

Ai2 Releases MolmoWeb: A Game-Changer for Visual Web Agents

Introduction: Imagine a personal assistant that can browse the internet, complete tasks, and interact with websites just like a human would. Ai2's recent releas...

#MolmoWeb #visual web agents #AI2 #AI assistants #web automation #computer vision #large language models
1 month ago · ai · - · -

[Paper] OccAny: Generalized Unconstrained Urban 3D Occupancy

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain gener...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifac...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial ...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approa...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generat...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this tas...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessar...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existin...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual ...

#research #paper #ai #nlp #computer-vision
1 month ago · ai · - · -

[Paper] VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce t...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models of...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challeng...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment de...

#research #paper #ai #nlp #computer-vision
1 month ago · ai · - · -

[Paper] WorldCache: Content-Aware Caching for Accelerated Video World Models

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse q...

#research #paper #ai #computer-vision
1 month ago · ai · - · -

[Paper] End-to-End Training for Unified Tokenization and Latent Denoising

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex st...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB imag...

#research #paper #ai #machine-learning #computer-vision
1 month ago · ai · - · -

[Paper] ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, ...

#research #paper #ai #machine-learning #nlp #computer-vision
1 month ago · ai · - · -

[Paper] DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VL...

#research #paper #ai #computer-vision
