computer vision — Page 14

2 months ago · ai · - · -

[Paper] Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Out-of-distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along fiv...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks invo...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction

Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale trainin...

#robotics #dexterous manipulation #computer vision #trajectory reconstruction #reinforcement learning
2 months ago · ai · - · -

[Paper] GEBench: Benchmarking Image Generation Models as GUI Environments

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, ...

#image-generation #benchmark #GUI #computer-vision #diffusion-models
2 months ago · ai · - · -

[Paper] WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned p...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

New Apple-backed AI model can generate sound and speech from silent videos

VSSFlow demo image: https://9to5mac.com/wp-content/uploads/sites/6/2026/02/vssflow-fi.jpg?quality=82&strip=all&w=1600 VSSFlow – A Unified Audio Generati...

#Apple #VSSFlow #video-to-sound #speech synthesis #multimodal AI #generative audio #computer vision
2 months ago · ai · - · -

Machine Learning Assignment - Postgraduate AI and Data Analysis - Wagner Pereira

What I Did and Why: In this assignment I practiced building an image recognition process using two different approaches to artificial intelli...

#machine learning #image recognition #Teachable Machine #Google AI Studio #computer vision #dataset creation #AI tools
2 months ago · ai · - · -

A Normalized Gaussian Wasserstein Distance for Tiny Object Detection

Overview: Finding tiny objects in images is challenging because they occupy only a few pixels and can be missed by methods that expect larger shapes. Traditiona...

#tiny object detection #Gaussian Wasserstein distance #computer vision #object detection metrics
2 months ago · ai · - · -

[Paper] SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated rea...

#research #paper #ai #machine-learning #nlp #computer-vision
2 months ago · software · - · -

Flappy Hand in 30 min with Copilot CLI

Overview: FlappyHand is a hands‑free interactive game inspired by the classic Flappy Bird. The character is controlled using hand gestures captured by your webc...

#GitHub Copilot CLI #Next.js #React #MediaPipe #computer vision #hand tracking #web app #flappy hand game
2 months ago · ai · - · -

[Paper] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, a...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for cons...

#video generation #diffusion models #implicit 3D representation #computer vision #scene encoding
2 months ago · ai · - · -

[Paper] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these ...

#robotics #self-supervised learning #world model #video pretraining #computer vision
2 months ago · ai · - · -

[Paper] Reliable Mislabel Detection for Video Capsule Endoscopy Data

The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Seeing Beyond Redundancy: Task Complexity's Role in Vision Token Specialization in VLLMs

Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark stu...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] PANC: Prior-Aware Normalized Cut for Object Segmentation

Fully unsupervised segmentation pipelines naively seek the most salient object, should this be present. As a result, most of the methods reported in the literat...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between t...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] Vision Transformer Finetuning Benefits from Non-Smooth Components

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. Ho...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art mode...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most m...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] A neuromorphic model of the insect visual system for natural image processing

Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for under...

#neuromorphic computing #spiking neural networks #self-supervised learning #computer vision #bio-inspired AI
2 months ago · ai · - · -

[Paper] Pseudo-Invertible Neural Networks

The Moore-Penrose Pseudo-inverse (PInv) serves as the fundamental solution for linear systems. In this paper, we propose a natural generalization of PInv to the...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Shared LoRA Subspaces for almost Strict Continual Learning

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forge...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most exi...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, ...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most exist...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goa...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a ...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] Context Forcing: Consistent Autoregressive Video Generation with Long Context

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-cont...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrog...

#research #paper #ai #machine-learning #computer-vision
2 months ago · software · - · -

Stop Copy-Pasting from Images: Build a Universal Screen Translator with Python

Lingo‑Live started with a frustration many of us have felt: trying to copy text from a YouTube video or any on‑screen content is impossible. Most of us end up e...

#python #screen-translator #ocr #computer-vision #desktop-app #hotkey #ui-design #translation-api
2 months ago · ai · - · -

[Paper] Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing

In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical ...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] Reinforced Attention Learning

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending th...

#research #paper #ai #machine-learning #nlp #computer-vision
2 months ago · ai · - · -

[Paper] CoWTracker: Tracking by Warping instead of Correlation

Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art tracke...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] Laminating Representation Autoencoders for Efficient Diffusion

Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. Howeve...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute e...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] PDF-HR: Pose Distance Fields for Humanoid Robots

Pose and motion priors play a crucial role in humanoid robotics. Although such priors have been widely studied in the human motion recovery (HMR) domain with a rang...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] LitS: A novel Neighborhood Descriptor for Point Clouds

With the advancement of 3D scanning technologies, point clouds have become fundamental for representing 3D spatial data, with applications that span across vari...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during th...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] Toward Reliable and Explainable Nail Disease Classification: Leveraging Adversarial Training and Grad-CAM Visualization

Human nail diseases are gradually observed over all age groups, especially among older individuals, often going ignored until they become severe. Early detectio...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas

Accurate risk stratification of precancerous polyps during routine colonoscopy screenings is essential for lowering the risk of developing colorectal cancer (CR...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] EventNeuS: 3D Mesh Reconstruction from a Single Event Camera

Event cameras offer a considerable alternative to RGB cameras in many scenarios. While there are recent works on event-based novel-view synthesis, dense 3D mesh...

#research #paper #ai #computer-vision
2 months ago · ai · - · -

[Paper] PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization

In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film's possibilities before full-scale production, yet conventio...

#research #paper #ai #machine-learning #computer-vision
2 months ago · ai · - · -

[Paper] AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a w...

#research #paper #ai #machine-learning #nlp #computer-vision
2 months ago · ai · - · -

[Paper] Continuous Control of Editing Models via Adaptive-Origin Guidance

Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly...

#research #paper #ai #computer-vision
