computer-vision — Page 21

4 months ago · ai · - · -

[Paper] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Exi...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Localising Shortcut Learning in Pixel Space via Ordinal Scoring Correlations for Attribution Representations (OSCAR)

Deep neural networks often exploit shortcuts. These are spurious cues which are associated with output labels in the training data but are unrelated to task sem...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

Myth: Computer Vision is only effective for images and not for videos

Myth: Computer Vision is only effective for images and not for videos. Reality: Computer Vision can handle both images and videos, thanks to advancements in tem...

#computer vision #video analysis #deep learning #temporal processing #AI myths
4 months ago · ai · - · -

[Paper] Application of deep learning approaches for medieval historical documents transcription

Handwritten text recognition and optical character recognition solutions show excellent results with processing data of modern era, but efficiency drops with La...

#research #paper #ai #machine-learning #nlp #computer-vision
4 months ago · ai · - · -

In Defense of the Triplet Loss for Person Re-Identification

Introduction: Person re-identification (re-ID) is the task of finding the same individual across different camera views. It has important applications in security...

#triplet loss #person re-identification #computer vision #deep learning #metric learning #end-to-end training
4 months ago · ai · - · -

Improved Baselines with Momentum Contrastive Learning

Overview: Teaching computers to recognize patterns without labeled data, known as unsupervised learning, has become more accessible thanks to simple tweaks to the...

#momentum contrast #MoCo #contrastive learning #unsupervised learning #data augmentation #baseline improvement #computer vision
4 months ago · ai · - · -

[Paper] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level re...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Dexterous World Models

Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largel...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Adversarial Robustness of Vision in Open Foundation Models

With the increase in deep learning, it becomes increasingly difficult to understand the model in which AI systems can identify objects. Thus, an adversary could...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Diffusion Forcing for Multi-Agent Interaction Sequence Modeling

Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans natura...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] RadarGen: Automotive Radar Point Cloud Generation from Cameras

We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training

Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures which makes them impractical....

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Visually Prompted Benchmarks Are Surprisingly Fragile

A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] InSPECT: Invariant Spectral Features Preservation of Diffusion Models

Modern diffusion models (DMs) have achieved state-of-the-art image generation. However, the fundamental design choice of diffusing data all the way to white noi...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Interpretable Plant Leaf Disease Detection Using Attention-Enhanced CNN

Plant diseases pose a significant threat to global food security, necessitating accurate and interpretable disease detection methods. This study introduces an i...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] PathBench-MIL: A Comprehensive AutoML and Benchmarking Framework for Multiple Instance Learning in Histopathology

We introduce PathBench-MIL, an open-source AutoML and benchmarking framework for multiple instance learning (MIL) in histopathology. The system automates end-to...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Generative Refocusing: Flexible Defocus Control from a Single Image

Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is sti...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference image...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Next-Embedding Prediction Makes Strong Vision Learners

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Inst...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps acros...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geomet...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] EasyV2V: A High-quality Instruction-based Video Editing Framework

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the desig...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] AdaTooler-V: Adaptive Tool-Use for Images and Videos

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interaction...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, pr...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from ...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] SFTok: Bridging the Performance Gap in Discrete Tokenizers

Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact la...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text seq...

#research #paper #ai #nlp #computer-vision
4 months ago · ai · - · -

[Paper] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise o...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Training Together, Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies

The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the s...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Spatia: Video Generation with Updatable Spatial Memory

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To ...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] In Pursuit of Pixel Supervision for Visual Pre-training

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging ...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding a...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering

We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile dev...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Multi-View Foundation Models

Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that i...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to comb...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent w...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected st...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Stylized Synthetic Augmentation further improves Corruption Robustness

This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-te...

#research #paper #ai #machine-learning #nlp #computer-vision
4 months ago · ai · - · -

[Paper] Human-like Working Memory from Artificial Intrinsic Plasticity Neurons

Working memory enables the brain to integrate transient information for rapid decision-making. Artificial networks typically replicate this via recurrent or par...

#research #paper #ai #machine-learning #computer-vision
4 months ago · software · - · -

The Hot-Reload Magic - Tweak Pipelines Live (No Restarts!)

Edit your config.toml while the app is running and watch the pipeline update instantly. No recompiling. No stopping the camera. Pure iteration bliss. Why This M...

#Go #GoCV #hot-reload #config.toml #fsnotify #computer-vision #pipeline #live-reload #OpenCV #devtools
4 months ago · ai · - · -

Data Annotation: Powering Accurate and Scalable AI Systems

Introduction: Data annotation is a foundational process in artificial intelligence that enables machines to learn from real-world data. It involves adding meani...

#data annotation #machine learning #training data #labeling #computer vision #natural language processing #speech recognition #AI model accuracy
4 months ago · ai · - · -

AI Background Remover: How AI Detects Objects and Separates Backgrounds

An AI background remover may feel like magic at first glance. You upload an image, click a button, and the background disappears. Behind that simple interaction...

#background removal #computer vision #image segmentation #machine learning #deep learning #AI tools
