Gemini 3 Flash’s new ‘Agentic Vision’ improves image responses
Agentic Vision is a new capability for the Gemini 3 Flash model to make image-related tasks more accurate by “grounding answers in visual evidence.”
Out-of-distribution (OOD) detection is a fundamental requirement for the reliable deployment of artificial intelligence applications in open-world environments....
Localization in agricultural environments is challenging due to their unstructured nature and lack of distinctive landmarks. Although agricultural settings have...
Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitab...
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and ...
Latent-space optimization methods for counterfactual explanations - framed as minimal semantic perturbations that change model predictions - inherit the ambigui...
Talking Head Generation aims at synthesizing natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods...
Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public saf...
Get Started With Image Classification in Kaggle using Python...
The Right Way to Measure Axiomatic Non‑Sensitivity: why your XAI metric might lie to you, and how we fixed it. If you’ve ever tried to actually measure how stab...
Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals...
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such co...
Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across...
In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metric...
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly re...
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and compu...
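The KV-cache pressure this item refers to is easy to quantify with standard transformer arithmetic. The sketch below is a generic back-of-the-envelope estimate with made-up hyperparameters (32 layers, 8 KV heads of dimension 128, a 32k-token multimodal sequence, fp16); none of these numbers come from the models in the item.

```python
# Back-of-the-envelope KV cache size for a transformer decoder.
# All hyperparameters below are illustrative assumptions, not figures
# from any specific VLM mentioned above.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Keys + values: 2 tensors per layer of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# One 32k-token sequence (e.g. long video or many image tiles) in fp16.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768, batch=1)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB for a single sequence
```

Even under these modest assumptions a single long multimodal sequence consumes several gigabytes of cache, which is why KV-cache compression for VLMs is an active topic.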
Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases ...
Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improv...
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metr...
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision a...
From RGB to Lab: Addressing Color Artifacts in AI Image Compos... A multi-tier approach to segmentation, color correction, and domain-specific enhancement.
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic conten...
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even ...
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder...
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-...
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature ...
In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausib...
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effecti...
Adaptive streaming has substantially improved video delivery over the past years. A balance among coding performance objectives such as bitrate, video qual...
Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approache...
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this require...
Artificial intelligence (AI) has the potential to transform medical imaging by automating image analysis and accelerating clinical research. However, research a...
Apple researchers have published a study about Manzano, a multimodal model that combines visual understanding and text-to-image generation, while significantly...
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on ...
Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its origina...
3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interact...
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of ...
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving perf...
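For readers unfamiliar with the term, the simplest form of post-training pruning is global magnitude pruning: zero out the smallest-magnitude weights of an already-trained layer. The sketch below shows only that textbook baseline; the excerpt does not describe the criterion this particular paper proposes.

```python
# Minimal unstructured magnitude pruning of a trained layer (PyTorch).
# Textbook baseline shown for illustration; not the method of the paper above.
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude, in place."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    w.mul_(w.abs() > threshold)

layer = nn.Linear(1024, 1024)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # ~0.5
```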
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal ...
Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i...
Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods r...
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, includin...
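As context, the standard CLIP zero-shot recipe the item alludes to scores an image against a set of text prompts and picks the highest-probability match. The snippet below uses the public Hugging Face `transformers` CLIP API; the checkpoint, image path, and class prompts are illustrative, and this is the common baseline rather than the method proposed in the paper.

```python
# Standard CLIP zero-shot classification with Hugging Face transformers.
# Checkpoint, image path, and label prompts are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # replace with a real image
labels = ["a photo of a snow leopard", "a photo of a house cat", "a photo of a lynx"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape [1, num_labels]
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Rare or fine-grained classes (snow leopard vs. lynx, for instance) are exactly where this prompt-matching recipe tends to break down.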
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed n...
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards ...
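Voting-based leaderboards of this kind typically aggregate pairwise human preferences with an Elo-style rating; that is an assumption here, since the excerpt does not say which aggregation scheme this work examines. A minimal sketch of a single Elo update:

```python
# One Elo update after a single head-to-head vote between two T2I models.
# Illustrative only: the excerpt above does not specify the rating scheme
# used by the leaderboards it studies.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1500.0, 1500.0, a_wins=True))  # (1516.0, 1484.0)
```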
While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex impl...
Introduction: I’ve always been fascinated by how deep learning can solve real‑world problems, and fruit disease detection seemed like the perfect challenge—not...
Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at sca...
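To make the idea of an invisible watermark concrete, the toy sketch below hides a bit pattern in the least-significant bit of each pixel channel. This is only a didactic illustration of an imperceptible embedded signal; the schemes platforms actually deploy for AI-generated images are far more robust to editing and compression, and nothing here reflects the attacks or defenses the item discusses.

```python
# Toy least-significant-bit (LSB) watermark in NumPy: embed one bit per
# pixel channel by overwriting the LSB, then read it back.
# Didactic only; not representative of production watermarking schemes.
import numpy as np

def embed(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """image: uint8 array; bits: 0/1 array with the same shape."""
    return (image & 0xFE) | bits.astype(np.uint8)

def extract(image: np.ndarray) -> np.ndarray:
    return image & 0x01

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
payload = rng.integers(0, 2, size=img.shape, dtype=np.uint8)

watermarked = embed(img, payload)
assert np.array_equal(extract(watermarked), payload)
print(np.abs(watermarked.astype(int) - img.astype(int)).max())  # at most 1 per channel
```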
Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reli...