[Paper] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected st...
This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer to address the vulnerability of...
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-te...
Working memory enables the brain to integrate transient information for rapid decision-making. Artificial networks typically replicate this via recurrent or par...
Edit your config.toml while the app is running and watch the pipeline update instantly. No recompiling. No stopping the camera. Pure iteration bliss. Why This M...
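One way such live reloading is commonly wired up is by watching the config file for writes and re-applying settings to the running pipeline. A minimal sketch in Swift, assuming an Apple-platform app using DispatchSource file monitoring; the path "config.toml" and the reload placeholder are illustrative, not taken from the article:

```swift
import Foundation

// Watch a config file and invoke a callback whenever it is written or replaced.
final class ConfigWatcher {
    private let source: DispatchSourceFileSystemObject

    init?(path: String, onChange: @escaping () -> Void) {
        let fd = open(path, O_EVTONLY)
        guard fd >= 0 else { return nil }
        source = DispatchSource.makeFileSystemObjectSource(
            fileDescriptor: fd,
            eventMask: [.write, .rename],   // editors often replace the file on save
            queue: .main
        )
        source.setEventHandler(handler: onChange)
        source.setCancelHandler { _ = close(fd) }
        source.resume()
    }

    deinit { source.cancel() }
}

// Usage: re-parse config.toml and push the new settings into the running
// pipeline without stopping the capture session (parsing is app-specific).
let watcher = ConfigWatcher(path: "config.toml") {
    // let config = parsePipelineConfig("config.toml")   // placeholder
    // pipeline.apply(config)
    print("config.toml changed – reloading pipeline settings")
}
```

The camera keeps running because only the pipeline parameters change; nothing here tears down or rebuilds the capture session.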
Introduction: Data annotation is a foundational process in artificial intelligence that enables machines to learn from real-world data. It involves adding meani...
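As a concrete illustration of what "adding meaningful labels" can look like in practice, here is a hypothetical annotation record for an object-detection dataset, sketched in Swift; the field names are invented for the example:

```swift
import Foundation

// A raw image plus the human-added labels that make it usable for training.
struct BoundingBox: Codable {
    let x: Double, y: Double, width: Double, height: Double   // normalized 0...1
}

struct ImageAnnotation: Codable {
    let imageURL: URL
    let label: String          // e.g. "pedestrian"
    let box: BoundingBox
    let annotatorID: String    // who added the label, useful for quality review
}
```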
An AI background remover may feel like magic at first glance. You upload an image, click a button, and the background disappears. Behind that simple interaction...
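The article does not describe the service's actual model, but one common on-device implementation path is subject segmentation followed by mask-based compositing. A rough sketch using Apple's Vision person segmentation as a stand-in for whatever network a real background remover runs:

```swift
import Vision
import CoreImage
import CoreVideo

// Segment the person in an image and composite them over a transparent background.
func removeBackground(from image: CGImage) throws -> CIImage {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    guard let mask = request.results?.first?.pixelBuffer else {
        throw NSError(domain: "BackgroundRemover", code: 1, userInfo: nil)
    }

    // Scale the low-resolution mask up to the source image, then blend the
    // subject over a clear background.
    let source = CIImage(cgImage: image)
    var maskImage = CIImage(cvPixelBuffer: mask)
    maskImage = maskImage.transformed(by: CGAffineTransform(
        scaleX: source.extent.width / maskImage.extent.width,
        y: source.extent.height / maskImage.extent.height))

    let clearBackground = CIImage(color: .clear).cropped(to: source.extent)
    return source.applyingFilter("CIBlendWithMask", parameters: [
        kCIInputMaskImageKey: maskImage,
        kCIInputBackgroundImageKey: clearBackground
    ])
}
```

The returned CIImage can then be rendered to PNG with a CIContext; server-side removers follow the same mask-and-composite pattern with larger segmentation models.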
Rendering camera video with Metal without AVCaptureVideoPreviewLayer. In this tutorial we will render the camera's video directly on screen using...
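The rest of the tutorial is cut off here, so the following is only a sketch of the usual setup for this approach: an AVCaptureVideoDataOutput delegate converts each camera frame into an MTLTexture through a CVMetalTextureCache, which an MTKView's draw loop can then display. The class name CameraTextureSource and the onTexture hook are illustrative:

```swift
import AVFoundation
import CoreVideo
import MetalKit

// Receives camera frames and wraps each one in an MTLTexture via a
// CVMetalTextureCache, instead of letting AVCaptureVideoPreviewLayer draw them.
// Assumes the AVCaptureVideoDataOutput is configured for kCVPixelFormatType_32BGRA.
final class CameraTextureSource: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let device: MTLDevice
    private var textureCache: CVMetalTextureCache?
    var onTexture: ((MTLTexture) -> Void)?

    init(device: MTLDevice) {
        self.device = device
        super.init()
        CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let cache = textureCache,
              let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        let width = CVPixelBufferGetWidth(pixelBuffer)
        let height = CVPixelBufferGetHeight(pixelBuffer)

        var cvTexture: CVMetalTexture?
        CVMetalTextureCacheCreateTextureFromImage(
            kCFAllocatorDefault, cache, pixelBuffer, nil,
            .bgra8Unorm, width, height, 0, &cvTexture)

        if let cvTexture, let texture = CVMetalTextureGetTexture(cvTexture) {
            // Hand the texture to the renderer, e.g. an MTKView draw call that
            // samples it with a simple fullscreen-quad pipeline.
            onTexture?(texture)
        }
    }
}
```

On the capture side, an AVCaptureSession with an AVCaptureVideoDataOutput feeds this delegate on a background queue; the MTKView then draws the latest texture each frame.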
The core challenge for streaming video generation is maintaining content consistency over long contexts, which places stringent requirements on the memory design. Mo...
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), ...
Non-parametric quantization has received much attention due to its parameter efficiency and scalability to large codebooks. In this paper, we present a uni...
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction reli...