computer-vision — Page 20

3 months ago · ai · - · -

[Paper] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of fo...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology,...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] Patch-Discontinuity Mining for Generalized Deepfake Detection

The rapid advancement of generative artificial intelligence has enabled the creation of highly realistic fake facial images, posing serious threats to personal ...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] SketchPlay: Intuitive Creation of Physically Realistic VR Content with Gesture-Driven Sketching

Creating physically realistic content in VR often requires complex modeling tools or predefined 3D models, textures, and animations, which present significant b...

#research #paper #ai #computer-vision
3 months ago · ai · - · -

[Paper] LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration

Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in view...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

WiFi DensePose: WiFi-based dense human pose estimation system through walls

Article URL: https://github.com/ruvnet/wifi-densepose Comments URL: https://news.ycombinator.com/item?id=46388904 Points: 10 Comments: 1...

#WiFi #DensePose #human pose estimation #computer vision #through walls #deep learning #open-source #research
4 months ago · ai · - · -

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

LAION-400M is a giant public resource designed to spark new ideas. It consists of about 400 million images paired with short captions, cleaned and CLIP‑filtered...

#LAION-400M #image-text dataset #CLIP-filtered #multimodal AI #open data #machine learning #computer vision
4 months ago · ai · - · -

AutoAugment: Learning Augmentation Policies from Data

Overview: AutoAugment is a method that automatically discovers effective image augmentation policies. By systematically testing many simple transformations, such...

#autoaugment #data augmentation #computer vision #image classification #machine learning #deep learning #neural networks
4 months ago · ai · - · -

[Paper] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, mak...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared ...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Streaming Video Instruction Tuning

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narro...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Fast SAM2 with Text-Driven Token Pruning

Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representat...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the obse...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, ex...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have be...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation id...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks

In hard-label black-box adversarial attacks, where only the top-1 predicted label is accessible, the prohibitive query complexity poses a major obstacle to prac...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] SemanticGen: Video Generation in Semantic Space

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While th...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] SpatialTree: How Spatial Abilities Branch Out in MLLMs

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierar...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Active Intelligence in Video Avatars via Closed-loop World Modeling

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term g...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] FedPOD: the deployable units of training for federated learning

This paper proposes FedPOD (Proportionally Orchestrated Derivative) for optimizing learning efficiency and communication cost in federated learning among multip...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Repurposing Video Diffusion Transformers for Robust Point Tracking

Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Exis...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark dec...

#research #paper #ai #machine-learning #nlp #computer-vision
4 months ago · ai · - · -

[Paper] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and su...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object g...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Snapshot 3D image projection using a diffractive decoder

3D image display is essential for next-generation volumetric imaging; however, dense depth multiplexing for 3D image projection remains challenging because diff...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and ...

#research #paper #ai #machine-learning #nlp #computer-vision
4 months ago · ai · - · -

[Paper] The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coor...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built o...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thi...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Zero-shot Reconstruction of In-Scene Object Manipulation from Video

We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed s...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence, crucial for robust and ground...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean im...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, ...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)

We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-b...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-la...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] MapTrace: Scalable Data Generation for Route Tracing on Maps

While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatia...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning

Recent breakthroughs in self-supervised Joint-Embedding Predictive Architectures (JEPAs) have established that regularizing Euclidean representations toward iso...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications

Overview: YOLOv6 is a new step in object detection designed for factories, stores, and cameras everywhere. Built by a team focused on speed and reliability, it...

#YOLOv6 #object detection #computer vision #real‑time AI #edge computing #industrial AI #open source
4 months ago · ai · - · -

[Paper] Point What You Mean: Visually Grounded Instruction Policy

Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on t...

#research #paper #ai #computer-vision
4 months ago · ai · - · -

[Paper] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional a...

#research #paper #ai #machine-learning #computer-vision
4 months ago · ai · - · -

[Paper] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost ...

#research #paper #ai #computer-vision
