[Paper] Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment
We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. G...
The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reaso...
The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that ...
Introduction A San Francisco‑based startup claims to be the first to create a biological computing platform built from living neurons (https://www.tomshardware.c...
The 15‑Year‑Old Code That Still Runs in Production Haar Cascades are everywhere. If you've ever used OpenCV's face detector, you've used a method published in...
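The durability of Haar cascades comes largely from the integral image (summed-area table), which lets any rectangular Haar feature be evaluated in constant time. A minimal sketch of that trick in plain Python, under the assumption of a 2D list of grayscale values (the function names `integral_image`, `rect_sum`, and `haar_two_rect` are illustrative, not OpenCV's API):

```python
def integral_image(img):
    """Build a summed-area table with a one-pixel zero border.

    ii[y][x] holds the sum of all pixels above and to the left of
    (x, y), so any rectangle sum later costs four lookups.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of img[y:y+h, x:x+w] via four corner lookups on the table."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    """A two-rectangle Haar-like feature: left half minus right half."""
    return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w // 2, h)

# Toy 4x4 image: bright left half, dark right half -> strong response.
img = [[9, 9, 0, 0] for _ in range(4)]
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 4, 4))
```

In OpenCV itself the whole trained detector is wrapped behind `cv2.CascadeClassifier`, loaded from a cascade XML file and applied with `detectMultiScale` on a grayscale image; the sketch above only shows the feature arithmetic that makes that cascade cheap enough to slide over every window.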
Overview This spring, a Southern California beach town will become the first city in the country where municipal parking‑enforcement vehicles use an AI system...
The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here,...
Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categor...
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods ...
Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue...
To validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Int...
Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They t...
With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) systems have gained popularity for their accurate recognition,...
Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and...
This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attenti...
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible...
As self-driving technology advances toward widespread adoption, determining safe operational thresholds across varying environmental conditions becomes critical...
Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel ...
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iterati...
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both...
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primaril...
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training co...
High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. ...
Waymo will begin fully autonomous operations with its 6th‑generation Driver — an important step in bringing our technology to more riders in more cities. This l...
Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intr...
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. V...
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, w...
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment...
We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods success...
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existin...
Biometric footstep recognition, based on a person's unique pressure patterns under their feet during walking, is an emerging field with growing applications in ...
We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational effi...
Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existin...
Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations fr...
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained ob...
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from u...
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work prese...
Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail t...
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they r...
Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including vide...
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geo...
We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions....
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confide...
Industrial Scale Deepfake Fraud Deepfake fraud has gone “industrial,” according to an analysis published by AI experts. Tools to create tailored, even personal...
Olympic figure skating looks effortless. Athletes sail across the ice, then soar into the air, spinning like a top, before landing on a single blade just 4‑5 mm...
Why We Need CNNs In this article, we will explore image classification using convolutional neural networks. For this, we will use a simple example: an X or an O....
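The core idea behind the X-vs-O example is that a small filter slid over the image responds strongly wherever its pattern appears. A minimal sketch in plain Python, assuming 5x5 binary grids and a hand-made diagonal filter (not the article's actual code or filter values):

```python
def convolve(img, kern):
    """Valid cross-correlation of a 2D grid with a small kernel."""
    kh, kw = len(kern), len(kern[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(img[i + di][j + dj] * kern[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

def score(img, kern):
    """Convolve, then global max pooling: the best match anywhere."""
    return max(max(row) for row in convolve(img, kern))

X = [[1, 0, 0, 0, 1],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [1, 0, 0, 0, 1]]

O = [[0, 1, 1, 1, 0],
     [1, 0, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [0, 1, 1, 1, 0]]

# A 3x3 main-diagonal filter: fires on the \-shaped strokes of an X.
diag = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]]

print(score(X, diag), score(O, diag))  # X matches the filter better
```

A real CNN learns many such filters from data instead of hand-crafting them, but the convolve-then-pool pipeline is the same mechanism.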
This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continu...
This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for the long-horizon, interactive video-based world models, enablin...