Improved Baselines with Momentum Contrastive Learning
Overview Teaching computers to recognize patterns without labeled data—known as unsupervised learning—has become more accessible thanks to simple tweaks to the...
Overview Teaching computers to recognize patterns without labeled data—known as unsupervised learning—has become more accessible thanks to simple tweaks to the...
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level re...
Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from...
Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largel...
With the increase in deep learning, it becomes increasingly difficult to understand the model in which AI systems can identify objects. Thus, an adversary could...
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans natura...
We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-...
Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures which makes them impractical....
A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK...
Modern diffusion models (DMs) have achieved state-of-the-art image generation. However, the fundamental design choice of diffusing data all the way to white noi...
Plant diseases pose a significant threat to global food security, necessitating accurate and interpretable disease detection methods. This study introduces an i...
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can...