[Paper] MMGR: Multi-Modal Generative Reasoning
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture p...
We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details p...
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only ...
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through env...
Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficien...
Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole...
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable constr...
Article URL: https://alpr.watch/ Comments URL: https://news.ycombinator.com/item?id=46290916 Points: 224 Comments: 114...
Fresh off releasing the latest version of its Olmo foundation model, the Allen Institute for AI (Ai2) launched its open-source video model, Molmo 2, on Tuesday, a...
AlphaFlow provides a smoother training schedule for MeanFlow image models, reducing the conflict between MeanFlow's two objectives and accelerating learning. Overview...
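The snippet stops before describing AlphaFlow's actual schedule, so the following toy PyTorch sketch only illustrates the general idea it names: blending MeanFlow's two objectives (a flow-matching term and an average-velocity consistency term) under a time-dependent weight. The tiny network, the cosine ramp alpha(), and the simplified mean-flow term are all assumptions for illustration, not the paper's method.

    import torch, torch.nn as nn

    # Toy stand-in for a velocity network u(z, r, t); the real model is a
    # large image backbone -- this 2-layer MLP is purely illustrative.
    v_net = nn.Sequential(nn.Linear(6, 64), nn.SiLU(), nn.Linear(64, 4))
    opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

    def alpha(step, total=1000):
        # Hypothetical smooth ramp from the flow-matching term toward the
        # mean-flow term; AlphaFlow's actual schedule may differ.
        return 0.5 * (1 - torch.cos(torch.tensor(step / total * 3.14159)))

    for step in range(100):
        x1 = torch.randn(32, 4)                 # data sample (toy)
        x0 = torch.randn(32, 4)                 # noise sample
        t = torch.rand(32, 1)
        r = torch.rand(32, 1) * t               # earlier time r <= t
        zt = (1 - t) * x0 + t * x1              # linear interpolation path
        target_v = x1 - x0                      # instantaneous velocity target

        # Objective 1: flow matching at r == t.
        u_tt = v_net(torch.cat([zt, t, t], dim=-1))
        loss_fm = ((u_tt - target_v) ** 2).mean()

        # Objective 2: a simplified mean-flow consistency term (the real
        # MeanFlow identity uses a JVP of u; this form is only a placeholder).
        u_rt = v_net(torch.cat([zt, r, t], dim=-1))
        loss_mf = ((u_rt - target_v.detach()) ** 2).mean()

        a = alpha(step)
        loss = (1 - a) * loss_fm + a * loss_mf  # scheduled blend of objectives
        opt.zero_grad(); loss.backward(); opt.step()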
Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the...
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains uncl...
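The abstract only poses the assembly question; as a generic illustration of the kind of hybrid block such studies compare, here is a minimal PyTorch sketch that pairs a pointwise convolution (a shared MLP over points) with multi-head self-attention. The class name, dimensions, and layout are hypothetical, not any specific paper's design.

    import torch, torch.nn as nn

    class HybridPointBlock(nn.Module):
        """Toy hybrid block: pointwise conv for per-point features,
        attention for global context. Purely illustrative."""
        def __init__(self, dim=64, heads=4):
            super().__init__()
            self.pw = nn.Conv1d(dim, dim, kernel_size=1)   # shared MLP over points
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x):                  # x: (batch, num_points, dim)
            h = self.pw(x.transpose(1, 2)).transpose(1, 2)  # pointwise conv
            x = self.norm1(x + h)                           # residual + norm
            a, _ = self.attn(x, x, x)                       # global self-attention
            return self.norm2(x + a)

    pts = torch.randn(2, 1024, 64)             # 2 clouds, 1024 points, 64-d features
    print(HybridPointBlock()(pts).shape)       # torch.Size([2, 1024, 64])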
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training p...
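For context on what "reconstruction-based training" of a visual tokenizer means here, the sketch below shows the standard VAE objective (reconstruction plus a small KL term). The toy MLP encoder/decoder and the KL weight are stand-ins, not the paper's setup.

    import torch, torch.nn as nn

    enc = nn.Linear(16, 8)                     # outputs [mu | logvar] for a 4-d latent
    dec = nn.Linear(4, 16)

    x = torch.randn(32, 16)
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
    recon = dec(z)

    loss_rec = ((recon - x) ** 2).mean()                   # reconstruction term
    loss_kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).mean()  # KL to N(0, I)
    loss = loss_rec + 1e-4 * loss_kl           # small KL weight, typical for tokenizers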
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to a...
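The abstract is cut off before RVM's details, so the following is only a toy sketch of the general pattern it names: masked-autoencoding over video frames with a transformer whose recurrent state token carries information from frame to frame. The masking ratio, state handling, and decoder stand-in are all assumptions.

    import torch, torch.nn as nn

    dim, n_tok = 64, 49                        # 7x7 patch grid per frame (toy)
    layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    decoder = nn.Linear(dim, dim)              # stand-in for the MAE decoder

    state = torch.zeros(1, 1, dim)             # recurrent state token
    video = torch.randn(8, 1, n_tok, dim)      # 8 frames of patch tokens

    for frame in video:
        keep = torch.rand(n_tok) > 0.75        # keep ~25% of patches (MAE-style)
        visible = frame[:, keep]               # drop masked tokens before encoding
        h = encoder(torch.cat([state, visible], dim=1))
        state = h[:, :1]                       # updated state carries across frames
        pred = decoder(h[:, 1:])               # reconstruct masked patches from here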
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited sce...
Recent feed-forward reconstruction models like VGGT and π^3 achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memor...
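To make the quadratic-memory claim concrete: global attention over all T frame tokens materializes a T x T score matrix, so cost grows with the square of video length, while a streaming variant that attends within a fixed window keeps a constant per-step footprint. The sketch below illustrates just that contrast; the window size and shapes are arbitrary.

    import torch

    def attn_scores(q, k):
        return (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5

    T, W, d = 1000, 16, 32
    tokens = torch.randn(T, d)

    full = attn_scores(tokens, tokens)               # (T, T): grows with video length
    window = attn_scores(tokens[-W:], tokens[-W:])   # (W, W): constant per step
    print(full.shape, window.shape)    # torch.Size([1000, 1000]) torch.Size([16, 16])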
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications,...
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical li...
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolu...
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm i...
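Since the abstract truncates before naming its fix, here is only a toy sketch of Textual Inversion's core loop: a single token embedding is optimized while the generator stays frozen, with a norm re-projection after each step to illustrate how an embedding-norm constraint could be enforced. The frozen-model stand-in, target_norm, and the re-projection rule are hypothetical, not the paper's method.

    import torch

    target_norm = 0.4                          # typical token-embedding scale (assumed)
    emb = torch.nn.Parameter(torch.randn(768) * 0.01)
    opt = torch.optim.AdamW([emb], lr=5e-3)

    frozen_proj = torch.randn(768, 768)        # stand-in for the frozen text-encoder path

    for step in range(200):
        # Stand-in for the diffusion denoising loss conditioned on a prompt
        # containing the learned token.
        loss = ((frozen_proj @ emb) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                  # re-project to a fixed norm each step
            emb.mul_(target_norm / emb.norm().clamp_min(1e-8))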
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We intro...
The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of inte...
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transfo...
Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundation models (PFMs), have shown ...