[Paper] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. W...
Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. W...
Agents capable of reasoning and planning in the real world require the ability of predicting the consequences of their actions. While world models possess this ...
Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Predicti...
MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D r...
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a ...
When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A...
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and ad...
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. Ho...
Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vi...
Apply the best methods from academia to get the most out of practical applications The post How to Improve the Performance of Visual Anomaly Detection Models ap...
Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature represent...
🍝 From Pixels to Calories – Multimodal AI & Automated Calorie Tracking We’ve all been there: staring at a delicious plate of pasta, trying to figure out if it...
Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamic...
Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or ...
Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide imag...
Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models i...
Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current meth...
Audio-video joint generation has progressed rapidly, yet substantial challenges still remain. Non-commercial approaches still suffer audio-visual asynchrony, po...
As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embo...
Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep l...
GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents...
Automated blood morphology analysis can support hematological diagnostics in low- and middle-income countries (LMICs) but remains sensitive to dataset shifts fr...
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoni...
Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the the primate ventral visual stream, yet they are intrins...