[Paper] Monet: Reasoning in Latent Visual Space Beyond Images and Language
'Thinking with images' has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evi...
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions....
Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns wi...
Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is...
Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical da...
This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport i...
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operat...
Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforc...