[Paper] CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Vision-Language Models (VLMs) achieve strong performance on spatial question-answering benchmarks, yet it remains unclear whether such gains reflect genuine spa...