[Paper] Mind the Gap: Resolving Performance Bottlenecks in Video Instance Segmentation
In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to perform...
Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficul...
In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A R...
Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering.
Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricat...
Although Vision-Language Models (VLMs) have demonstrated strong visual reasoning capabilities, their spatial reasoning remains largely confined to observation.
Multiple Instance Learning (MIL) addresses problems in which supervision is provided at the level of bags of instances, and it has been applied successfully across many fields.
Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice,...
Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-ba...
Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. Howev...
Learning-based Scene Graph Generation (SGG) models excel on frequent relation types, but their performance degrades sharply under annotation sparsity, so reliable ...
Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separate...
Image-to-Video diffusion models generate visually striking content from an input image, but they often produce motion that violates the laws of physics. We …
In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and t...
Video question answering (VideoQA) aims to answer questions about a given video. Existing approaches excel at factual VideoQA, but in-depth video…
Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean ...
Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, but they suffer from high inference latency.
Temporal Grounding (TG) aims to localize the video segments that correspond to a text query. Prior work has focused mainly on single-segment retrieval. In practice…
Robotic manipulation of textiles remains a challenging task because continuous deformation and self-occlusions hinder the robust visual perception needed for estimation.
Blind image restoration requires recovering clean images from observations corrupted by unknown and potentially mixed degradations. While recent deterministic f...
Point clouds are a fundamental sensory representation for robot perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and more.
Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multi...
Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extract...
Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP ...
We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity ...
Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-r...
Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Mos...
After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. Ho...
Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also l...
We propose a label-free approach for adapting powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning …
The Nancy Grace Roman Space Telescope (Roman), set for launch as early as September 2026, will conduct wide-field infrared imaging surveys with unprecedented sp...
Feed-forward 3D Gaussian Splatting methods reconstruct a scene in a single forward pass from either posed or unposed images, but in current approaches, one Gauss...
Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching has shown a superior visu...
Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standar...
This paper investigates 'free lunch' strategies to boost the performance of lidar semantic scene completion (SSC) without requiring complex architectural redesi...
Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters ...
We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as lo...
Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Tran...
Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these ...
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and ...
Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical ...
Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-c...
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire n...
Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments ...
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video ML...
Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points ...
Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classif...
Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification...