[Paper] Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation

Published: June 4, 2026 at 12:36 PM EDT

Source: arXiv - 2606.06369v1

Overview

Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model‑agnostic, semantically‑guided knowledge refinement framework that systematically mines commonsense‑grounded constraints from training data—capturing spatial, functional, and qualitative relational regularities—and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning‑based scene graph generation.

Key Contributions

Domain: cs.CV
The paper introduces a semantically‑guided knowledge refinement framework for SGG that:
- Mines commonsense constraints automatically from training data.
- Applies declarative reasoning to refine predictions at inference.
- Works model‑agnostically without retraining.
- Shows consistent gains across multiple benchmarks and architectures.

Methodology

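According to the abstract, the framework works in two steps. It first mines commonsense-grounded constraints from the training data, covering spatial, functional, and qualitative relational regularities. At inference time it then applies general declarative commonsense reasoning to correct and refine the ranked predictions of an existing SGG model, with no retraining and no hand-written rules. Please refer to the full paper for the detailed methodology.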
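To make the mine-then-refine idea concrete, here is a minimal sketch. It is not the authors' method: the paper uses declarative commonsense reasoning, whereas this toy stands in a simple frequency threshold for constraint mining and a score penalty for refinement. The function names `mine_constraints` and `refine`, the `min_support` and `penalty` parameters, and the example triples are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mine_constraints(train_triples, min_support=2):
    """Keep (subject, predicate, object) class triples seen at least min_support times.

    Assumption: a frequency threshold is a stand-in for the paper's constraint mining.
    """
    counts = Counter(train_triples)
    return {triple for triple, count in counts.items() if count >= min_support}


def refine(ranked, allowed, penalty=0.5):
    """Down-weight predictions whose triple was not mined as plausible, then re-sort.

    Assumption: a multiplicative penalty is a stand-in for declarative reasoning.
    """
    rescored = [
        (s, p, o, score if (s, p, o) in allowed else score * penalty)
        for s, p, o, score in ranked
    ]
    return sorted(rescored, key=lambda r: r[3], reverse=True)


# Hypothetical training annotations; the last one is a single noisy label.
train = [
    ("person", "riding", "horse"),
    ("person", "riding", "horse"),
    ("cup", "on", "table"),
    ("cup", "on", "table"),
    ("horse", "riding", "person"),
]
allowed = mine_constraints(train)

# Hypothetical ranked output of some SGG model: (subject, predicate, object, score).
ranked = [
    ("horse", "riding", "person", 0.60),
    ("person", "riding", "horse", 0.55),
]
refined = refine(ranked, allowed)
print(refined[0][:3])
```

The base model is treated as a black box here: only its ranked output is rescored, which mirrors the model-agnostic, no-retraining property claimed in the abstract.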

Practical Implications

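Because the framework refines ranked predictions at inference time and needs neither model retraining nor manual rule authoring, it can be layered on top of existing SGG models. The abstract reports consistent improvements over strong baselines on three standard benchmarks, which suggests commonsense reasoning is a practical way to make scene graph generation more robust for sparsely annotated relation types.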

Authors

Maëlic Neau
Salim Baloch
Jakob Suchan
Zoe Falomir
Mehul Bhatt

Paper Information

arXiv ID: 2606.06369v1
Categories: cs.CV
Published: June 4, 2026
