[Paper] Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Published: April 16, 2026 at 12:41 PM EDT

Source: arXiv - 2604.15210v1

Overview

The paper “Learning to Think Like a Cartoon Captionist: Incongruity‑Resolution Supervision for Multimodal Humor Understanding” proposes a new way to teach AI systems how to reason about jokes in cartoons, rather than just guessing the punchline. By breaking humor comprehension into explicit reasoning steps—spotting visual oddities, resolving them into a funny reinterpretation, and aligning with human preferences—the authors show that even modest‑sized models can rival much larger baselines on the New Yorker Cartoon Caption Contest (NYCC) benchmark.

Key Contributions

Incongruity‑Resolution Supervision (IRS): A training framework that supervises three interpretable sub‑tasks—incongruity detection, resolution generation, and preference alignment—mirroring how human captionists craft jokes.
Structured Reasoning Traces: Introduces annotated “reasoning traces” that make the hidden mental steps from image to caption visible to the model.
Scale‑agnostic Performance Gains: Demonstrates that 7 B, 32 B, and 72 B multimodal models trained with IRS consistently outperform larger, black‑box baselines on caption matching and ranking.
Zero‑Shot Transfer: Shows that the reasoning patterns learned on NYCC generalize to other humor datasets without additional fine‑tuning.
Human‑Level Ranking: The 72 B IRS‑trained model reaches near‑expert performance when ranking candidate captions, a first for open‑source multimodal humor systems.

Methodology

Dataset & Annotations
- Uses the NYCC corpus (thousands of New Yorker cartoons with multiple human‑written captions).
- Expert annotators decompose each caption into:
  - Incongruity: the visual element that “doesn’t fit.”
  - Resolution: the mental reinterpretation that makes the mismatch funny.
  - Preference: a rating of how well the resolution aligns with typical human humor judgments.
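The three-part decomposition above maps naturally onto a small record type. The following is a minimal sketch, not the authors' annotation schema: the class name `ReasoningTrace`, its field names, and the 0 to 1 preference scale are assumptions made for illustration, and the example cartoon and caption are invented rather than taken from NYCC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningTrace:
    """One annotated trace for a (cartoon, caption) pair.

    Field names and the 0-1 preference scale are illustrative assumptions,
    not the schema used in the paper.
    """

    cartoon_id: str
    caption: str
    incongruity: str   # the visual element that "doesn't fit"
    resolution: str    # the reinterpretation that makes the mismatch funny
    preference: float  # how well the resolution matches human humor judgments

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed scale early.
        if not 0.0 <= self.preference <= 1.0:
            raise ValueError("preference must lie in [0, 1]")


# Invented example, for illustration only (not taken from NYCC).
trace = ReasoningTrace(
    cartoon_id="example-001",
    caption="I said this meeting could have been an email.",
    incongruity="A fish in a business suit sits at a conference table.",
    resolution="The fish treats office routine as normal, so its complaint is mundane.",
    preference=0.7,
)
print(trace.incongruity)
```

Keeping the three components as separate fields is what lets each one serve as its own supervision target later in training.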
Model Architecture
- A standard vision‑language transformer (ViT‑based encoder + text decoder).
- Three heads are added to predict the three IRS components from the same multimodal representation.
Training Objective
- Incongruity loss: binary classification of visual regions that are incongruous.
- Resolution loss: sequence‑to‑sequence generation of the reinterpretation text.
- Preference loss: regression to the human rating, encouraging the model to prefer “funny” resolutions.
- The three losses are summed, forcing the model to learn a structured reasoning path rather than a single end‑to‑end mapping.
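The summed objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the summary only says "binary classification", "sequence-to-sequence generation", and "regression", so binary cross-entropy, token-level negative log-likelihood, and squared error are assumed here as the usual choices for those task types. All function names are hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def incongruity_loss(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over candidate image regions (assumed form)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def resolution_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference resolution tokens.

    token_probs[i] is the probability the decoder assigned to the i-th
    reference token (assumed teacher-forced decoding).
    """
    return -sum(math.log(max(p, EPS)) for p in token_probs) / len(token_probs)


def preference_loss(predicted: float, rating: float) -> float:
    """Squared error between predicted and human rating (assumed form)."""
    return (predicted - rating) ** 2


def irs_loss(
    region_probs: Sequence[float],
    region_labels: Sequence[int],
    token_probs: Sequence[float],
    predicted_pref: float,
    human_pref: float,
) -> float:
    """Unweighted sum of the three terms, as the summary describes."""
    return (
        incongruity_loss(region_probs, region_labels)
        + resolution_loss(token_probs)
        + preference_loss(predicted_pref, human_pref)
    )


# Toy numbers: two regions, a three-token resolution, one rating.
loss = irs_loss([0.9, 0.2], [1, 0], [0.6, 0.8, 0.5], 0.55, 0.7)
print(round(loss, 4))
```

Because the sum is unweighted, the term with the largest magnitude dominates the gradient; whether the paper rescales the terms is not stated in this summary.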
Evaluation
- Caption Matching: Given a cartoon, retrieve the exact human caption among distractors.
- Caption Ranking: Rank a set of candidate captions; measured by Kendall’s τ and human‑aligned scores.
- Zero‑Shot Tests: Apply the trained model to other humor benchmarks (e.g., meme captioning) without further fine‑tuning.
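The two main metrics above are simple to compute by hand. The sketch below assumes top-1 accuracy for caption matching and the tau-a variant of Kendall's τ for ranking; the summary does not say which τ variant or tie handling the paper uses, so that choice is an assumption, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def top1_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of cartoons where the retrieved caption is the human one."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def kendall_tau(model_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    pairs = list(combinations(range(len(model_scores)), 2))
    for i, j in pairs:
        # Positive product: both rankings order captions i and j the same way.
        sign = (model_scores[i] - model_scores[j]) * (human_scores[i] - human_scores[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Three candidate captions: the model and humans agree on 2 of 3 pairs.
print(kendall_tau([3, 2, 1], [3, 1, 2]))     # (2 - 1) / 3
print(top1_accuracy([0, 2, 1, 1], [0, 2, 3, 1]))  # 3 of 4 correct
```

A τ of 1 means the model orders every caption pair the way humans do, 0 means no association, and -1 means the order is fully reversed.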

Results & Findings

Model size	Baseline top‑1 match (no IRS)	IRS‑trained top‑1 match	Human expert (upper bound)
7 B	42 %	55 %	68 %
32 B	48 %	62 %	71 %
72 B	53 %	71 %	78 %

Caption Matching: IRS improves top‑1 accuracy by 13–18 percentage points across model sizes (42 → 55, 48 → 62, and 53 → 71 in the table above).
Ranking: The 72 B model reaches a Kendall’s τ of 0.62, within 5 % of expert human rankings.
Zero‑Shot: On an unseen meme‑caption dataset, IRS‑trained models gain +7 % F1 over the same architecture trained without IRS.
Ablation: Removing any one of the three supervision signals lowers performance by ~6 %, indicating that each stage of the reasoning pipeline contributes.
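The caption-matching figures in the table can be cross-checked with a few lines of arithmetic. The numbers below are copied from the table; the dictionary layout and the `summarize` helper are hypothetical and exist only for this check.

```python
# Top-1 caption-matching accuracy (%) copied from the results table.
results = {
    "7B": {"baseline": 42, "irs": 55, "human": 68},
    "32B": {"baseline": 48, "irs": 62, "human": 71},
    "72B": {"baseline": 53, "irs": 71, "human": 78},
}


def summarize(row):
    """Return (absolute gain, gap to human, relative gain in %) for one row."""
    gain = row["irs"] - row["baseline"]
    gap = row["human"] - row["irs"]
    relative = 100.0 * gain / row["baseline"]
    return gain, gap, relative


for size, row in results.items():
    gain, gap, rel = summarize(row)
    print(f"{size}: +{gain} points ({rel:.1f}% relative), {gap} points below human")
```

The absolute gains come out to 13, 14, and 18 points, and the remaining gap to the human upper bound shrinks from 13 points at 7 B to 7 points at 72 B.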

Practical Implications

Better Content Moderation & Generation: Systems that understand why something is funny can more reliably flag or generate humor that respects cultural norms, reducing accidental offense.
Creative AI Assistants: Cartoonists, meme creators, and ad copywriters can use IRS‑enhanced models as brainstorming partners that suggest punchlines grounded in visual cues, not just statistical guesses.
Explainable AI: The intermediate incongruity and resolution outputs serve as natural language explanations, making it easier for developers to debug or audit the model’s humor decisions.
Cross‑Domain Reasoning: Since the framework teaches a generic “detect‑mismatch‑resolve” pattern, it could be repurposed for other reasoning‑heavy tasks such as troubleshooting, code review, or legal argument generation.

Limitations & Future Work

Annotation Cost: Building the structured reasoning traces requires expert annotators, which may not scale to every domain.
Cultural Specificity: Humor is highly culture‑dependent; the current dataset reflects primarily Western, English‑speaking sensibilities, limiting global applicability.
Model Size vs. Data: While IRS narrows the gap, the largest models still outperform smaller ones, indicating that scaling still matters for nuanced humor.
Future Directions: The authors suggest exploring semi‑automatic trace generation, extending IRS to multimodal dialogues, and integrating user‑feedback loops to personalize humor styles.

Authors

Hatice Merve Vural
Doga Kukul
Ege Erdem Ozlu
Demir Ekin Arikan
Bob Mankoff
Erkut Erdem
Aykut Erdem

Paper Information

arXiv ID: 2604.15210v1
Categories: cs.AI, cs.CL
Published: April 16, 2026
