[Paper] Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding
Source: arXiv - 2604.15210v1
Overview
The paper “Learning to Think Like a Cartoon Captionist: Incongruity‑Resolution Supervision for Multimodal Humor Understanding” proposes a new way to teach AI systems how to reason about jokes in cartoons, rather than just guessing the punchline. By breaking humor comprehension into explicit reasoning steps—spotting visual oddities, resolving them into a funny reinterpretation, and aligning with human preferences—the authors show that even modest‑sized models can rival much larger baselines on the New Yorker Cartoon Caption Contest (NYCC) benchmark.
Key Contributions
- Incongruity‑Resolution Supervision (IRS): A training framework that supervises three interpretable sub‑tasks—incongruity detection, resolution generation, and preference alignment—mirroring how human captionists craft jokes.
- Structured Reasoning Traces: Introduces annotated “reasoning traces” that make the hidden mental steps from image to caption visible to the model.
- Scale‑agnostic Performance Gains: Demonstrates that 7 B, 32 B, and 72 B multimodal models trained with IRS consistently outperform larger, black‑box baselines on caption matching and ranking.
- Zero‑Shot Transfer: Shows that the reasoning patterns learned on NYCC generalize to other humor datasets without additional fine‑tuning.
- Human‑Level Ranking: The 72 B IRS‑trained model reaches near‑expert performance when ranking candidate captions, a first for open‑source multimodal humor systems.
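To make the idea of a structured reasoning trace concrete, the record below is a minimal sketch of what one annotated caption could look like. The class name, field names, the `[0, 1]` preference scale, and the example content are all assumptions made for illustration; the summary does not specify the paper's actual annotation schema, and the example is invented rather than taken from NYCC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningTrace:
    """Hypothetical record for one annotated caption (illustrative field names)."""
    cartoon_id: str
    caption: str
    incongruity: str   # the visual element that does not fit
    resolution: str    # the reinterpretation that makes the mismatch funny
    preference: float  # assumed here to be a rating normalized to [0, 1]

    def __post_init__(self):
        # The [0, 1] range is an assumption of this sketch, not a detail from the paper.
        if not 0.0 <= self.preference <= 1.0:
            raise ValueError("preference is assumed to lie in [0, 1]")


def to_training_text(trace: ReasoningTrace) -> str:
    """Serialize the trace in the order incongruity, resolution, caption."""
    return (
        f"Incongruity: {trace.incongruity}\n"
        f"Resolution: {trace.resolution}\n"
        f"Caption: {trace.caption}"
    )


# Invented example, not taken from the NYCC corpus.
example = ReasoningTrace(
    cartoon_id="demo-001",
    caption="Before we begin, who let the humans out?",
    incongruity="A dog in a business suit is chairing a board meeting.",
    resolution="The dog treats humans the way owners usually treat dogs.",
    preference=0.8,
)
print(to_training_text(example))
```

Serializing the steps in a fixed order is one simple way to expose the intermediate reasoning to a model as plain text; other encodings would work equally well.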
Methodology
- Dataset & Annotations
  - Uses the NYCC corpus (thousands of New Yorker cartoons with multiple human‑written captions).
  - Expert annotators decompose each caption into:
    - Incongruity: the visual element that “doesn’t fit.”
    - Resolution: the mental reinterpretation that makes the mismatch funny.
    - Preference: a rating of how well the resolution aligns with typical human humor judgments.
- Model Architecture
  - A standard vision‑language transformer (ViT‑based encoder + text decoder).
  - Three heads are added to predict the three IRS components from the same multimodal representation.
- Training Objective
  - Incongruity loss: binary classification of visual regions that are incongruous.
  - Resolution loss: sequence‑to‑sequence generation of the reinterpretation text.
  - Preference loss: regression to the human rating, encouraging the model to prefer “funny” resolutions.
  - The three losses are summed, forcing the model to learn a structured reasoning path rather than a single end‑to‑end mapping.
- Evaluation
  - Caption Matching: Given a cartoon, retrieve the exact human caption among distractors.
  - Caption Ranking: Rank a set of candidate captions; measured by Kendall’s τ and human‑aligned scores.
  - Zero‑Shot Tests: Apply the trained model to other humor benchmarks (e.g., meme captioning) without further fine‑tuning.
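The summed training objective described under Training Objective can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the summary says only that the three losses are summed, so the specific forms chosen here (binary cross-entropy, token-level negative log-likelihood, squared error), the unweighted sum, and all function names are assumptions.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def incongruity_loss(region_probs, region_labels):
    """Mean binary cross-entropy over visual regions (label 1 = incongruous)."""
    total = 0.0
    for p, y in zip(region_probs, region_labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(region_probs)


def resolution_loss(reference_token_probs):
    """Mean negative log-likelihood of the reference resolution tokens.

    reference_token_probs[i] is the probability the model assigned to the
    i-th token of the annotated resolution text.
    """
    nll = -sum(math.log(max(p, EPS)) for p in reference_token_probs)
    return nll / len(reference_token_probs)


def preference_loss(predicted_rating, human_rating):
    """Squared error between the predicted and the human rating."""
    return (predicted_rating - human_rating) ** 2


def irs_loss(region_probs, region_labels, reference_token_probs,
             predicted_rating, human_rating):
    """Unweighted sum of the three component losses (weights are assumed equal)."""
    return (
        incongruity_loss(region_probs, region_labels)
        + resolution_loss(reference_token_probs)
        + preference_loss(predicted_rating, human_rating)
    )


# One region predicted at 0.5, one token predicted at 0.5, rating off by 1.0.
print(irs_loss([0.5], [1], [0.5], 0.0, 1.0))
```

In practice the three terms would likely be computed on batched tensors and possibly weighted; the sketch keeps scalars and equal weights so that the structure of the objective stays visible.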
Results & Findings
| Model size | Baseline (no IRS), top‑1 match | IRS‑trained, top‑1 match | Human expert (upper bound) |
|---|---|---|---|
| 7 B | 42 % | 55 % | 68 % |
| 32 B | 48 % | 62 % | 71 % |
| 72 B | 53 % | 71 % | 78 % |
- Caption Matching: IRS improves top‑1 accuracy by 13–18 percentage points across model sizes.
- Ranking: The 72 B model reaches a Kendall’s τ of 0.62, within 5 % of expert human rankings.
- Zero‑Shot: On an unseen meme‑caption dataset, IRS‑trained models gain +7 % F1 over the same architecture trained without IRS.
- Ablation: Removing any of the three supervision signals drops performance by ~6 % each, confirming that the full reasoning pipeline is essential.
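For readers less familiar with the two headline metrics, the sketch below shows how top-1 matching accuracy and Kendall's τ can be computed. It uses the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; whether the paper uses tau-a, tau-b, or another variant is not stated in this summary, and the function names and toy inputs are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau(model_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(model_scores)
    if n < 2:
        raise ValueError("need at least two captions to rank")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A positive product means both scorers order captions i and j the same way.
        product = (model_scores[i] - model_scores[j]) * (human_scores[i] - human_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def top1_accuracy(predicted_indices, gold_indices):
    """Fraction of cartoons where the top-ranked caption is the human one."""
    if len(predicted_indices) != len(gold_indices):
        raise ValueError("index lists must have the same length")
    hits = sum(p == g for p, g in zip(predicted_indices, gold_indices))
    return hits / len(gold_indices)


# Toy inputs, not results from the paper.
print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))
print(top1_accuracy([0, 2, 1, 1], [0, 2, 0, 1]))
```

A τ of 1 means the model orders every pair of captions the same way as the human reference, 0 means no agreement beyond chance, and -1 means the order is fully reversed.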
Practical Implications
- Better Content Moderation & Generation: Systems that understand why something is funny can more reliably flag or generate humor that respects cultural norms, reducing accidental offense.
- Creative AI Assistants: Cartoonists, meme creators, and ad copywriters can use IRS‑enhanced models as brainstorming partners that suggest punchlines grounded in visual cues, not just statistical guesses.
- Explainable AI: The intermediate incongruity and resolution outputs serve as natural language explanations, making it easier for developers to debug or audit the model’s humor decisions.
- Cross‑Domain Reasoning: Since the framework teaches a generic “detect‑mismatch‑resolve” pattern, it could be repurposed for other reasoning‑heavy tasks such as troubleshooting, code review, or legal argument generation.
Limitations & Future Work
- Annotation Cost: Building the structured reasoning traces requires expert annotators, which may not scale to every domain.
- Cultural Specificity: Humor is highly culture‑dependent; the current dataset reflects primarily Western, English‑speaking sensibilities, limiting global applicability.
- Model Size vs. Data: While IRS narrows the gap, the largest models still outperform smaller ones, indicating that scaling still matters for nuanced humor.
- Future Directions: The authors suggest exploring semi‑automatic trace generation, extending IRS to multimodal dialogues, and integrating user‑feedback loops to personalize humor styles.
Authors
- Hatice Merve Vural
- Doga Kukul
- Ege Erdem Ozlu
- Demir Ekin Arikan
- Bob Mankoff
- Erkut Erdem
- Aykut Erdem
Paper Information
- arXiv ID: 2604.15210v1
- Categories: cs.AI, cs.CL
- Published: April 16, 2026