[Paper] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection
Source: arXiv - 2511.21066v1
Overview
Detecting sarcasm in text remains a hard problem for NLP systems, even with powerful pre‑trained language models (PLMs) and large language models (LLMs). This paper builds on a recent prompting technique called Pragmatic Metacognitive Prompting (PMP) and shows how adding contextual knowledge, retrieved from the web or elicited from the model's own parametric knowledge, can substantially improve sarcasm‑detection performance across several benchmark datasets.
Key Contributions
- Context‑aware prompting: Introduces a retrieval‑aware extension to PMP that supplies external background information when the model lacks the needed cultural or domain knowledge.
- Self‑knowledge awareness: Proposes a “self‑knowledge” strategy that asks the LLM to surface relevant facts it already knows, reducing reliance on external retrieval.
- Empirical gains: Achieves up to +9.87 macro‑F1 percentage points on an Indonesian Twitter sarcasm dataset and consistent improvements (≈3–4 points) on English‑language benchmarks (SemEval‑2018 Task 3, MUStARD).
- Open‑source pipeline: Releases code and data‑handling scripts, enabling reproducibility and easy integration into existing sarcasm‑detection workflows.
Methodology
- Base Prompt (PMP): The authors start with the existing Pragmatic Metacognitive Prompt, which frames sarcasm detection as a metacognitive reasoning task—asking the model to first consider the literal meaning, then the pragmatic (sarcastic) intent.
- Retrieval‑aware augmentation:
- Non‑parametric (web) retrieval: For each input sentence, a lightweight search engine fetches the top‑k web snippets that contain potentially relevant slang, cultural references, or obscure entities. These snippets are concatenated to the prompt as “background knowledge.”
- Self‑knowledge retrieval: The LLM is first queried with a meta‑prompt (“What facts do you know that could help interpret this sentence?”). Its own generated knowledge is then fed back into the main sarcasm‑detection prompt.
- Prompt composition: The final prompt consists of three parts: (a) the original PMP instruction, (b) the retrieved knowledge block, and (c) the target sentence; see the sketch after this list.
- Evaluation: Experiments run on three public sarcasm corpora using GPT‑3.5‑style LLMs via the OpenAI API. Macro‑F1 is the primary metric, reflecting balanced performance across sarcastic and non‑sarcastic classes.
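To make the self‑knowledge variant concrete, here is a minimal sketch assuming the official `openai` Python client and a GPT‑3.5‑style chat model; the prompt wording, function names, and label mapping are illustrative paraphrases, not the authors' released templates.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"  # a GPT-3.5-style chat model, as used in the paper

def ask(prompt: str) -> str:
    """Single-turn helper around the chat completions endpoint."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def elicit_self_knowledge(sentence: str) -> str:
    """Step 1 (self-knowledge retrieval): meta-prompt the model for facts it already knows."""
    return ask(
        "What facts do you know (slang, cultural references, named entities) "
        f"that could help interpret this sentence?\n\nSentence: {sentence}"
    )

def compose_prompt(sentence: str, knowledge: str) -> str:
    """Step 2 (prompt composition): PMP instruction + knowledge block + target sentence."""
    pmp_instruction = (  # paraphrase of the PMP instruction, not the authors' exact template
        "Analyse the sentence below for sarcasm. First state its literal meaning, "
        "then reason about the speaker's pragmatic intent, and finally answer "
        "'sarcastic' or 'not sarcastic'."
    )
    return f"{pmp_instruction}\n\nBackground knowledge:\n{knowledge}\n\nSentence: {sentence}"

def detect_sarcasm(sentence: str) -> str:
    """Step 3: run the composed prompt and crudely map the answer to a binary label."""
    answer = ask(compose_prompt(sentence, elicit_self_knowledge(sentence))).lower()
    return "not sarcastic" if "not sarcastic" in answer else "sarcastic"

if __name__ == "__main__":
    print(detect_sarcasm("Oh great, another Monday. Exactly what I needed."))
```

Swapping `elicit_self_knowledge` for a web‑search step that returns the top‑k snippets yields the non‑parametric variant; macro‑F1 over the resulting predictions can then be computed with, e.g., `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`.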
Results & Findings
| Dataset | Baseline PMP (macro‑F1) | + Non‑parametric retrieval | + Self‑knowledge retrieval |
|---|---|---|---|
| Twitter Indonesia Sarcastic | 62.3 % | 72.2 % (+9.87 pts) | – |
| SemEval‑2018 Task 3 | 78.1 % | – | 81.4 % (+3.29 pts) |
| MUStARD | 71.5 % | – | 75.6 % (+4.08 pts) |
- Context matters: Adding web‑sourced background dramatically helps when the text contains region‑specific slang or references unknown to the LLM.
- Self‑knowledge is complementary: Even without external retrieval, prompting the model to surface its own facts yields consistent gains, especially on English datasets where the LLM already has broader coverage.
- Error analysis: Remaining failures often involve multi‑turn sarcasm or heavily ambiguous humor that requires deeper discourse modeling beyond single‑sentence context.
Practical Implications
- Better moderation tools: Social‑media platforms can integrate the retrieval‑aware PMP pipeline to flag sarcastic or potentially toxic content more reliably, reducing false positives caused by literal‑meaning misinterpretations.
- Cross‑cultural chatbots: Customer‑service bots deployed in multilingual markets (e.g., Indonesia) can use the web‑retrieval component to stay up‑to‑date with local slang, improving user experience and avoiding miscommunication.
- Low‑resource adaptation: Since the approach relies on plug‑and‑play retrieval rather than fine‑tuning massive models, developers can retrofit existing LLM‑based pipelines with minimal compute overhead.
- Explainability: The retrieved snippets are visible to developers, offering a transparent “why” behind a sarcasm prediction—useful for audit trails and compliance.
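As one way to act on this, a deployment could log each prediction together with the snippets that were fed to the model; the record layout below is a purely hypothetical illustration (field names and format are not from the paper).

```python
import json
from datetime import datetime, timezone

def audit_record(sentence: str, label: str, snippets: list[str]) -> str:
    """Bundle a prediction with the retrieved evidence so reviewers can see why it was made."""
    return json.dumps(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "sentence": sentence,
            "prediction": label,
            "evidence": snippets,  # the background snippets that were included in the prompt
        },
        ensure_ascii=False,
        indent=2,
    )
```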
Limitations & Future Work
- Retrieval‑quality dependence: Noisy or irrelevant web snippets can hurt performance; the current system uses a simple BM25 ranker without sophisticated relevance feedback (see the sketch after this list).
- Latency overhead: Real‑time applications must balance the extra API calls for retrieval against response time constraints.
- Scope of evaluation: Experiments focus on three datasets; broader testing on multi‑turn dialogues and other languages is needed.
- Future directions: The authors plan to explore neural re‑ranking of retrieved documents, adaptive prompt length control, and integration with multi‑modal cues (e.g., emojis, images) to capture sarcasm that spans text and visual context.
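Since the first limitation above concerns the BM25 ranking step, here is a minimal sketch of how fetched web snippets might be ranked against an input sentence; the `rank_bm25` library, the whitespace/regex tokenisation, and the example snippets are illustrative assumptions rather than the authors' implementation.

```python
import re
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; good enough for a sketch."""
    return re.findall(r"\w+", text.lower())

def top_k_snippets(query: str, snippets: list[str], k: int = 3) -> list[str]:
    """Rank candidate web snippets against the input sentence with BM25 and keep the top-k."""
    bm25 = BM25Okapi([tokenize(s) for s in snippets])
    return bm25.get_top_n(tokenize(query), snippets, n=k)

# Example: keep the two snippets most relevant to a sentence containing Indonesian slang.
candidates = [
    "Mantap jiwa is Indonesian slang roughly meaning awesome.",
    "BM25 is a bag-of-words ranking function used in information retrieval.",
    "Macet is Indonesian for a traffic jam.",
]
print(top_k_snippets("mantap jiwa, macet lagi tiap pagi", candidates, k=2))
```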
Authors
- Michael Iskandardinata
- William Christian
- Derwin Suhartono
Paper Information
- arXiv ID: 2511.21066v1
- Categories: cs.CL, cs.AI
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21066v1