[Paper] No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Published: April 17, 2026 at 01:33 PM EDT
4 min read
Source: arXiv


Overview

The paper investigates how Large Language Models (LLMs) react to prompts that vary in politeness—from courteous to downright rude—across three languages (English, Hindi, Spanish) and five popular models. By systematically measuring changes in response quality, the authors show that tone isn’t a one‑size‑fits‑all factor: its impact depends on the language, the model, and the dialogue context.

Key Contributions

  • Cross‑lingual politeness benchmark (PLUM): a publicly released dataset of 1,500 human‑validated prompts covering five politeness levels in English, Hindi, and Spanish.
  • Large‑scale empirical study: 22,500 prompt‑response pairs evaluated on eight quality dimensions (coherence, clarity, depth, etc.).
  • Model‑specific tone sensitivity analysis: quantifies how each of the five examined LLMs (Gemini‑Pro, GPT‑4o Mini, Claude 3.7 Sonnet, DeepSeek‑Chat, Llama 3) reacts to polite vs. impolite inputs.
  • Hypothesis‑driven validation: tests six falsifiable predictions derived from classic politeness theory, providing a rigorous bridge between sociolinguistics and AI.
  • Actionable insights for developers: concrete recommendations on prompt phrasing per language and model to maximize response quality.

Methodology

  1. Prompt Design – Using Brown & Levinson’s Politeness Theory and Culpeper’s Impoliteness Framework, the authors crafted five tone categories (e.g., deferential, direct, assertive, rude). Prompts for each category were written in English, Hindi, and Spanish, yielding 1,500 unique prompts.
  2. Interaction Histories – For every prompt, three dialogue contexts were simulated: a raw context (no prior exchange), a polite history, and an impolite history, to capture how prior tone influences the next turn.
  3. Model Sampling – The prompts were fed to five state‑of‑the‑art LLMs via their public APIs. Each model generated a response, resulting in 22,500 prompt‑response pairs.
  4. Evaluation Framework – Human annotators rated each response on eight factors (coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, readability). Scores were normalized to produce an overall quality metric.
  5. Statistical Analysis – The authors computed per‑model and per‑language effect sizes, ran ANOVA tests, and examined interaction effects between tone, language, and model. They also checked the six theoretical hypotheses against the empirical data.
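Step 4's aggregation can be sketched as follows. The eight dimension names come from the paper, but the rating scale (1–5), the min–max normalization, the equal weighting, and the inversion of toxicity are all assumptions for illustration; the summary does not specify the exact scheme the authors used.

```python
# Sketch: fold eight per-dimension annotator ratings into one quality score.
# Dimension names are from the paper; the 1-5 scale, equal-weight averaging,
# and toxicity inversion are illustrative assumptions.

DIMENSIONS = ["coherence", "clarity", "depth", "responsiveness",
              "context_retention", "toxicity", "conciseness", "readability"]
SCALE_MIN, SCALE_MAX = 1, 5  # assumed Likert scale

def overall_quality(ratings: dict[str, float]) -> float:
    """Normalize each rating to [0, 1] and average; toxicity is inverted
    so that less toxic responses score higher."""
    total = 0.0
    for dim in DIMENSIONS:
        norm = (ratings[dim] - SCALE_MIN) / (SCALE_MAX - SCALE_MIN)
        if dim == "toxicity":
            norm = 1.0 - norm
        total += norm
    return total / len(DIMENSIONS)

# Hypothetical annotation for one prompt-response pair:
example = {"coherence": 5, "clarity": 4, "depth": 3, "responsiveness": 4,
           "context_retention": 5, "toxicity": 1, "conciseness": 4,
           "readability": 5}
print(round(overall_quality(example), 3))
```

The inversion step matters: without it, a highly toxic response would be rewarded rather than penalized in the combined metric.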

Results & Findings

  • Politeness boosts quality, but not uniformly – Polite prompts improve average response quality by up to ~11 % overall, while impolite prompts can degrade it by a similar margin.
  • Language‑specific sweet spots
    • English: courteous or neutral tones work best.
    • Hindi: deferential and indirect tones yield higher scores.
    • Spanish: assertive tones outperform the others.
  • Model‑level differences
    • Llama 3 shows the greatest sensitivity (≈ 11.5 % quality swing between most polite and most rude inputs).
    • GPT‑4o Mini remains relatively robust, with only ~3 % swing.
    • Claude 3.7 Sonnet and Gemini‑Pro sit in the middle.
  • Dialogue history matters – A polite prior exchange can partially mitigate the negative impact of a rude prompt, and vice‑versa.
  • Hypothesis outcomes – Four of the six sociolinguistic hypotheses were supported (e.g., “deferential language improves compliance in high‑context languages”), while two were rejected, highlighting gaps in current theory when applied to LLMs.
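The per‑model “quality swing” cited above (≈ 11.5 % for Llama 3, ~3 % for GPT‑4o Mini) can be read as the gap between mean quality under the most favorable and least favorable tones. A minimal sketch with invented scores, to make the metric concrete:

```python
# Sketch: quality swing between best and worst tone levels, per model.
# The score values below are invented for illustration; only the metric
# (gap between tone-level means, expressed in percentage points of the
# normalized 0-1 quality scale) mirrors the reported swings.

from statistics import mean

scores = {
    "model_a": {"deferential": [0.82, 0.85], "rude": [0.70, 0.74]},
    "model_b": {"deferential": [0.80, 0.81], "rude": [0.78, 0.79]},
}

def quality_swing(per_tone: dict[str, list[float]]) -> float:
    """Percentage-point gap between the best and worst tone means."""
    means = {tone: mean(vals) for tone, vals in per_tone.items()}
    return (max(means.values()) - min(means.values())) * 100

for model, per_tone in scores.items():
    print(f"{model}: {quality_swing(per_tone):.1f} pp swing")
```

A model with a small swing (like `model_b` here) would count as tone‑robust in the paper's sense.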

Practical Implications

  • Prompt engineering guidelines – Developers can tailor prompts to the target language and model: use deferential phrasing for Hindi‑centric applications, keep it assertive for Spanish, and stick to neutral courtesy for English.
  • Safety and toxicity mitigation – Knowing that impolite inputs can increase toxic outputs (especially in models like Llama 3) helps teams design front‑ends that automatically re‑phrase or flag hostile user language.
  • Customer‑support bots – By feeding a polite interaction history, bots can maintain higher response quality even when users become frustrated, improving user satisfaction.
  • Multilingual product rollout – Companies can prioritize models that are tone‑robust for languages where they expect a wide range of user politeness (e.g., GPT‑4o Mini for English‑heavy markets).
  • Benchmarking & monitoring – The PLUM corpus offers a ready‑made test suite for continuous evaluation of new model releases or fine‑tuned variants.
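One way to act on the safety finding above is to screen user messages before they reach a tone‑sensitive model, flagging or re‑phrasing hostile wording. The keyword heuristic below is a deliberately minimal stand‑in (the word list and re‑phrasing are placeholders, not the paper's method); the authors' future‑work section points toward trained politeness detectors for this role.

```python
# Sketch of a pre-processing guard that flags hostile user input before it
# reaches a tone-sensitive model. The marker list and the re-phrasing rule
# are placeholder heuristics for illustration only.

HOSTILE_MARKERS = {"stupid", "useless", "idiot", "shut up"}

def screen_prompt(prompt: str) -> tuple[str, bool]:
    """Return (possibly re-phrased prompt, was_flagged)."""
    lowered = prompt.lower()
    flagged = any(marker in lowered for marker in HOSTILE_MARKERS)
    if flagged:
        # Minimal mitigation: strip hostile words and prepend a neutral frame.
        for marker in HOSTILE_MARKERS:
            lowered = lowered.replace(marker, "")
        prompt = "Please answer the following: " + " ".join(lowered.split())
    return prompt, flagged

print(screen_prompt("This stupid bot never works, explain recursion"))
```

In production, the boolean flag could route the exchange to a polite‑history template, which the dialogue‑history finding suggests would partially offset the rude turn.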

Limitations & Future Work

  • Scope of languages – Only three languages were examined; results may not generalize to low‑resource or typologically distant languages.
  • Prompt diversity – While 1,500 prompts are sizable, they cover a limited set of domains (mostly informational queries). Real‑world conversational breadth could reveal different patterns.
  • Model versions – The study captures a snapshot of each model’s API at a single point in time; future updates may alter tone sensitivity.
  • Human annotation bias – Evaluators were native speakers but may still carry cultural biases that affect rating consistency.
  • Future directions – Extending PLUM to more languages, exploring tone effects in multimodal LLMs, and integrating automated politeness detectors into prompt‑preprocessing pipelines are suggested next steps.

Authors

  • Hitesh Mehta
  • Arjit Saxena
  • Garima Chhikara
  • Rohit Kumar

Paper Information

  • arXiv ID: 2604.16275v1
  • Categories: cs.CL
  • Published: April 17, 2026
