[Paper] No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Published: April 17, 2026 at 01:33 PM EDT
4 min read
Source: arXiv


Overview

The paper investigates how Large Language Models (LLMs) react to prompts that vary in politeness—from courteous to downright rude—across three languages (English, Hindi, Spanish) and five popular models. By systematically measuring changes in response quality, the authors show that tone isn’t a one‑size‑fits‑all factor: its impact depends on the language, the model, and the dialogue context.

Key Contributions

  • Cross‑lingual politeness benchmark (PLUM): a publicly released dataset of 1,500 human‑validated prompts covering five politeness levels in English, Hindi, and Spanish.
  • Large‑scale empirical study: 22,500 prompt‑response pairs evaluated on eight quality dimensions (coherence, clarity, depth, etc.).
  • Model‑specific tone sensitivity analysis: quantifies how each of the five examined LLMs (Gemini‑Pro, GPT‑4o Mini, Claude 3.7 Sonnet, DeepSeek‑Chat, Llama 3) reacts to polite vs. impolite inputs.
  • Hypothesis‑driven validation: tests six falsifiable predictions derived from classic politeness theory, providing a rigorous bridge between sociolinguistics and AI.
  • Actionable insights for developers: concrete recommendations on prompt phrasing per language and model to maximize response quality.

Methodology

  1. Prompt Design – Using Brown & Levinson’s Politeness Theory and Culpeper’s Impoliteness Framework, the authors crafted five tone categories (e.g., deferential, direct, assertive, rude). Prompts for each category were written in English, Hindi, and Spanish, yielding 1,500 unique prompts.
  2. Interaction Histories – For every prompt, three dialogue contexts were simulated: a raw context (no prior exchange), a polite history, and an impolite history, to capture how prior tone influences the next turn.
  3. Model Sampling – The prompts were fed to five state‑of‑the‑art LLMs via their public APIs. Each model generated a response, resulting in 22,500 prompt‑response pairs.
  4. Evaluation Framework – Human annotators rated each response on eight factors (coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, readability). Scores were normalized to produce an overall quality metric.
  5. Statistical Analysis – The authors computed per‑model and per‑language effect sizes, ran ANOVA tests, and examined interaction effects between tone, language, and model. They also checked the six theoretical hypotheses against the empirical data.
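Step 4's aggregation can be sketched as follows. The eight dimension names come from the paper, but the rating scale (1–5), the min–max normalization, the equal weighting, and the inversion of toxicity are all assumptions for illustration; the summary does not specify the exact scheme the authors used.

```python
# Sketch: fold eight per-dimension annotator ratings into one quality score.
# Dimension names are from the paper; the 1-5 scale, equal-weight averaging,
# and toxicity inversion are illustrative assumptions.

DIMENSIONS = ["coherence", "clarity", "depth", "responsiveness",
              "context_retention", "toxicity", "conciseness", "readability"]
SCALE_MIN, SCALE_MAX = 1, 5  # assumed Likert scale

def overall_quality(ratings: dict[str, float]) -> float:
    """Normalize each rating to [0, 1] and average; toxicity is inverted
    so that less toxic responses score higher."""
    total = 0.0
    for dim in DIMENSIONS:
        norm = (ratings[dim] - SCALE_MIN) / (SCALE_MAX - SCALE_MIN)
        if dim == "toxicity":
            norm = 1.0 - norm
        total += norm
    return total / len(DIMENSIONS)

# Hypothetical annotation for one prompt-response pair:
example = {"coherence": 5, "clarity": 4, "depth": 3, "responsiveness": 4,
           "context_retention": 5, "toxicity": 1, "conciseness": 4,
           "readability": 5}
print(round(overall_quality(example), 3))
```

The inversion step matters: without it, a highly toxic response would be rewarded rather than penalized in the combined metric.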

Results & Findings

  • Politeness boosts quality, but not uniformly – Polite prompts improve average response quality by up to ~11 % overall, while impolite prompts can degrade it by a similar margin.
  • Language‑specific sweet spots
    • English: courteous or neutral tones work best.
    • Hindi: deferential and indirect tones yield higher scores.
    • Spanish: assertive tones outperform the others.
  • Model‑level differences
    • Llama 3 shows the greatest sensitivity (≈ 11.5 % quality swing between most polite and most rude inputs).
    • GPT‑4o Mini remains relatively robust, with only ~3 % swing.
    • Claude 3.7 Sonnet and Gemini‑Pro sit in the middle.
  • Dialogue history matters – A polite prior exchange can partially mitigate the negative impact of a rude prompt, and vice‑versa.
  • Hypothesis outcomes – Four of the six sociolinguistic hypotheses were supported (e.g., “deferential language improves compliance in high‑context languages”), while two were rejected, highlighting gaps in current theory when applied to LLMs.
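The per‑model “quality swing” cited above (≈ 11.5 % for Llama 3, ~3 % for GPT‑4o Mini) can be read as the gap between mean quality under the most favorable and least favorable tones. A minimal sketch with invented scores, to make the metric concrete:

```python
# Sketch: quality swing between best and worst tone levels, per model.
# The score values below are invented for illustration; only the metric
# (gap between tone-level means, expressed in percentage points of the
# normalized 0-1 quality scale) mirrors the reported swings.

from statistics import mean

scores = {
    "model_a": {"deferential": [0.82, 0.85], "rude": [0.70, 0.74]},
    "model_b": {"deferential": [0.80, 0.81], "rude": [0.78, 0.79]},
}

def quality_swing(per_tone: dict[str, list[float]]) -> float:
    """Percentage-point gap between the best and worst tone means."""
    means = {tone: mean(vals) for tone, vals in per_tone.items()}
    return (max(means.values()) - min(means.values())) * 100

for model, per_tone in scores.items():
    print(f"{model}: {quality_swing(per_tone):.1f} pp swing")
```

A model with a small swing (like `model_b` here) would count as tone‑robust in the paper's sense.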

Practical Implications

  • Prompt engineering guidelines – Developers can tailor prompts to the target language and model: use deferential phrasing for Hindi‑centric applications, keep it assertive for Spanish, and stick to neutral courtesy for English.
  • Safety and toxicity mitigation – Knowing that impolite inputs can increase toxic outputs (especially in models like Llama 3) helps teams design front‑ends that automatically re‑phrase or flag hostile user language.
  • Customer‑support bots – By feeding a polite interaction history, bots can maintain higher response quality even when users become frustrated, improving user satisfaction.
  • Multilingual product rollout – Companies can prioritize models that are tone‑robust for languages where they expect a wide range of user politeness (e.g., GPT‑4o Mini for English‑heavy markets).
  • Benchmarking & monitoring – The PLUM corpus offers a ready‑made test suite for continuous evaluation of new model releases or fine‑tuned variants.
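One way to act on the safety finding above is to screen user messages before they reach a tone‑sensitive model, flagging or re‑phrasing hostile wording. The keyword heuristic below is a deliberately minimal stand‑in (the word list and re‑phrasing are placeholders, not the paper's method); the authors' future‑work section points toward trained politeness detectors for this role.

```python
# Sketch of a pre-processing guard that flags hostile user input before it
# reaches a tone-sensitive model. The marker list and the re-phrasing rule
# are placeholder heuristics for illustration only.

HOSTILE_MARKERS = {"stupid", "useless", "idiot", "shut up"}

def screen_prompt(prompt: str) -> tuple[str, bool]:
    """Return (possibly re-phrased prompt, was_flagged)."""
    lowered = prompt.lower()
    flagged = any(marker in lowered for marker in HOSTILE_MARKERS)
    if flagged:
        # Minimal mitigation: strip hostile words and prepend a neutral frame.
        for marker in HOSTILE_MARKERS:
            lowered = lowered.replace(marker, "")
        prompt = "Please answer the following: " + " ".join(lowered.split())
    return prompt, flagged

print(screen_prompt("This stupid bot never works, explain recursion"))
```

In production, the boolean flag could route the exchange to a polite‑history template, which the dialogue‑history finding suggests would partially offset the rude turn.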

Limitations & Future Work

  • Scope of languages – Only three languages were examined; results may not generalize to low‑resource or typologically distant languages.
  • Prompt diversity – While 1,500 prompts are sizable, they cover a limited set of domains (mostly informational queries). Real‑world conversational breadth could reveal different patterns.
  • Model versions – The study captures a snapshot of each model’s API at a single point in time; future updates may alter tone sensitivity.
  • Human annotation bias – Evaluators were native speakers but may still carry cultural biases that affect rating consistency.
  • Future directions – Extending PLUM to more languages, exploring tone effects in multimodal LLMs, and integrating automated politeness detectors into prompt‑preprocessing pipelines are suggested next steps.

Authors

  • Hitesh Mehta
  • Arjit Saxena
  • Garima Chhikara
  • Rohit Kumar

Paper Information

  • arXiv ID: 2604.16275v1
  • Categories: cs.CL
  • Published: April 17, 2026
