[Paper] DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting

Published: December 10, 2025
4 min read
Source: arXiv - 2512.09772v1

Overview

The paper investigates how large language models (LLMs) implicitly inherit cultural biases and how those biases can be shifted through the language of the prompt and a “cultural prompting” technique. Benchmarking popular models against Hofstede’s well‑known cultural dimensions, the authors show that several flagship LLMs are far more aligned with Western (U.S.) cultural norms than with Chinese ones unless specific prompting strategies are applied; the “WEIRD” of the title refers to Western, Educated, Industrialized, Rich, and Democratic populations.

Key Contributions

  • Cultural Benchmarking Framework – Adapts Hofstede’s VSM 2013 (Values Survey Module) questionnaire into a set of prompts that map LLM responses onto the six cultural dimensions (e.g., Individualism vs. Collectivism).
  • Cultural Prompting Strategy – Introduces a lightweight system prompt that explicitly tells the model to “think like a person from [Country]”, enabling on‑the‑fly cultural alignment without fine‑tuning (see the sketch after this list).
  • Cross‑Model Survey – Evaluates DeepSeek‑V3/V3.1 alongside OpenAI’s GPT‑4, GPT‑4o, GPT‑4.1, and GPT‑5 across English and Simplified Chinese prompts.
  • Empirical Findings – Shows that GPT‑5 and the DeepSeek models naturally mirror U.S. cultural scores, while only the GPT‑4 family (GPT‑4, GPT‑4o, GPT‑4.1) can be steered toward Chinese cultural profiles with the right prompt language or cultural prompt.
  • Open‑Source Toolkit – Releases the prompt sets and analysis scripts so other researchers and engineers can reproduce the cultural alignment tests on any LLM.
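
To make the cultural prompting setup concrete, here is a minimal sketch of the three conditions the paper compares: baseline, prompt language only, and language plus cultural prompt. It assumes the OpenAI Python SDK and a `gpt-4o` model name purely for illustration; the English survey question is the example quoted in the Methodology section below, and the Simplified Chinese wording is an illustrative translation rather than the authors’ exact text.

```python
# Minimal sketch of the three prompting conditions (not the authors' exact
# toolkit): (a) baseline English, (b) prompt-language only, (c) language +
# cultural prompt. Assumes the OpenAI Python SDK and an API key in the env.
from openai import OpenAI

client = OpenAI()

QUESTION_EN = ("When making a big decision, do you prefer to consult the "
               "group or rely on personal judgment?")
# Illustrative Simplified Chinese rendering of the same item.
QUESTION_ZH = "在做重大决定时，你更倾向于征求集体的意见，还是依靠个人判断？"

def ask(question: str, country: str | None = None, model: str = "gpt-4o") -> str:
    """Ask one survey item, optionally under a cultural system prompt."""
    messages = []
    if country is not None:
        messages.append({
            "role": "system",
            "content": (f"You are a resident of {country} and answer as a "
                        "typical person from that country."),
        })
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

baseline = ask(QUESTION_EN)                        # (a) baseline
language_only = ask(QUESTION_ZH)                   # (b) prompt language only
language_plus_culture = ask(QUESTION_ZH, "China")  # (c) language + cultural prompt
```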

Methodology

  1. Survey‑Based Prompt Design – The authors translated each item of Hofstede’s VSM 2013 survey into a question that an LLM can answer (e.g., “When making a big decision, do you prefer to consult the group or rely on personal judgment?”).
  2. Prompt Language Variation – Each question was asked in both English and Simplified Chinese to see how the model’s language context influences its cultural stance.
  3. Cultural Prompting – A short system prompt (“You are a resident of [Country] and answer as a typical person from that country”) was prepended to the query set, creating three conditions: (a) baseline, (b) language‑only, (c) language + cultural prompt.
  4. Scoring – Model answers were mapped back to Hofstede’s numeric scales (0–100) using a rule‑based classifier, allowing direct comparison with the real‑world survey averages for the U.S. and China.
  5. Statistical Comparison – Pearson correlation and mean absolute error (MAE) measured how closely each model’s “cultural fingerprint” matched the target country’s profile (a small scoring example follows this list).
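
As a rough illustration of steps 4–5, the sketch below compares a hypothetical six‑dimension fingerprint (values invented for illustration) against commonly published Hofstede scores for the U.S. and China using Pearson correlation and MAE; the paper’s exact reference values and scoring rules may differ.

```python
# Illustration of the scoring comparison: Pearson r and MAE between a model's
# six Hofstede dimension scores and a country's profile. The model scores are
# invented; the country profiles use commonly published Hofstede values and
# may differ from the paper's reference data.
import numpy as np
from scipy.stats import pearsonr

DIMENSIONS = ["PDI", "IDV", "MAS", "UAI", "LTO", "IVR"]

model_scores = np.array([52, 78, 60, 44, 35, 62], dtype=float)  # hypothetical

country_profiles = {
    "US":    np.array([40, 91, 62, 46, 26, 68], dtype=float),
    "China": np.array([80, 20, 66, 30, 87, 24], dtype=float),
}

for country, target in country_profiles.items():
    r, _ = pearsonr(model_scores, target)
    mae = float(np.mean(np.abs(model_scores - target)))
    print(f"{country}: r = {r:.2f}, MAE = {mae:.1f}")
```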

Results & Findings

| Model | Baseline Alignment (U.S.) | Baseline Alignment (China) | Effect of English Prompt | Effect of Chinese Prompt | Effect of Cultural Prompt |
| --- | --- | --- | --- | --- | --- |
| DeepSeek‑V3 / V3.1 | High (r≈0.78) | Low (r≈0.32) | Minimal shift | Minimal shift | No significant change |
| GPT‑5 | Very high (r≈0.84) | Low (r≈0.28) | Slight improvement for China | Slight improvement for China | Negligible |
| GPT‑4 | Moderate (r≈0.61) | Higher (r≈0.55) when prompted in English | Improves China alignment | Improves U.S. alignment | Shifts toward U.S. (r≈0.70) |
| GPT‑4o / GPT‑4.1 | Balanced (r≈0.65) | Balanced (r≈0.63) | Language determines direction (English → U.S., Chinese → China) | Same pattern, opposite direction | Strongest shift (up to ±15 points per dimension) |

Takeaways

  • The most powerful models (GPT‑5, DeepSeek‑V3) are “culturally hard‑wired” toward Western norms, likely reflecting the predominance of English‑centric training data.
  • Prompt language alone can nudge a model, but the effect is modest for the biggest models.
  • The cultural prompting technique is surprisingly effective for the newer, cheaper GPT‑4 variants, enabling developers to flip a model’s cultural bias with a single system message.

Practical Implications

  • Global Product Localization – Teams building chatbots or virtual assistants can use cultural prompting to make the same model sound “local” to users in different regions without maintaining separate fine‑tuned models.
  • Bias Auditing Tools – The benchmark can be integrated into CI pipelines to flag unintended cultural drift when models are updated or retrained (a minimal CI‑style check is sketched after this list).
  • Regulatory Compliance – In jurisdictions where cultural sensitivity is legally mandated (e.g., content moderation in China), a simple system prompt may satisfy compliance checks more cheaply than full model retraining.
  • Developer Experience – The approach works at inference time, meaning no additional compute cost beyond the extra tokens for the system prompt, which makes it well suited to latency‑sensitive SaaS APIs.
  • Cross‑Cultural UX Research – Product designers can experiment with different cultural prompts to gauge user reactions, enabling rapid A/B testing of culturally tailored dialogues.
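
For the bias‑auditing use case, one way to wire the benchmark into CI is a simple drift check. The sketch below assumes a hypothetical run_cultural_benchmark() helper standing in for the authors’ released scripts, plus a stored baseline fingerprint file; both names are illustrative, not part of the released toolkit.

```python
# Hypothetical CI gate: fail the build if a model's cultural fingerprint
# drifts too far from a stored baseline. run_cultural_benchmark() is a
# placeholder for the authors' released analysis scripts, not a real API.
import json
import sys

import numpy as np

DRIFT_THRESHOLD = 10.0  # tolerated mean absolute drift per dimension (0-100 scale)

def run_cultural_benchmark(model: str) -> np.ndarray:
    """Placeholder: would run the survey prompts and return six dimension scores."""
    raise NotImplementedError

def check_drift(model: str, baseline_path: str = "cultural_baseline.json") -> None:
    with open(baseline_path) as f:
        baseline = np.array(json.load(f)[model], dtype=float)
    current = run_cultural_benchmark(model)
    drift = float(np.mean(np.abs(current - baseline)))
    if drift > DRIFT_THRESHOLD:
        sys.exit(f"FAIL: cultural drift {drift:.1f} exceeds {DRIFT_THRESHOLD}")
    print(f"OK: cultural drift {drift:.1f} within threshold")

if __name__ == "__main__":
    check_drift(model=sys.argv[1])
```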

Limitations & Future Work

  • Survey Mapping Simplifications – Translating Hofstede’s Likert‑scale items into binary LLM answers introduces noise; a more nuanced scoring scheme could improve fidelity.
  • Model Scope – The study focuses on a handful of high‑profile LLMs; open‑source models (e.g., LLaMA, Mistral) remain untested.
  • Static Prompting – Cultural prompting is a one‑shot instruction; future work could explore dynamic, context‑aware cultural adaptation across multi‑turn conversations.
  • Cultural Granularity – Only U.S. vs. China were examined. Extending the framework to a broader set of cultures (e.g., India, Brazil) would validate generality.
  • Ethical Guardrails – Deliberately shifting cultural bias raises questions about manipulation and authenticity; the authors call for transparent disclosure when cultural prompting is used in production.

Authors

  • James Luther
  • Donald Brown

Paper Information

  • arXiv ID: 2512.09772v1
  • Categories: cs.CL
  • Published: December 10, 2025