[Paper] Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
Source: arXiv - 2604.24665v1
Overview
This study asks a surprisingly practical question: Do large language models (LLMs) understand that the trustworthiness of a source influences how Turkish speakers encode evidence in grammar? By pairing a classic psycholinguistic experiment with systematic LLM testing, the authors reveal a clear gap between human speakers and current AI—highlighting a blind spot that could affect any application that relies on nuanced language understanding.
Key Contributions
- Human baseline: Demonstrates that native Turkish speakers systematically shift between two past‑tense suffixes (‑DI vs. ‑mIş) depending on whether the information source is high‑trust or low‑trust.
- LLM evaluation framework: Introduces three prompting styles (open cloze, explicit past‑tense cloze, forced‑choice) to probe evidential reasoning across ten popular LLMs.
- Trust‑sensitivity analysis: Shows that only a few models exhibit weak, inconsistent trust‑driven effects, while most default to surface‑level suffix frequencies.
- Error taxonomy: Identifies common failure modes—prompt‑sensitivity, compliance issues, and strong base‑rate suffix preferences—that mask any genuine evidential reasoning.
- Open‑source resources: Releases the stimulus set, human response data, and evaluation scripts for reproducibility and future benchmarking.
Methodology
- Stimuli design: Crafted 120 Turkish cloze sentences whose missing verb must be completed with either the direct past suffix ‑DI (speaker‑committed, witnessed) or the indirect/evidential past suffix ‑mIş (source‑sensitive, hearsay or inference). The only manipulation was the perceived reliability of an overtly mentioned information source (e.g., “the reputable news agency” vs. “a rumor”).
- Human experiment: 60 native speakers completed a production task, typing the appropriate verb form. Responses were analyzed for the proportion of ‑DI vs. ‑mIş across trust conditions.
- LLM testing: The same items were fed to ten LLMs (including GPT‑4, Llama 2, and Claude) under three prompting regimes:
  - Open cloze: “… ___” (the model must generate the full verb).
  - Explicit past‑tense cloze: “… (past tense) ___”.
  - Forced‑choice A/B: “Select the more appropriate form: A) …‑DI or B) …‑mIş”.
- Analysis: Computed a trust‑effect size (the difference in suffix choice between high‑ and low‑trust contexts) and compared it to the human baseline; also measured compliance (whether the model obeyed the prompt format) and base‑rate bias (overall preference for one suffix). A minimal scoring sketch follows this list.
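The sketch below illustrates, in Python, how the three prompting regimes and the three metrics described above could be computed. The item fields, prompt wordings, and function names are illustrative assumptions, not the authors' released evaluation scripts.

```python
# Minimal sketch of the evaluation pipeline described above. Item fields,
# prompt wordings, and function names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Item:
    context: str    # sentence frame mentioning the source, with a gap for the verb
    verb_di: str    # candidate completion carrying -DI
    verb_mis: str   # candidate completion carrying -mIş
    trust: str      # "high" or "low" (perceived source reliability)

def build_prompts(item: Item) -> dict:
    """The three prompting regimes applied to one item."""
    return {
        "open_cloze": f"{item.context} ___",
        "past_cloze": f"{item.context} (past tense) ___",
        "forced_choice": (
            f"{item.context}\n"
            f"Select the more appropriate form: A) {item.verb_di}  B) {item.verb_mis}"
        ),
    }

def score(responses: list[dict]) -> dict:
    """responses: one dict per item with keys 'trust' ('high'/'low') and
    'choice' ('DI', 'MIS', or None when the model did not comply)."""
    valid = [r for r in responses if r["choice"] in ("DI", "MIS")]
    compliance = len(valid) / len(responses) if responses else 0.0

    def p_di(trust: str) -> float:
        subset = [r for r in valid if r["trust"] == trust]
        return sum(r["choice"] == "DI" for r in subset) / len(subset) if subset else 0.0

    return {
        "compliance": compliance,
        "base_rate_DI": (sum(r["choice"] == "DI" for r in valid) / len(valid)) if valid else 0.0,
        # Trust effect: how much more often -DI is chosen under high trust;
        # the human baseline reported in the paper is roughly 0.68 - 0.42 = 0.26.
        "trust_effect": p_di("high") - p_di("low"),
    }
```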
Results & Findings
- Human data: High‑trust contexts produced ≈68% ‑DI, while low‑trust contexts dropped to ≈42% ‑DI, a robust, statistically significant trust effect of roughly 26 percentage points.
- LLM behavior:
  - GPT‑4 showed only a tiny shift, in the opposite direction from humans (‑DI increased in low‑trust contexts), and only under forced‑choice prompts.
  - Llama 2‑Chat displayed a marginal shift in the human‑like direction under the explicit past‑tense prompt, but the effect vanished with the open cloze.
  - Most other models (Claude, Mistral, Gemma, etc.) ignored the trust cue entirely, defaulting to the more frequent suffix (‑DI ~70% of the time).
- Prompt dependence: The same model could flip its behavior across prompting styles, indicating that “understanding” is more about pattern matching to the prompt than genuine evidential reasoning.
- Error patterns: Frequent issues included generating unrelated words, refusing to fill the gap, or consistently picking the suffix with the highest overall frequency regardless of context.
Practical Implications
- NLP pipelines for Turkish: Systems that need to preserve or generate evidential nuances—e.g., automated journalism, legal document drafting, or sentiment analysis—cannot rely on off‑the‑shelf LLMs to respect source trust cues.
- Prompt engineering limits: Simply re‑phrasing a prompt won’t guarantee that a model incorporates pragmatic information like source reliability; developers may need task‑specific fine‑tuning or retrieval‑augmented approaches.
- Evaluation benchmarks: The paper’s benchmark can be repurposed as a sanity check for any multilingual LLM that claims “pragmatic awareness,” helping product teams catch hidden biases before deployment (a rough illustration follows this list).
- Human‑in‑the‑loop workflows: For high‑stakes domains (e.g., medical advice translation), a fallback to rule‑based or hybrid models may be necessary until LLMs can reliably handle source‑sensitive morphology.
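As one way to operationalize such a sanity check, the snippet below reuses the hypothetical score() helper sketched in the Methodology section to flag models whose trust effect falls well below the human baseline. The 26‑point baseline comes from the human data above; the 50% threshold is an arbitrary illustrative choice.

```python
# Hypothetical pre-deployment sanity check built on the score() sketch above.
HUMAN_TRUST_EFFECT = 0.26  # human baseline: roughly 0.68 - 0.42 in -DI choice

def passes_trust_sanity_check(responses: list[dict],
                              min_fraction_of_human: float = 0.5) -> bool:
    """Fail any model whose trust effect is far below the human baseline."""
    return score(responses)["trust_effect"] >= min_fraction_of_human * HUMAN_TRUST_EFFECT
```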
Limitations & Future Work
- Language scope: The study focuses solely on Turkish evidential morphology; results may not generalize to other languages with different evidential systems.
- Model diversity: Only ten publicly available LLMs were tested; newer or proprietary models could behave differently.
- Prompt granularity: While three prompting styles were explored, more nuanced instruction tuning (e.g., chain‑of‑thought, few‑shot examples) might yield stronger trust sensitivity.
- Future directions: Extending the benchmark to other pragmatic phenomena (e.g., politeness, modality), incorporating few‑shot fine‑tuning, and investigating retrieval‑augmented generation as a way to inject source reliability information.
Authors
- Sercan Karakaş
- Yusuf Şimşek
Paper Information
- arXiv ID: 2604.24665v1
- Categories: cs.CL, cs.AI
- Published: April 27, 2026