[Paper] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Source: arXiv - 2511.21086v1
Overview
The paper Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models investigates how well today’s large language models (LLMs) solve word‑puzzle tasks that impose strict character‑level constraints (e.g., “fill in the blanks” while preserving spelling). By testing 28 model configurations across three families (Qwen‑3, Claude Haiku‑4.5, and GPT‑5‑mini), the authors show that systematic architectural differences dominate performance, far outweighing the gains from simply scaling model size.
Key Contributions
- Cross‑architecture benchmark: Introduced a 58‑puzzle suite that forces models to satisfy hard orthographic constraints, a setting rarely covered in standard LM evaluations.
- Large‑scale comparative study: Ran 28 configurations (three families, multiple parameter counts) to isolate the impact of architecture vs. parameter scaling.
- Quantified architectural advantage: Found a 2.0–2.2× performance gap (F1 = 0.761 vs. 0.343) between the best and worst families, dwarfing the 83 % gain from an eight‑fold parameter increase within a single family.
- Thinking‑budget analysis: Showed heterogeneous returns to larger “thinking budgets” (more reasoning tokens generated before the final answer); high‑capacity models improve (+0.102 to +0.136 F1) while mid‑size models plateau or even regress.
- Human‑difficulty calibration: Correlated model success with difficulty scores derived from the success rates of ~10 k human solvers (r = 0.24–0.38), revealing modest alignment alongside systematic blind spots on common words with atypical spelling.
- Error pattern discovery: Identified a class of failures where models over‑rely on distributional plausibility, missing orthographically valid solutions for words like “data”, “poop”, and “loll”.
Methodology
- Puzzle Construction – 58 word‑puzzle instances were crafted, each requiring the model to output a word that satisfies explicit character constraints (e.g., “_a_a” → “data”).
- Human Baseline – Each puzzle was attempted by ~10 000 crowdworkers; the proportion of correct answers served as its difficulty score (lower success rate = harder puzzle).
- Model Suite – Three LLM families were selected: Qwen‑3 (open‑source), Claude Haiku‑4.5 (Anthropic), and GPT‑5‑mini (OpenAI), evaluated at multiple parameter scales (≈0.5 B → 4 B for the open Qwen‑3 models) and inference settings, yielding 28 total configurations.
- Inference Budget – Models were prompted with varying “thinking budgets” (number of generated tokens before final answer) to assess sensitivity to compute allocation.
- Evaluation Metrics – Primary metric: F1 score on constraint satisfaction (exact match on the required characters). Secondary analyses included correlation with human difficulty and a per‑word error breakdown; a minimal illustration of the constraint check and F1 follows this list.
- Statistical Analysis – Pairwise architectural comparisons used bootstrap confidence intervals; correlation with human difficulty employed Pearson’s r.
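The summary does not reproduce the authors’ scoring code, but the constraint check and a character‑level F1 of the kind described can be sketched in a few lines of Python. The “_”‑for‑blank pattern convention follows the “_a_a” → “data” example above; the position‑wise F1 below is an illustrative stand‑in, not the paper’s exact metric.

```python
def satisfies_pattern(answer: str, pattern: str) -> bool:
    """Check a candidate against a puzzle pattern such as '_a_a',
    where '_' is a free slot and any other character is fixed."""
    if len(answer) != len(pattern):
        return False
    return all(p == "_" or p == a for p, a in zip(pattern.lower(), answer.lower()))


def char_f1(answer: str, target: str) -> float:
    """Position-wise character F1 between a prediction and the target word.
    An illustrative stand-in for the paper's constraint-satisfaction F1."""
    matches = sum(a == t for a, t in zip(answer.lower(), target.lower()))
    if matches == 0:
        return 0.0
    precision = matches / len(answer)
    recall = matches / len(target)
    return 2 * precision * recall / (precision + recall)


# Example using the puzzle format above: "_a_a" -> "data"
print(satisfies_pattern("data", "_a_a"))   # True
print(satisfies_pattern("date", "_a_a"))   # False: the final slot must be 'a'
print(round(char_f1("dana", "data"), 3))   # 0.75: three of four positions match
```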
Results & Findings
- Architectural dominance: Qwen‑3 models achieved the highest average F1 (0.761), while Claude Haiku‑4.5 lagged (0.343). The gap persisted across all parameter scales.
- Scaling effect: Within a single family, moving from the smallest to the largest model improved F1 by ~0.08 (an ≈83 % relative gain), a modest improvement compared with the cross‑family gap.
- Thinking budget: High‑capacity models (≥2 B parameters) benefited from longer inference windows, gaining up to +0.136 F1. Mid‑size models (≈1 B) showed diminishing returns, with performance sometimes dropping as the budget grew.
- Human alignment: Model success correlated positively with human success rates (r = 0.24–0.38), indicating that models are roughly sensitive to the puzzle hardness humans experience, but far from perfectly calibrated to it (a toy correlation sketch follows this list).
- Systematic orthographic blind spots: For a subset of high‑frequency words with irregular spelling (“data”, “poop”, “loll”), human success exceeded 86 % while model miss rates ranged from 89 % to 96 %. Errors stem from the model favoring statistically common spelling patterns over the explicit constraint.
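To make the alignment analysis concrete, here is a minimal sketch of the Pearson correlation the paper uses, computed between per‑puzzle human success rates and model success. The numbers are made‑up toy values for illustration only, not the paper’s data.

```python
from scipy.stats import pearsonr

# Toy per-puzzle success rates (made-up values, NOT the paper's data):
# fraction of humans vs. fraction of model runs that solved each puzzle.
human_success = [0.92, 0.75, 0.60, 0.40, 0.88, 0.15]
model_success = [1.00, 0.50, 0.75, 0.25, 0.50, 0.00]

r, p_value = pearsonr(human_success, model_success)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```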
Practical Implications
- Tooling for constrained generation: Developers building autocomplete, code‑completion, or puzzle‑generation systems should not assume that larger models automatically handle strict character constraints; architecture matters more than raw size.
- Prompt engineering limits: Simple “think‑longer” tricks (e.g., increasing max tokens) only help high‑capacity models. For mid‑range models, developers may need to redesign prompts or add external validation loops.
- Hybrid pipelines: The identified failure modes suggest a practical architecture in which an LLM proposes candidates and a lightweight orthographic validator (regex or finite‑state automaton) filters them, ensuring hard constraints are met; see the sketch after this list.
- Domain‑specific fine‑tuning: Industries that rely on precise naming conventions (e.g., chemical nomenclature, product codes) could benefit from fine‑tuning on orthographically constrained datasets or adding auxiliary loss terms that penalize constraint violations.
- Benchmarking standards: The puzzle suite can serve as a quick sanity check for any new LLM before deployment in applications where spelling accuracy is mission‑critical (e.g., medical transcription, legal document drafting).
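A minimal sketch of the hybrid propose‑and‑validate pipeline mentioned above, assuming the “_a_a” pattern format used earlier. The propose_candidates callable (and the toy fake_llm stand‑in) are hypothetical placeholders for an actual LLM call; the regex validation step is the point.

```python
import re
from typing import Callable, Iterable, Optional


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a puzzle pattern like '_a_a' into a regex ('.a.a')."""
    body = "".join("." if ch == "_" else re.escape(ch) for ch in pattern)
    return re.compile(body)


def constrained_answer(pattern: str,
                       propose_candidates: Callable[[str], Iterable[str]]) -> Optional[str]:
    """Ask an LLM (via `propose_candidates`, a hypothetical stub) for candidate words,
    then keep only those that actually satisfy the orthographic constraint."""
    validator = pattern_to_regex(pattern)
    for candidate in propose_candidates(pattern):
        if validator.fullmatch(candidate.strip().lower()):
            return candidate
    return None  # fall back (e.g., dictionary lookup) if no candidate survives


# Toy stand-in for the LLM: distributionally plausible but mostly invalid candidates.
def fake_llm(pattern: str) -> list[str]:
    return ["date", "dame", "data"]


print(constrained_answer("_a_a", fake_llm))  # -> "data"
```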
Limitations & Future Work
- Scope of puzzles: The benchmark focuses on short English words; extending to longer phrases, multilingual orthographies, or domain‑specific vocabularies would test generality.
- Model families: Only three families were examined; newer architectures (e.g., mixture‑of‑experts, retrieval‑augmented models) might behave differently.
- Training data bias: The analysis attributes failures to “distributional plausibility,” but does not isolate whether the issue lies in pre‑training corpora, tokenization, or decoding strategies.
- Human difficulty granularity: Difficulty scores are aggregated across many solvers; future work could explore individual differences (e.g., native vs. non‑native speakers) to refine calibration metrics.
- Architectural innovations: The authors suggest specialized components (e.g., constraint‑aware attention heads) but leave concrete designs to subsequent research.
Bottom line: When your product demands that a language model obeys hard spelling rules, picking the right architecture—and possibly augmenting it with explicit constraint checks—will matter far more than simply scaling up the number of parameters.
Authors
- Bryan E. Tuck
- Rakesh M. Verma
Paper Information
- arXiv ID: 2511.21086v1
- Categories: cs.CL
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21086v1