[Paper] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Source: arXiv - 2511.21086v1
Overview
The paper Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models investigates how well today’s large language models (LLMs) solve word‑puzzle tasks that impose strict character‑level constraints (e.g., “fill in the blanks” while preserving spelling). By testing 28 model configurations across three families (Qwen‑3, Claude Haiku‑4.5, and GPT‑5‑mini), the authors show that systematic architectural differences dominate performance, far outweighing the gains from simply scaling model size.
Key Contributions
- Cross‑architecture benchmark: Introduced a 58‑puzzle suite that forces models to satisfy hard orthographic constraints, a setting rarely covered in standard LM evaluations.
- Large‑scale comparative study: Ran 28 configurations (three families, multiple parameter counts) to isolate the impact of architecture vs. parameter scaling.
- Quantified architectural advantage: Found a 2.0–2.2× performance gap (F1 = 0.761 vs. 0.343) between the best and worst families, dwarfing the 83 % gain from an eight‑fold parameter increase within a single family.
- Thinking‑budget analysis: Showed heterogeneous returns to larger “thinking budgets” (more reasoning tokens generated before the final answer); high‑capacity models improve (+0.102 to +0.136 F1) while mid‑size models plateau or even regress.
- Human‑difficulty calibration: Correlated model success with difficulty scores derived from the success rates of ~10 k human solvers (r = 0.24–0.38), revealing modest alignment alongside systematic blind spots on common words with atypical spelling.
- Error pattern discovery: Identified a class of failures where models over‑rely on distributional plausibility, missing orthographically valid solutions for words like “data”, “poop”, and “loll”.
Methodology
- Puzzle Construction – 58 word‑puzzle instances were crafted, each requiring the model to output a word that satisfies explicit character constraints (e.g., “_a_a” → “data”).
- Human Baseline – Each puzzle was attempted by ~10 000 crowdworkers; the proportion of correct answers served as its difficulty score (lower success rate = harder puzzle).
- Model Suite – Three LLM families were selected: Qwen‑3 (open‑source), Claude Haiku‑4.5 (Anthropic), and GPT‑5‑mini (OpenAI), evaluated at multiple parameter scales (≈0.5 B → 4 B for the open Qwen‑3 models) and inference settings, yielding 28 total configurations.
- Inference Budget – Models were prompted with varying “thinking budgets” (number of generated tokens before final answer) to assess sensitivity to compute allocation.
- Evaluation Metrics – Primary metric: F1 score on constraint satisfaction (exact match on the required characters). Secondary analyses included correlation with human difficulty and a per‑word error breakdown; a minimal illustration of the constraint check and F1 follows this list.
- Statistical Analysis – Pairwise architectural comparisons used bootstrap confidence intervals; correlation with human difficulty employed Pearson’s r.
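The summary does not reproduce the authors’ scoring code, but the constraint check and a character‑level F1 of the kind described can be sketched in a few lines of Python. The “_”‑for‑blank pattern convention follows the “_a_a” → “data” example above; the position‑wise F1 below is an illustrative stand‑in, not the paper’s exact metric.

```python
def satisfies_pattern(answer: str, pattern: str) -> bool:
    """Check a candidate against a puzzle pattern such as '_a_a',
    where '_' is a free slot and any other character is fixed."""
    if len(answer) != len(pattern):
        return False
    return all(p == "_" or p == a for p, a in zip(pattern.lower(), answer.lower()))


def char_f1(answer: str, target: str) -> float:
    """Position-wise character F1 between a prediction and the target word.
    An illustrative stand-in for the paper's constraint-satisfaction F1."""
    matches = sum(a == t for a, t in zip(answer.lower(), target.lower()))
    if matches == 0:
        return 0.0
    precision = matches / len(answer)
    recall = matches / len(target)
    return 2 * precision * recall / (precision + recall)


# Example using the puzzle format above: "_a_a" -> "data"
print(satisfies_pattern("data", "_a_a"))   # True
print(satisfies_pattern("date", "_a_a"))   # False: the final slot must be 'a'
print(round(char_f1("dana", "data"), 3))   # 0.75: three of four positions match
```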
Results & Findings
- Architectural dominance: Qwen‑3 models achieved the highest average F1 (0.761), while Claude Haiku‑4.5 lagged (0.343). The gap persisted across all parameter scales.
- Scaling effect: Within a single family, moving from the smallest to the largest model improved F1 by ~0.08 (an ≈83 % relative gain), a modest improvement compared with the cross‑family gap.
- Thinking budget: High‑capacity models (≥2 B parameters) benefited from longer inference windows, gaining up to +0.136 F1. Mid‑size models (≈1 B) showed diminishing returns, with performance sometimes dropping as the budget grew.
- Human alignment: Model success correlated positively with human success rates (r = 0.24–0.38), indicating that models are roughly sensitive to the puzzle hardness humans experience, but far from perfectly calibrated to it (a toy correlation sketch follows this list).
- Systematic orthographic blind spots: For a subset of high‑frequency words with irregular spelling (“data”, “poop”, “loll”), human success exceeded 86 % while model miss rates ranged from 89 % to 96 %. Errors stem from the model favoring statistically common spelling patterns over the explicit constraint.
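To make the alignment analysis concrete, here is a minimal sketch of the Pearson correlation the paper uses, computed between per‑puzzle human success rates and model success. The numbers are made‑up toy values for illustration only, not the paper’s data.

```python
from scipy.stats import pearsonr

# Toy per-puzzle success rates (made-up values, NOT the paper's data):
# fraction of humans vs. fraction of model runs that solved each puzzle.
human_success = [0.92, 0.75, 0.60, 0.40, 0.88, 0.15]
model_success = [1.00, 0.50, 0.75, 0.25, 0.50, 0.00]

r, p_value = pearsonr(human_success, model_success)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```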
Practical Implications
- Tooling for constrained generation: Developers building autocomplete, code‑completion, or puzzle‑generation systems should not assume that larger models automatically handle strict character constraints; architecture matters more than raw size.
- Prompt engineering limits: Simple “think‑longer” tricks (e.g., increasing max tokens) only help high‑capacity models. For mid‑range models, developers may need to redesign prompts or add external validation loops.
- Hybrid pipelines: The identified failure modes suggest a practical architecture in which an LLM proposes candidates and a lightweight orthographic validator (regex or finite‑state automaton) filters them, ensuring hard constraints are met; see the sketch after this list.
- Domain‑specific fine‑tuning: Industries that rely on precise naming conventions (e.g., chemical nomenclature, product codes) could benefit from fine‑tuning on orthographically constrained datasets or adding auxiliary loss terms that penalize constraint violations.
- Benchmarking standards: The puzzle suite can serve as a quick sanity check for any new LLM before deployment in applications where spelling accuracy is mission‑critical (e.g., medical transcription, legal document drafting).
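A minimal sketch of the hybrid propose‑and‑validate pipeline mentioned above, assuming the “_a_a” pattern format used earlier. The propose_candidates callable (and the toy fake_llm stand‑in) are hypothetical placeholders for an actual LLM call; the regex validation step is the point.

```python
import re
from typing import Callable, Iterable, Optional


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a puzzle pattern like '_a_a' into a regex ('.a.a')."""
    body = "".join("." if ch == "_" else re.escape(ch) for ch in pattern)
    return re.compile(body)


def constrained_answer(pattern: str,
                       propose_candidates: Callable[[str], Iterable[str]]) -> Optional[str]:
    """Ask an LLM (via `propose_candidates`, a hypothetical stub) for candidate words,
    then keep only those that actually satisfy the orthographic constraint."""
    validator = pattern_to_regex(pattern)
    for candidate in propose_candidates(pattern):
        if validator.fullmatch(candidate.strip().lower()):
            return candidate
    return None  # fall back (e.g., dictionary lookup) if no candidate survives


# Toy stand-in for the LLM: distributionally plausible but mostly invalid candidates.
def fake_llm(pattern: str) -> list[str]:
    return ["date", "dame", "data"]


print(constrained_answer("_a_a", fake_llm))  # -> "data"
```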
Limitations & Future Work
- Scope of puzzles: The benchmark focuses on short English words; extending to longer phrases, multilingual orthographies, or domain‑specific vocabularies would test generality.
- Model families: Only three families were examined; newer architectures (e.g., mixture‑of‑experts, retrieval‑augmented models) might behave differently.
- Training data bias: The analysis attributes failures to “distributional plausibility,” but does not isolate whether the issue lies in pre‑training corpora, tokenization, or decoding strategies.
- Human difficulty granularity: Difficulty scores are aggregated across many solvers; future work could explore individual differences (e.g., native vs. non‑native speakers) to refine calibration metrics.
- Architectural innovations: The authors suggest specialized components (e.g., constraint‑aware attention heads) but leave concrete designs to subsequent research.
Bottom line: When your product demands that a language model obeys hard spelling rules, picking the right architecture—and possibly augmenting it with explicit constraint checks—will matter far more than simply scaling up the number of parameters.
Authors
- Bryan E. Tuck
- Rakesh M. Verma
Paper Information
- arXiv ID: 2511.21086v1
- Categories: cs.CL
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21086v1