[Paper] When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Published: 3 days ago (June 8, 2026 at 11:45 AM EDT)

2 min read

Source: arXiv

Source: arXiv - 2606.09662v1

Overview

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors—some prompts improve while others worsen—rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan’s opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).

Key Contributions

This paper presents research in the following areas:

cs.CL

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CL.

Authors

Sai Adith Senthil Kumar

Paper Information

arXiv ID: 2606.09662v1
Categories: cs.CL
Published: June 8, 2026
PDF: Download PDF

[Paper] When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

[Paper] Doc-to-Atom: Learning to Compile and Compose Memory Atoms

[Paper] Redesign Mixture-of-Experts Routers with Manifold Power Iteration

[Paper] System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5