[Paper] When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following
Large reasoning models (LRMs) often improve performance on math and coding, but their effect on instruction following is unclear. We study this question on IFEval with Qwen3 models...