[Paper] Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code
Source: arXiv - 2511.20933v1
Overview
Large language models (LLMs) are now commonplace assistants for developers, but how well they truly understand core software‑design principles, specifically cohesion (how tightly related the parts of a module are) and coupling (how dependent modules are on each other), has been an open question. This paper presents a systematic, hierarchical evaluation of the DeepSeek‑R1 family (14B, 32B, and 70B parameters) to expose where these models succeed, where they stumble, and what that means for real‑world coding workflows.
Key Contributions
- Benchmark for design reasoning – A novel, programmatically generated suite of “poorly designed” code snippets that target cohesion and coupling violations.
- Multi‑level evaluation protocol – Three task modes (Verification, Guided, Open‑ended Generation) that progressively reduce external guidance, mimicking realistic developer interactions.
- Noise‑robustness analysis – Systematic injection of distractor code and comments to test model resilience to irrelevant context.
- Empirical findings on asymmetry – Demonstrates that coupling reasoning collapses dramatically under noisy, open‑ended conditions, while cohesion analysis stays comparatively stable—yet both degrade without any guidance.
- Trace‑level diagnostics – Uses the models’ internal reasoning traces to uncover “cognitive shortcutting” for coupling and exhaustive (but still error‑prone) reasoning for cohesion.
Methodology
- Synthetic code generation – The authors built a generator that creates small programs with intentional design flaws: low cohesion (unrelated functions packed together) and high coupling (tight inter‑module dependencies); an illustrative example of such output appears after this list.
- Task hierarchy – three modes with progressively less guidance (prompt sketches for each follow this list):
- Verification – The model receives a code fragment and a yes/no question (“Is this module cohesive?”).
- Guided – The model is given step‑by‑step hints and asked to explain whether, and why, the fragment violates cohesion or coupling.
- Open‑ended Generation – The model must rewrite or refactor the code to improve its design, without any explicit hints about what is wrong.
- Contextual noise – Randomly inserted unrelated functions, comments, or variable names act as distractors, simulating the messy files developers often work with (an injection sketch appears below).
- Metrics – Standard precision/recall/F1 for the classification tasks, plus BLEU‑style similarity scores for generated refactorings (a reference F1 computation is sketched below).
- Reasoning‑trace analysis – The LLM’s chain‑of‑thought output is parsed to check whether the model follows a logical design‑analysis path or shortcuts to a guess (a toy heuristic is sketched below).
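
To make the two flaw types concrete, the snippet below shows the kind of seeded defects such a generator might emit. It is a minimal illustration with made‑up class and function names, not an excerpt from the paper's benchmark.

```python
# Illustrative only: names and flaws are hypothetical, not from the benchmark.

# Low cohesion: unrelated responsibilities packed into a single class.
class UtilityGrabBag:
    def parse_csv_row(self, row: str) -> list:
        return row.split(",")                        # data parsing

    def send_welcome_email(self, address: str) -> None:
        print(f"Sending welcome mail to {address}")  # notification logic

    def compute_invoice_tax(self, amount: float) -> float:
        return amount * 0.2                          # billing logic


# High coupling: one class reaching into another's internals.
class OrderStore:
    def __init__(self) -> None:
        self._orders = {}  # intended to be private


class ReportGenerator:
    def total_revenue(self, store: OrderStore) -> float:
        # Depends on OrderStore's private attribute instead of a public API,
        # so any internal change to OrderStore breaks this class.
        return sum(store._orders.values())
```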
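The three task modes can be read as progressively less constrained prompts. The templates below are a minimal sketch of that hierarchy; the wording is assumed for illustration and is not the prompts used in the paper.

```python
# Hypothetical prompt templates for the three evaluation modes.

def verification_prompt(code: str) -> str:
    # Binary judgement, with the design property named explicitly.
    return f"Is the following module cohesive? Answer yes or no.\n\n{code}"

def guided_prompt(code: str) -> str:
    # Step-by-step hints steer the model through the analysis.
    return (
        "Analyse the module below step by step:\n"
        "1. List each function and its responsibility.\n"
        "2. Check whether the responsibilities belong together (cohesion).\n"
        "3. Check which other modules it reads or writes directly (coupling).\n"
        "4. State which principle, if any, is violated and why.\n\n"
        f"{code}"
    )

def open_ended_prompt(code: str) -> str:
    # No hints about the flaw; the model must find and fix it on its own.
    return f"Refactor the following code to improve its design:\n\n{code}"
```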
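The contextual‑noise condition can be approximated by splicing unrelated definitions and comments into a snippet at random positions. The helper below is a hypothetical sketch of that idea; the paper's actual distractor sources and placement strategy may differ.

```python
import random

# Hypothetical distractor pool; the paper's noise sources may differ.
DISTRACTORS = [
    "def legacy_format_date(ts):\n    return str(ts)  # dead code\n",
    "# TODO: remove after migration to v2 API\n",
    "UNUSED_RETRY_LIMIT = 3\n",
]

def inject_noise(code: str, n_distractors: int = 2, seed: int = 0) -> str:
    """Insert unrelated definitions/comments between the original lines."""
    rng = random.Random(seed)
    lines = code.splitlines(keepends=True)
    for _ in range(n_distractors):
        pos = rng.randint(0, len(lines))
        lines.insert(pos, rng.choice(DISTRACTORS))
    return "".join(lines)
```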
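For the classification‑style tasks, the reported numbers are standard binary precision, recall, and F1 over the model's verdicts (F1 = 2PR / (P + R)). A minimal reference computation, included only for completeness, is:

```python
def precision_recall_f1(y_true: list, y_pred: list) -> tuple:
    """Standard binary P/R/F1 (label 1 = 'design violation present')."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 3 correct positives, 1 false positive, 1 missed violation.
print(precision_recall_f1([1, 1, 1, 0, 1], [1, 1, 1, 1, 0]))  # (0.75, 0.75, 0.75)
```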
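The paper's trace‑level diagnostics are summarised here only at a high level. As one way to picture how "cognitive shortcutting" could be flagged, the heuristic below measures what fraction of the known ground‑truth dependencies a reasoning trace actually mentions; a low score on a coupling question suggests a surface‑level guess. This is a hypothetical proxy, not the authors' method.

```python
def dependency_coverage(trace: str, ground_truth_deps: set) -> float:
    """Fraction of known (caller, callee) module pairs mentioned in the trace.

    A low value on a coupling question suggests the model guessed from surface
    cues instead of walking the dependencies (a 'cognitive shortcut').
    """
    if not ground_truth_deps:
        return 1.0
    mentioned = sum(
        1 for caller, callee in ground_truth_deps
        if caller in trace and callee in trace
    )
    return mentioned / len(ground_truth_deps)

# Example: the trace names only one of two dependent module pairs.
trace = "OrderService calls PaymentGateway directly, so coupling is high."
deps = {("OrderService", "PaymentGateway"), ("OrderService", "InventoryDB")}
print(dependency_coverage(trace, deps))  # 0.5
```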
Results & Findings
| Task | Cohesion F1 (ideal) | Cohesion F1 (noisy) | Coupling F1 (ideal) | Coupling F1 (noisy) |
|---|---|---|---|---|
| Verification | 0.88 | 0.84 | 0.86 | 0.81 |
| Guided | 0.91 | 0.89 | 0.89 | 0.62 |
| Open‑ended Generation | 0.78 | 0.75 | 0.73 | 0.33 |
- Cohesion: Even with distractors, guided tasks keep performance above 0.85 F1; the drop is modest.
- Coupling: In the open‑ended setting, F1 plummets by more than 50% when noise is present (0.73 → 0.33), indicating brittle reasoning.
- Model size effect: The 70 B model consistently outperforms the smaller variants, but the asymmetry between cohesion and coupling persists across scales.
- Trace analysis: For coupling, models often skip detailed dependency checks (“shortcut”) and guess based on surface cues. For cohesion, they enumerate function responsibilities but still miss subtle violations when guidance disappears.
Practical Implications
- Code review assistants – LLMs can reliably flag obvious cohesion issues (e.g., unrelated functions in the same file) even in messy repositories, making them useful as a first‑pass reviewer.
- Automated refactoring tools – Current models are not yet trustworthy for autonomous coupling reduction (e.g., extracting services, decoupling modules) in production codebases; human oversight remains essential.
- Prompt engineering – Providing structured, step‑by‑step guidance dramatically improves coupling reasoning. Teams can embed “guided” prompts into IDE extensions to boost model reliability.
- Noise handling – Since real code often contains dead code, comments, and legacy snippets, developers should consider pre‑filtering or context‑window management before feeding code to LLMs; a minimal filtering sketch follows this list.
- Model selection – Larger models give a modest edge, but the fundamental design‑reasoning gap is architectural, not just a matter of parameter count.
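
One lightweight way to do such pre‑filtering, assuming a single‑file Python module, is to round‑trip the source through the standard ast module (which discards comments) and keep only the top‑level definitions of interest. The sketch below is illustrative and is not a substitute for real dependency analysis.

```python
import ast

def strip_context(source: str, keep: set) -> str:
    """Drop comments and top-level definitions whose names are not in `keep`.

    ast.parse discards comments; ast.unparse (Python 3.9+) regenerates the
    remaining code. A crude filter, not a real dependency analysis.
    """
    tree = ast.parse(source)
    tree.body = [
        node for node in tree.body
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        or node.name in keep
    ]
    return ast.unparse(tree)

# Example: keep only `compute_total` before sending the file to an LLM.
noisy = """
# legacy helper, unused
def old_formatter(x): return str(x)

def compute_total(prices):
    return sum(prices)
"""
print(strip_context(noisy, keep={"compute_total"}))
```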
Limitations & Future Work
- Synthetic bias – The benchmark relies on generated code patterns, which may not capture the full diversity of real‑world design smells.
- Single model family – Only DeepSeek‑R1 variants were evaluated; results could differ for other LLMs (e.g., GPT‑4, Claude).
- Static analysis only – Dynamic behavior (runtime coupling) is not considered, limiting applicability to performance‑critical systems.
- Future directions – Expanding the dataset with open‑source projects, exploring multi‑turn interactive debugging sessions, and integrating external static‑analysis tools to augment LLM reasoning.
Authors
- Mootez Saad
- Boqi Chen
- José Antonio Hernández López
- Dániel Varró
- Tushar Sharma
Paper Information
- arXiv ID: 2511.20933v1
- Categories: cs.SE
- Published: November 25, 2025
- PDF: https://arxiv.org/pdf/2511.20933v1