[Paper] OODEval: Evaluating Large Language Models on Object-Oriented Design

Published: January 12, 2026 at 09:51 AM EST
4 min read
Source: arXiv - 2601.07602v1

Overview

The paper introduces OODEval, the first systematic benchmark for testing how well large language models (LLMs) can perform object‑oriented design (OOD) tasks such as generating class diagrams. By evaluating 29 state‑of‑the‑art LLMs on 50 curated design problems and a large human‑rated dataset, the authors reveal where current models excel (syntactic correctness) and where they still lag behind human designers (semantic richness and relationship modeling).

Key Contributions

  • OODEval benchmark – 50 manually crafted OOD tasks covering a spectrum of difficulty levels.
  • OODEval‑Human dataset – 940 undergraduate‑submitted class diagrams annotated by instructors, providing a realistic “human baseline.”
  • CLUE metric suite – a unified evaluation framework (Class Likeness Unified Evaluation) that measures both overall diagram correctness and fine‑grained design quality (e.g., method completeness, relationship accuracy).
  • Comprehensive empirical study – performance comparison of 29 LLMs, including open‑source (Qwen3‑Coder‑30B, Gemma3‑4B‑IT) and commercial models (GPT‑4o, DeepSeek‑R1).
  • Insightful analysis – impact of model size, code‑specialization, instruction tuning, task complexity, and requirement readability on OOD performance.
  • Error taxonomy – systematic categorization of common failure modes (keyword misuse, missing classes/relationships, omitted methods).

Methodology

  1. Benchmark Construction

    • Task design: 50 OOD scenarios (e.g., library management system, e‑commerce platform) were authored by software‑engineering experts and graded for difficulty.
    • Human baseline: 940 class diagrams submitted by undergraduate students were collected and independently rated by course instructors, providing a realistic human reference for comparison.
  2. Metric Design (CLUE)

    • Global correctness: Checks whether the generated diagram contains the required classes, attributes, and relationships.
    • Fine‑grained quality: Scores method signatures, visibility modifiers, inheritance, aggregation/composition, and naming conventions.
    • Scores are normalized to a 0–100 scale, enabling direct comparison across models and against human scores (a minimal scoring sketch follows this methodology list).
  3. Model Evaluation

    • Prompted each of the 29 LLMs with the same natural‑language requirement description.
    • Collected the generated class diagrams (UML‑style textual representation).
    • Applied CLUE automatically and, for the human dataset, used the instructor ratings as ground truth.
  4. Analysis Dimensions

    • RQ1: Overall correctness across models.
    • RQ2: How LLM performance stacks up against average and top human designers.
    • RQ3: Influence of model attributes (parameter count, code‑specialization, instruction tuning).
    • RQ4: Effect of task features (design complexity, readability).
    • RQ5: Qualitative “bad case” inspection to surface systematic weaknesses.
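
To make the scoring idea concrete, here is a minimal Python sketch of how a CLUE‑style score could combine a global class‑coverage check with fine‑grained overlap of attributes, methods, and relationships on a 0–100 scale. The diagram representation, weights, and matching rules are assumptions for illustration only; they are not the paper's actual CLUE implementation.

```python
# Hypothetical sketch of a CLUE-style score; NOT the paper's actual formula.
# A class diagram is represented as {class_name: {"attributes": set, "methods": set,
# "relations": set of (target, kind)}}. Weights and matching rules are assumptions.

def clue_score(generated: dict, reference: dict,
               w_global: float = 0.5, w_fine: float = 0.5) -> float:
    """Return an assumed 0-100 score combining global and fine-grained checks."""
    # Global correctness: fraction of required classes that appear at all.
    found = [c for c in reference if c in generated]
    global_part = len(found) / len(reference) if reference else 0.0

    # Fine-grained quality: per matched class, overlap of attributes,
    # methods, and relationships with the reference design.
    fine_parts = []
    for cls in found:
        ref, gen = reference[cls], generated[cls]
        for key in ("attributes", "methods", "relations"):
            expected = ref.get(key, set())
            if expected:
                fine_parts.append(len(expected & gen.get(key, set())) / len(expected))
    fine_part = sum(fine_parts) / len(fine_parts) if fine_parts else 0.0

    return 100.0 * (w_global * global_part + w_fine * fine_part)


if __name__ == "__main__":
    reference = {
        "Cart": {"attributes": {"items"}, "methods": {"addItem", "removeItem"},
                 "relations": {("Item", "aggregation")}},
        "Item": {"attributes": {"name", "price"}, "methods": set(), "relations": set()},
    }
    generated = {
        "Cart": {"attributes": {"items"}, "methods": {"addItem"},
                 "relations": {("Item", "aggregation")}},
        "Item": {"attributes": {"name"}, "methods": set(), "relations": set()},
    }
    print(f"CLUE-style score: {clue_score(generated, reference):.1f}")  # prints 87.5
```

With these toy inputs, the missing removeItem() method and the omitted price attribute pull the score down to 87.5 even though every class is present, mirroring how a metric of this kind penalizes semantic gaps on top of structural coverage.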

Results & Findings

  • Syntactic vs. Semantic Gap: All models achieved >90 % syntactic accuracy (correct UML syntax), but semantic scores (method completeness, relationship correctness) dropped to 55–70 %.
  • Top Performers: Qwen3‑Coder‑30B led the pack, closely followed by DeepSeek‑R1 and GPT‑4o. Notably, Gemma3‑4B‑IT (4 B parameters) outperformed GPT‑4o‑Mini, showing that a small, well‑tuned model can rival much larger generalist ones.
  • Human Comparison: The best LLMs matched the average undergraduate score (≈68 % CLUE) but remained roughly 15 percentage points behind the best human designers.
  • Model Drivers:
    • Larger parameter counts correlate positively with higher CLUE scores, but the effect plateaus beyond ~30 B.
    • Models fine‑tuned on code (e.g., “Coder” variants) consistently beat general‑purpose LLMs of similar size.
    • Instruction‑tuned models handle requirement phrasing better, reducing the readability penalty.
  • Task Difficulty: As the number of required classes and relationships grew, CLUE scores fell by ~8 % per additional class. Poorly worded requirements (low readability) caused a ~5 % drop.
  • Error Taxonomy: The most frequent mistakes were:
    1. Keyword misuse – inserting unrelated UML elements.
    2. Missing entities – omitting required classes or relationships.
    3. Method omission – forgetting essential operations (e.g., addItem() in a Cart class).
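
As an illustration of how such errors could be flagged automatically, the sketch below (reusing the simplified diagram representation from the earlier CLUE sketch) buckets differences between a generated and a reference diagram into the three reported categories. The relationship‑keyword list and matching rules are assumptions, not the paper's tooling.

```python
# Hypothetical checker that buckets diagram errors into the three categories above;
# the diagram representation and the UML relationship-keyword list are assumptions.

ALLOWED_RELATION_KINDS = {"association", "aggregation", "composition", "inheritance"}

def categorize_errors(generated: dict, reference: dict) -> dict:
    errors = {"keyword_misuse": [], "missing_entities": [], "method_omission": []}

    for cls, spec in reference.items():
        # Missing entities: required classes or relationships absent from the output.
        if cls not in generated:
            errors["missing_entities"].append(f"class {cls}")
            continue
        for target, kind in spec.get("relations", set()) - generated[cls].get("relations", set()):
            errors["missing_entities"].append(f"{cls} -> {target} ({kind})")
        # Method omission: expected operations not declared on the class.
        for method in spec.get("methods", set()) - generated[cls].get("methods", set()):
            errors["method_omission"].append(f"{cls}.{method}()")

    # Keyword misuse: relationship kinds that are not valid UML keywords.
    for cls, spec in generated.items():
        for target, kind in spec.get("relations", set()):
            if kind not in ALLOWED_RELATION_KINDS:
                errors["keyword_misuse"].append(f"{cls} -> {target} uses '{kind}'")

    return errors
```

Counting bucket sizes over a whole benchmark run would reproduce the kind of frequency breakdown this taxonomy reports.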

Practical Implications

  • Design‑assist tools: LLMs can reliably generate syntactically correct class diagrams, making them useful for rapid prototyping or as “draft‑first” assistants in IDE plugins.
  • Code‑generation pipelines: Since many downstream code generators consume class diagrams, integrating a high‑performing LLM (e.g., Qwen3‑Coder‑30B) can shorten the design‑to‑code cycle, especially for straightforward domains.
  • Educational support: LLMs that approach average student performance could serve as automated tutoring agents, offering instant feedback on design assignments.
  • Model selection guidance: For teams with limited compute budgets, a small, well‑tuned model such as Gemma3‑4B‑IT may deliver design quality comparable to larger generalist models.
  • Prompt engineering: Emphasizing clear, readable requirement statements (bullet lists, explicit relationships) can mitigate the readability penalty and improve output quality.
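
As a concrete example of the last point, the hypothetical helper below turns loose requirements into a bullet‑listed prompt that names every class, attribute, and relationship explicitly. The wording, the helper itself, and the PlantUML output request are assumptions for illustration, not the prompts used in the paper.

```python
# Hypothetical prompt-building helper illustrating the readability advice above.

def build_design_prompt(requirements: list[str], notation: str = "PlantUML") -> str:
    """Turn loose requirements into a bullet-listed, explicit design prompt."""
    bullets = "\n".join(f"- {r}" for r in requirements)
    return (
        "You are designing an object-oriented class diagram for the system below.\n\n"
        f"Requirements:\n{bullets}\n\n"
        f"Return the design as {notation}. List every class, its attributes, "
        "method signatures with visibility, and every relationship "
        "(inheritance, aggregation, composition) explicitly."
    )

print(build_design_prompt([
    "A Cart holds multiple Items and exposes addItem() and removeItem().",
    "An Item has a name and a price.",
    "A Customer owns exactly one Cart (composition).",
]))
```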

Limitations & Future Work

  • Benchmark scope: OODEval covers only class‑diagram generation; other design artifacts (sequence diagrams, architectural views) remain untested.
  • Human dataset bias: The undergraduate submissions reflect a single academic curriculum and may not capture professional design standards.
  • Metric granularity: While CLUE balances global and fine‑grained aspects, it does not evaluate design principles such as SOLID or domain‑driven design patterns.
  • Dynamic designs: The study focuses on static structural models; future work could explore how LLMs handle behavioral specifications (methods’ internal logic, state machines).
  • Iterative refinement: Current evaluation is single‑shot; investigating multi‑turn interactions where the model can ask clarification questions may close the semantic gap.

Bottom line: LLMs are now competent enough to draft correct‑looking class diagrams, but achieving human‑level semantic richness still requires better model training, richer prompts, and possibly interactive refinement loops. Developers can start leveraging these models for early‑stage design, while researchers have a clear roadmap for pushing the boundary toward truly intelligent design assistants.

Authors

  • Bingxu Xiao
  • Yunwei Dong
  • Yiqi Tang
  • Manqing Zhang
  • Yifan Zhou
  • Chunyan Ma
  • Yepang Liu

Paper Information

  • arXiv ID: 2601.07602v1
  • Categories: cs.SE
  • Published: January 12, 2026