[Paper] SACS: A Code Smell Dataset using Semi-automatic Generation Approach
Source: arXiv - 2602.15342v1
Overview
The paper introduces SACS, a publicly‑available dataset of code‑smell examples generated through a clever blend of automation and human review. By tackling the chronic shortage of high‑quality labeled data, the authors aim to accelerate machine‑learning research on automated detection and refactoring of problematic code patterns.
Key Contributions
- Semi‑automatic generation pipeline that first creates candidate smelly snippets with rule‑based heuristics, then partitions them for automated acceptance or manual inspection.
- Structured review process: clear annotation guidelines and a custom annotation tool concentrate human effort on the most ambiguous samples, boosting label reliability.
- SACS dataset: over 10 k labeled instances for each of three classic smells—Long Method, Large Class, and Feature Envy—released under an open‑source license.
- Benchmark baseline: the authors provide initial machine‑learning experiments demonstrating the dataset’s utility for training and evaluating smell detectors.
Methodology
- Rule‑based candidate extraction – The authors encode well‑known smell criteria (e.g., method length > X lines, class size > Y lines, method accessing many foreign class members) into static‑analysis scripts that scan large open‑source repositories.
- Metric‑driven grouping – Each candidate is scored on several orthogonal metrics (e.g., cyclomatic complexity, number of fields accessed, cohesion scores). Samples that clearly satisfy or violate the thresholds are auto‑accepted into “clean” or “smelly” groups.
- Manual review queue – Ambiguous cases fall into a second bucket. Trained reviewers use a purpose‑built annotation UI, guided by a detailed rubric (what counts as “feature envy”, edge‑case handling, etc.), to confirm or correct the label.
- Quality assurance – Inter‑rater agreement is measured, and any low‑agreement items are revisited. The final dataset is then split into training/validation/test folds for downstream ML work.
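The triage step above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the threshold values (`60` and `20` lines) are hypothetical, stand-ins for the paper's unstated cutoffs.

```python
# Illustrative triage for the semi-automatic pipeline: candidates whose
# metrics clearly pass or clearly fail the smell thresholds are auto-labeled;
# borderline cases are routed to the manual review queue.
# Threshold values are hypothetical, not the paper's actual cutoffs.

LONG_METHOD_SMELLY = 60   # lines of code: clearly smelly at or above this
LONG_METHOD_CLEAN = 20    # lines of code: clearly clean at or below this

def triage_long_method(loc: int) -> str:
    """Route a Long Method candidate by its lines of code (loc)."""
    if loc >= LONG_METHOD_SMELLY:
        return "smelly"          # auto-accepted positive example
    if loc <= LONG_METHOD_CLEAN:
        return "clean"           # auto-accepted negative example
    return "manual_review"       # ambiguous: human annotators decide

candidates = [12, 35, 80, 19, 61]
labels = {loc: triage_long_method(loc) for loc in candidates}
```

In the paper's pipeline, the same idea applies per smell and uses several metrics jointly (complexity, cohesion, foreign-member accesses) rather than a single size measure.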
Results & Findings
- Dataset size & balance: Each smell class contains over 10,000 samples with roughly a 1:1 ratio of smelly vs. clean code, providing a well‑balanced benchmark.
- Label reliability: Manual review achieved a Cohen’s κ of 0.82, indicating strong agreement among annotators.
- Baseline ML performance: Simple classifiers (Random Forest, SVM) trained on SACS reached F1 scores between 0.78 and 0.85 across the three smells, outperforming models trained on previously available, fully‑automatic datasets.
- Efficiency gain: The semi‑automatic pipeline reduced manual effort by ~70 % compared with a fully manual labeling effort, while preserving high label quality.
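The agreement statistic reported above is Cohen's κ, which corrects raw agreement for the agreement expected by chance. A self-contained sketch with made-up annotator labels (the paper's raw annotations are not reproduced here):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling with each
    # annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two reviewers on four ambiguous candidates:
r1 = ["smelly", "smelly", "clean", "clean"]
r2 = ["smelly", "clean", "clean", "clean"]
kappa = cohens_kappa(r1, r2)  # 0.5 for this toy example
```

A κ of 0.82, as reported for SACS, is conventionally read as "almost perfect" or at least "strong" agreement, which supports the label-reliability claim.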
Practical Implications
- Faster model prototyping – Developers can now train smell‑detection models on a large, vetted dataset without spending weeks on data collection.
- Tool integration – IDE plug‑ins or CI pipelines can adopt models trained on SACS to flag Long Method, Large Class, or Feature Envy early in the development cycle, reducing technical debt.
- Benchmarking & reproducibility – Researchers and industry teams have a common ground for comparing detection algorithms, encouraging more transparent progress in automated refactoring.
- Extensibility – The semi‑automatic framework can be adapted to other smells (e.g., God Class, Data Clumps), enabling organizations to generate custom datasets that reflect their codebase characteristics.
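The extensibility point can be made concrete: the rule stage is naturally a set of pluggable predicates over code metrics. The sketch below is an assumption about how such a framework could be organized, with thresholds chosen purely for illustration (not the paper's definitions of these smells):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SmellRule:
    """A candidate-generation rule: a smell name plus a predicate over metrics."""
    name: str
    predicate: Callable[[Dict[str, float]], bool]

# Hypothetical thresholds; a team would tune these to its own codebase
# and add rules for further smells (Data Clumps, Shotgun Surgery, ...).
RULES = [
    SmellRule("Long Method", lambda m: m["method_loc"] > 50),
    SmellRule("Large Class", lambda m: m["class_loc"] > 500),
    SmellRule("God Class",   lambda m: m["class_loc"] > 800 and m["num_methods"] > 40),
]

def candidate_smells(metrics: Dict[str, float]) -> List[str]:
    """Return the names of all rules a code unit's metrics trigger."""
    return [rule.name for rule in RULES if rule.predicate(metrics)]
```

New smells then require only a new `SmellRule`, while the downstream triage and manual-review machinery stays unchanged.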
Limitations & Future Work
- Scope of smells – The current release covers only three of the many documented smells; extending to a broader taxonomy remains an open task.
- Language bias – All samples were extracted from Java projects; cross‑language applicability (e.g., Python, C#) needs validation.
- Static analysis reliance – The rule‑based candidate generation may miss context‑dependent smells that require dynamic analysis.
- Future directions proposed include:
  - Incorporating runtime metrics.
  - Automating the review rubric via active learning.
  - Evaluating the impact of SACS‑trained models in real‑world development workflows.
Authors
- Hanyu Zhang
- Tomoji Kishi
Paper Information
- arXiv ID: 2602.15342v1
- Categories: cs.SE
- Published: February 17, 2026