[Paper] SEMODS: A Validated Dataset of Open-Source Software Engineering Models
Source: arXiv - 2601.00635v1
Overview
The paper introduces SEMODS, a curated, validated dataset of 3,427 open‑source software‑engineering (SE) models harvested from Hugging Face. By systematically cataloguing these models and linking them to concrete SE tasks (e.g., bug triage, code summarisation, test generation), the authors give developers a “one‑stop shop” for discovering and re‑using AI models that are actually relevant to software‑engineering workflows.
Key Contributions
- Large‑scale SE model collection – 3,427 models scraped from Hugging Face, covering a wide spectrum of SE activities across the software lifecycle.
- Hybrid validation pipeline – Combines automated filtering, manual expert annotation, and Large Language Model (LLM) assistance to ensure high‑quality, trustworthy entries.
- Task‑centric taxonomy – Each model is mapped to a well‑defined SE task and development activity (e.g., code completion, requirements analysis, defect prediction).
- Standardised evaluation metadata – Uniform representation of reported metrics (accuracy, BLEU, F1, etc.) enables apples‑to‑apples comparisons.
- Open‑access dataset & tooling – The authors release the dataset, annotation schema, and scripts for reproducibility and community extension.
Methodology
- Automated Harvesting – Queried the Hugging Face Model Hub using SE‑related keywords and tags, pulling raw metadata for every candidate model.
- Pre‑filtering – Applied simple heuristics (e.g., presence of “code”, “bug”, or “test” in the model description) to trim the initial pool to a manageable subset; a minimal sketch of these first two steps appears after this list.
- Manual Annotation – A team of SE researchers inspected each remaining model, assigning it to a task from a pre‑defined taxonomy and verifying that the model truly targets SE.
- LLM‑Assisted Review – Employed a state‑of‑the‑art LLM to suggest task labels and flag ambiguous entries, which were then confirmed or corrected by humans.
- Standardisation of Metrics – Normalised reported evaluation results into a common JSON schema (model ID, task, dataset, metric name, value, evaluation split); an illustrative record is shown after the pipeline summary below.
- Validation & Release – Measured inter‑annotator agreement (Cohen’s κ ≈ 0.78) and packaged the final dataset with scripts for loading, querying, and extending it.
The pipeline balances scalability (automated scraping) with reliability (human‑in‑the‑loop checks), making it feasible to keep the catalogue up‑to‑date as new models appear.
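A normalised evaluation record with the fields listed in the methodology might look like the following; the repository name and values are purely illustrative, not entries taken from SEMODS.

```python
# One normalised evaluation record with the fields listed in the methodology.
# The repository name and numbers are illustrative, not entries from SEMODS.
record = {
    "model_id": "org/example-code-model",  # hypothetical Hugging Face repo
    "task": "code generation",
    "dataset": "HumanEval",
    "metric": "pass@1",
    "value": 0.38,
    "split": "test",
}
```

On the agreement figure: Cohen’s κ is defined as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between annotators and p_e the agreement expected by chance, so κ ≈ 0.78 is conventionally read as substantial agreement.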
Results & Findings
- Coverage – The final SEMODS catalogue spans 12 SE task categories (e.g., code generation, issue classification, documentation generation) and includes models ranging from small fine‑tuned BERT variants to large code‑centric transformers.
- Quality Assurance – Manual validation confirmed that more than 92 % of the models truly address SE problems; the remaining ~8 % were either mis‑tagged or generic language models.
- Metric Uniformity – By normalising evaluation results, the authors surfaced trends such as “code‑completion models on the HumanEval benchmark achieve a median pass@1 of 38 %” (a reference implementation of the pass@k estimator is sketched after this list).
- Discovery Insights – Querying the dataset revealed under‑explored niches (e.g., models for requirements‑traceability) and highlighted popular datasets that dominate model evaluation (e.g., CodeSearchNet, Defects4J).
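For context on the pass@1 figure, the sketch below is the standard unbiased pass@k estimator commonly used with HumanEval (Chen et al., 2021); SEMODS stores the values reported by model authors rather than recomputing them, and the sample counts in the example are made up purely to illustrate the arithmetic.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c pass the tests, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative arithmetic only: 76 passing out of 200 samples -> pass@1 = 0.38
print(round(pass_at_k(n=200, c=76, k=1), 2))
```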
Practical Implications
- Faster Model Selection – Developers can query SEMODS for “models that generate unit tests for Python” and instantly retrieve a ranked list with performance numbers, cutting down the trial‑and‑error phase (see the query sketch after this list).
- Benchmarking Made Easy – Researchers and product teams can pull the standardised metric table to benchmark a new model against the community baseline without re‑running all experiments.
- Model Adaptation & Fine‑Tuning – Knowing which existing models already target a specific SE task helps teams decide whether to fine‑tune an off‑the‑shelf model or train from scratch, saving compute resources.
- Ecosystem Transparency – By exposing the provenance and evaluation details of each model, SEMODS encourages reproducibility and reduces the risk of deploying poorly‑validated AI components in critical development pipelines.
- Tooling Integration – The released Python API can be embedded into CI/CD pipelines, IDE extensions, or internal model registries to automatically suggest the most suitable model for a given code‑base or workflow.
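As a sketch of the kind of query mentioned in the first bullet, the snippet below filters the released metadata with pandas; the file name and column names (model_id, task, language, dataset, metric, value) are assumptions about the dataset layout rather than the actual SEMODS API.

```python
import pandas as pd

# Hypothetical layout: one row per (model, task, dataset, metric) record.
# The file name and column names are assumptions for illustration only.
catalogue = pd.read_json("semods_catalogue.json")

hits = (
    catalogue
    .query("task == 'test generation' and language == 'Python'")
    .sort_values("value", ascending=False)  # rank by the reported metric value
)
print(hits[["model_id", "metric", "dataset", "value"]].head(10))
```

In practice the released Python API or annotation schema may expose different field names; the point is that the standardised records make this kind of filter‑and‑rank query trivial.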
Limitations & Future Work
- Static Snapshot – Although the collection process is repeatable, SEMODS reflects the state of the Hugging Face hub at the time of the study; continuous crawling and incremental updates are needed to keep it current.
- Task Taxonomy Granularity – The current taxonomy groups some nuanced activities (e.g., “bug localisation” vs. “bug triage”) under broader headings, which may limit fine‑grained searches.
- Metric Diversity – Not all models report the same set of metrics, and some evaluation results are missing or based on proprietary datasets, constraining direct comparisons.
- Human Annotation Bottleneck – Scaling the manual validation step to tens of thousands of models will require more sophisticated LLM‑assisted labeling or crowdsourced verification.
Future work outlined by the authors includes automating periodic re‑crawls, expanding the task taxonomy with community feedback, and integrating usage statistics (download counts, star ratings) to surface “popular” as well as “high‑performing” models.
If you’re building AI‑augmented developer tools, SEMODS offers a ready‑made map of the open‑source model landscape—think of it as a “model marketplace” tailored for software engineering.
Authors
- Alexandra González
- Xavier Franch
- Silverio Martínez-Fernández
Paper Information
- arXiv ID: 2601.00635v1
- Categories: cs.SE
- Published: January 2, 2026