[Paper] Using Small Language Models to Reverse-Engineer Machine Learning Pipeline Structures
Source: arXiv - 2601.03988v1
Overview
This paper investigates whether small language models (SLMs)—lightweight versions of the large AI models that power tools like GitHub Copilot—can automatically reverse‑engineer the structure of machine‑learning pipelines from raw source code. By doing so, the authors aim to replace brittle approaches that depend on manual labelling with a more scalable, adaptable solution that keeps pace with the fast‑moving ML ecosystem.
Key Contributions
- Empirical evaluation of SLMs for classifying pipeline stages (e.g., data ingestion, preprocessing, model training) directly from code snippets (see the sketch after this list).
- Statistical rigor: uses Cochran’s Q test to compare multiple SLMs, followed by McNemar’s tests to benchmark the best model against two prior state‑of‑the‑art studies.
- Taxonomy sensitivity analysis: shows how redefining the pipeline‑stage taxonomy influences classification performance.
- Goodness‑of‑fit comparison: aligns the insights derived from the SLM‑based extraction with findings from earlier manual/ML‑based analyses, using Pearson’s chi‑squared test.
- Open‑source tooling: releases the evaluation pipeline and annotated dataset, enabling reproducibility and further research.
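To make the classification contribution concrete, the sketch below loads a small code model with a sequence-classification head over an illustrative stage taxonomy. The base checkpoint, label set, and snippet are assumptions for illustration, not the authors' released setup; a fine-tuning step on the annotated corpus (described under Methodology) would come before any meaningful prediction.

```python
# Minimal sketch: a small code model (microsoft/codebert-base, standing in for
# the paper's SLMs) with a classification head over hypothetical pipeline-stage
# labels. Without fine-tuning, the predicted stage is essentially random.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

STAGES = ["data_ingestion", "preprocessing", "model_training", "evaluation"]  # illustrative taxonomy

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=len(STAGES),
    id2label=dict(enumerate(STAGES)),
)

snippet = "df = pd.read_csv('train.csv')\ndf = df.dropna()"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted stage:", STAGES[int(logits.argmax())])
```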
Methodology
- Dataset construction – The authors curated a corpus of open‑source ML projects (Python, R, Java) and manually annotated each file with its corresponding pipeline stage, creating a gold‑standard reference.
- Model selection – Several publicly available SLMs (e.g., CodeBERT‑small, GPT‑2‑distilled, StarCoder‑base) were fine‑tuned on a small portion of the annotated data.
- Statistical testing – two complementary tests were applied (see the sketch after this list):
  - Cochran’s Q test compared the per‑example correctness (correct vs. incorrect predictions) of all SLMs on the same test set, identifying the top performer.
  - McNemar’s tests (two separate tests) measured whether the best SLM’s predictions differed significantly from the results reported in two earlier benchmark papers.
- Taxonomy variation – The authors altered the granularity of the stage taxonomy (e.g., merging “feature engineering” and “data cleaning”) and reran Cochran’s Q to see the effect on model performance.
- Goodness‑of‑fit – Pearson’s chi‑squared test compared the distribution of extracted pipeline stages to the distributions reported in prior studies, checking for alignment (sketched further below).
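A minimal sketch of the two model comparisons with statsmodels, using toy per-example correctness vectors (1 = correct, 0 = wrong) on a shared test set rather than the paper's data:

```python
# Sketch of the statistical comparisons on toy correctness vectors.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
# correctness of three SLMs on the same 200 test examples (toy data)
outcomes = rng.integers(0, 2, size=(200, 3))

# Cochran's Q: do the models differ in accuracy on matched examples?
q = cochrans_q(outcomes)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# McNemar: best SLM (column 0) vs. a prior-work baseline (column 1),
# using the 2x2 table of agreements/disagreements.
best, baseline = outcomes[:, 0], outcomes[:, 1]
table = [
    [np.sum((best == 1) & (baseline == 1)), np.sum((best == 1) & (baseline == 0))],
    [np.sum((best == 0) & (baseline == 1)), np.sum((best == 0) & (baseline == 0))],
]
m = mcnemar(table, exact=True)
print(f"McNemar p = {m.pvalue:.4f}")
```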
All experiments were run on commodity GPUs, emphasizing the “small” nature of the models.
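The goodness‑of‑fit step in the last methodology bullet can be sketched with SciPy, comparing SLM‑extracted stage counts against expected counts derived from a prior study's reported proportions; all numbers here are illustrative placeholders:

```python
# Sketch of the Pearson chi-squared goodness-of-fit check.
# Counts and proportions are illustrative, not the paper's numbers.
import numpy as np
from scipy.stats import chisquare

# stage frequencies extracted by the SLM from a project corpus (toy counts)
observed = np.array([120, 340, 260, 80])       # ingestion, preprocessing, training, evaluation

# stage proportions reported by a prior manual study (toy proportions)
reference_props = np.array([0.15, 0.42, 0.32, 0.11])
expected = reference_props * observed.sum()    # scale to the same total

stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {pvalue:.4f}")  # p > 0.05 suggests the distributions align
```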
Results & Findings
- Best SLM: A distilled version of CodeBERT achieved 84 % macro‑F1, outperforming the baseline ML classifiers (≈72 % F1) used in earlier work.
- Statistical significance: Cochran’s Q test confirmed that the performance differences among the SLMs were significant (p < 0.01), with the distilled CodeBERT on top. McNemar’s tests showed no significant difference between the best SLM’s predictions and the results reported by the two reference studies (p > 0.05), indicating comparable insight quality.
- Taxonomy impact: Coarser taxonomies boosted accuracy by up to 6 %, while overly fine‑grained categories caused a drop, highlighting a trade‑off between detail and reliability (see the sketch after this list).
- Goodness‑of‑fit: The chi‑squared analysis found no statistically significant deviation between the SLM‑derived stage frequencies and those reported in prior manual analyses (at the 5 % significance level), suggesting the model captures real‑world data‑science practices.
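One way to picture the taxonomy effect: map fine‑grained stage labels onto a coarser taxonomy and recompute macro‑F1. The label names, merge mapping, and toy predictions below are assumptions for illustration, not the paper's taxonomy or results:

```python
# Sketch of the taxonomy-sensitivity idea: merge fine-grained stages into
# coarser ones and recompute macro-F1 on the same toy predictions.
from sklearn.metrics import f1_score

COARSEN = {
    "data_cleaning": "data_preparation",
    "feature_engineering": "data_preparation",
    "model_training": "model_training",
    "evaluation": "evaluation",
}

y_true = ["data_cleaning", "feature_engineering", "model_training", "evaluation",
          "feature_engineering", "data_cleaning"]
y_pred = ["feature_engineering", "feature_engineering", "model_training", "evaluation",
          "data_cleaning", "data_cleaning"]

fine = f1_score(y_true, y_pred, average="macro")
coarse = f1_score([COARSEN[y] for y in y_true],
                  [COARSEN[y] for y in y_pred], average="macro")
print(f"fine-grained macro-F1: {fine:.2f}, coarse macro-F1: {coarse:.2f}")
```

Merging confusable categories (here, cleaning vs. feature engineering) removes the errors between them, which is the mechanism behind the reported accuracy gain for coarser taxonomies.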
Practical Implications
- Automated code audits – DevOps teams can embed the SLM into CI pipelines to flag missing or mis‑ordered stages (e.g., training without validation) before deployment (see the sketch after this list).
- Tooling for data‑science governance – Enterprises can automatically generate pipeline documentation, aiding compliance and reproducibility without manual effort.
- Rapid onboarding – New team members can get a high‑level view of a project’s ML workflow by scanning source files, accelerating knowledge transfer.
- Ecosystem‑agnostic analysis – Because SLMs are lightweight and can be fine‑tuned on a few examples, the approach scales across languages and emerging libraries (e.g., PyTorch Lightning, Hugging Face Transformers).
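As a sketch of the audit use case, a CI step could compare the stages predicted for a repository against an expected ordering and fail the build when a stage is missing or out of order; the canonical order and stage names below are assumptions, not part of the paper's tooling:

```python
# Sketch of a CI-style audit: given the pipeline stages predicted per file,
# flag missing or mis-ordered stages. The canonical order is an assumption.
import sys

CANONICAL_ORDER = ["data_ingestion", "preprocessing", "model_training",
                   "validation", "deployment"]

def audit(predicted_stages: list[str]) -> list[str]:
    """Return human-readable findings for missing or out-of-order stages."""
    findings = [f"missing stage: {s}" for s in CANONICAL_ORDER
                if s not in predicted_stages]

    # keep only known stages, then check they appear in canonical order
    ranks = [CANONICAL_ORDER.index(s) for s in predicted_stages if s in CANONICAL_ORDER]
    if ranks != sorted(ranks):
        findings.append("stages appear out of order")
    return findings

if __name__ == "__main__":
    # e.g., stages predicted by the SLM, in source-file order
    issues = audit(["data_ingestion", "model_training", "preprocessing"])
    for issue in issues:
        print("WARNING:", issue)
    sys.exit(1 if issues else 0)
```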
Limitations & Future Work
- Dataset bias – The curated corpus leans heavily toward Python notebooks; results may differ for production‑grade Java/Scala pipelines.
- Granularity ceiling – Very fine‑grained stage distinctions (e.g., “hyper‑parameter search strategy”) remain challenging for SLMs.
- Model size vs. performance – While small models work well, the authors note that larger LLMs could push accuracy higher, at the cost of compute.
- Future directions – Extending the taxonomy to cover MLOps artifacts (Dockerfiles, CI configs), exploring few‑shot prompting with larger LLMs, and integrating the classifier into IDE plugins for real‑time feedback.
Authors
- Nicolas Lacroix
- Mireille Blay-Fornarino
- Sébastien Mosser
- Frédéric Precioso
Paper Information
- arXiv ID: 2601.03988v1
- Categories: cs.SE, cs.LG
- Published: January 7, 2026