[Paper] Embedding Software Intent: Lightweight Java Module Recovery
Source: arXiv - 2512.15980v1
Overview
The paper introduces ClassLAR, a lightweight technique that automatically extracts Java 9 module definitions from large, monolithic codebases. By treating fully‑qualified class names as “software intent” and feeding them to a language model, the authors can recover modules that align closely with a system’s true architecture—far faster than existing recovery tools.
Key Contributions
- Intent‑driven recovery: Uses semantic cues from package and class names (via a language model) to infer architectural intent without needing heavyweight static analysis.
- Class‑and‑Language model (ClassLAR): A novel hybrid that combines simple name‑based features with contextual embeddings, achieving high fidelity to the ground‑truth module layout.
- Performance gains: Demonstrates 3.99× to 10.5× faster execution than state‑of‑the‑art recovery approaches on 20 real‑world Java projects.
- Empirical validation: Comprehensive evaluation on popular open‑source repositories showing superior architectural‑level similarity scores across multiple metrics.
Methodology
- Data collection – For each Java project, the tool extracts every fully‑qualified class name (e.g., org.apache.commons.io.FileUtils).
- Semantic embedding – A pre‑trained language model (e.g., BERT‑based) converts each name into a dense vector that captures lexical meaning (“File”, “Utils”, “io”).
- Clustering – Vectors are grouped using a lightweight clustering algorithm (e.g., hierarchical agglomerative clustering) that respects the hierarchical nature of package naming.
- Module inference – The resulting clusters are mapped to JPMS module descriptors (module-info.java), producing a set of modules that reflect both structural (package hierarchy) and functional (semantic similarity) intent.
- Evaluation – The recovered modules are compared against manually curated module layouts using architectural similarity metrics such as MoJoFM, NED, and package‑level cohesion.
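The first three steps can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: a sparse bag-of-tokens vector stands in for the language-model embedding, the 0.5 similarity threshold is arbitrary, and the function names (`tokenize`, `embed`, `cosine`, `cluster`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(fqcn):
    """Split a fully-qualified class name into lowercase word tokens."""
    tokens = []
    for part in fqcn.split("."):
        # Break identifiers such as "FileUtils" or "IOUtils" into words.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens]


def embed(fqcn):
    """Stand-in embedding (assumption): token counts, NOT a language model."""
    return Counter(tokenize(fqcn))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(names, threshold=0.5):
    """Average-linkage agglomerative clustering over class names.

    Merging stops once the most similar pair of clusters falls below
    `threshold` (an arbitrary value chosen for this illustration).
    """
    vectors = {n: embed(n) for n in names}
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        sims = [cosine(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters
```

With names such as org.example.io.FileReader, org.example.io.FileWriter, org.example.crypto.AesCipher and org.example.crypto.RsaCipher, this groups the two io classes together and the two crypto classes together. A real language-model embedding would also relate names that share no literal tokens, which this stand-in cannot do.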
The entire pipeline runs in a few seconds for medium‑size projects, because it avoids expensive byte‑code analysis and relies only on lightweight text processing and vector operations.
Results & Findings
| Metric | ClassLAR | Best Prior Art |
|---|---|---|
| MoJoFM (higher = more similar) | 85.2 % | 71.4 % |
| NED (lower = less error) | 0.12 | 0.27 |
| Runtime | 12 s (avg) | 48 s – 126 s |
- Higher architectural similarity: ClassLAR consistently produced module groupings that matched the developers’ intended architecture more closely than static‑dependency‑based recoveries.
- Speed: Because it only parses class names, the approach scales linearly and stays well under a minute even for projects with >10 k classes.
- Robustness: The language model captured nuanced intent (e.g., “crypto” vs. “security”) that pure package‑structure heuristics missed, leading to better functional cohesion within modules.
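For readers unfamiliar with the first metric in the table, MoJoFM is conventionally defined as below (following Wen and Tzerpos). Here mno(A, B) is the minimum number of Move and Join operations needed to transform a recovered decomposition A into the ground truth B, and the maximum ranges over all possible decompositions A' of the same classes. This summary does not say which variant the paper uses, so treat the formula as the standard definition rather than a confirmed detail.

```latex
\mathrm{MoJoFM}(A, B) = \left( 1 - \frac{\mathit{mno}(A, B)}{\max_{A'} \mathit{mno}(A', B)} \right) \times 100\%
```

A score of 100 % means the recovered modules already match the ground truth and no operations are needed.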
Practical Implications
- Rapid JPMS migration: Teams can bootstrap a module-info.java for legacy monoliths without manual refactoring, cutting migration time from weeks to hours.
- Continuous architecture monitoring: Integrate ClassLAR into CI pipelines to detect drift between code and declared modules, alerting developers before encapsulation bugs appear.
- Tooling ecosystem: The lightweight nature makes it easy to embed in IDE plugins, build tools (Maven/Gradle), or code‑review bots that suggest module boundaries on the fly.
- Improved maintainability: By aligning the codebase with explicit modules, developers gain stronger encapsulation guarantees, clearer dependency graphs, and better support for Java’s service‑loader mechanism.
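To make the migration point concrete, here is a rough sketch of how one recovered cluster could be turned into a draft module-info.java. The helper `render_module_info` is hypothetical and not part of ClassLAR. Exporting every package in the cluster is a simplifying assumption, and the `requires` list is left to the caller because class names alone do not reveal dependencies.

```python
def render_module_info(module_name, class_names, requires=()):
    """Render a draft JPMS descriptor for one recovered cluster.

    Assumption: every package that appears in `class_names` is exported.
    A real migration would usually export only the public API packages.
    """
    # The package is everything before the last dot of the class name.
    packages = sorted({name.rsplit(".", 1)[0] for name in class_names})
    lines = [f"module {module_name} {{"]
    lines += [f"    requires {dep};" for dep in requires]
    lines += [f"    exports {pkg};" for pkg in packages]
    lines.append("}")
    return "\n".join(lines)


draft = render_module_info(
    "org.example.io",
    ["org.example.io.FileReader", "org.example.io.FileWriter"],
    requires=["java.logging"],
)
print(draft)
```

JPMS does not allow the same package to appear in two named modules, so any cluster that splits a package across modules would need to be reconciled before a generated descriptor compiles.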
Limitations & Future Work
- Name quality dependency: Projects with poorly chosen package/class names (e.g., generic util packages) can degrade embedding quality, leading to less accurate modules.
- Language model scope: The current model is trained on general Java corpora; domain‑specific vocabularies (e.g., scientific computing) may require fine‑tuning.
- Dynamic behavior ignored: ClassLAR does not analyze runtime reflection or dynamic class loading, which could affect module boundaries in highly dynamic systems.
Future research directions include: incorporating lightweight static dependency graphs to complement name semantics, adapting the approach for other JVM languages (Kotlin, Scala), and exploring incremental recovery for continuously evolving codebases.
Authors
- Yirui He
- Yuqi Hu
- Xingyu Chen
- Joshua Garcia
Paper Information
- arXiv ID: 2512.15980v1
- Categories: cs.SE, cs.AI
- Published: December 17, 2025