[Paper] Embedding Software Intent: Lightweight Java Module Recovery

Published: December 17, 2025 at 04:24 PM EST

Source: arXiv - 2512.15980v1

Overview

The paper introduces ClassLAR, a lightweight technique that automatically recovers Java 9 (JPMS) module definitions from large, monolithic codebases. By treating fully‑qualified class names as “software intent” and feeding them to a language model, the authors recover modules that align closely with a system’s true architecture, and they do so 3.99× to 10.5× faster than existing recovery tools.

Key Contributions

Intent‑driven recovery: Uses semantic cues from package and class names (via a language model) to infer architectural intent without needing heavyweight static analysis.
Class‑and‑Language model (ClassLAR): A novel hybrid that combines simple name‑based features with contextual embeddings, achieving high fidelity to the ground‑truth module layout.
Performance gains: Demonstrates 3.99× to 10.5× faster execution than state‑of‑the‑art recovery approaches on 20 real‑world Java projects.
Empirical validation: Comprehensive evaluation on popular open‑source repositories showing superior architectural‑level similarity scores across multiple metrics.

Methodology

Data collection – For each Java project, the tool extracts every fully‑qualified class name (e.g., org.apache.commons.io.FileUtils).
Semantic embedding – A pre‑trained language model (e.g., BERT‑based) converts each name into a dense vector that captures lexical meaning (“File”, “Utils”, “io”).
Clustering – Vectors are grouped using a lightweight clustering algorithm (e.g., hierarchical agglomerative clustering) that respects the hierarchical nature of package naming.
Module inference – The resulting clusters are mapped to JPMS module descriptors (module-info.java), producing a set of modules that reflect both structural (package hierarchy) and functional (semantic similarity) intent.
Evaluation – The recovered modules are compared against manually curated module layouts using architectural similarity metrics such as MoJoFM, NED, and package‑level cohesion.
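The embedding and clustering steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a bag of name tokens stands in for the language-model embedding, average-link agglomerative clustering with a cosine threshold stands in for the paper's clustering step, and the class names, the `threshold` value, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(fqcn):
    """Split a fully-qualified class name into lowercase word tokens."""
    tokens = []
    for part in fqcn.split("."):
        # Break identifiers such as "FileUtils" or "IOUtils" into words.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens]


def embed(fqcn):
    """Stand-in embedding (assumption): token counts, not a language-model vector."""
    return Counter(tokenize(fqcn))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(names, threshold=0.5):
    """Average-link agglomerative clustering; stops below the similarity threshold."""
    vectors = {n: embed(n) for n in names}
    clusters = [[n] for n in names]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = max(
            (avg_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i is unaffected by the pop
        clusters[i].extend(merged)
    return clusters


# Hypothetical class names, for illustration only.
names = [
    "com.shop.payment.PaymentService",
    "com.shop.payment.PaymentGateway",
    "com.shop.inventory.InventoryService",
    "com.shop.inventory.InventoryRepository",
]
modules = cluster(names)
```

With these inputs the two payment classes and the two inventory classes end up in separate groups, because the shared `com` and `shop` tokens alone do not push the cross-group similarity above the threshold. A real language-model embedding would also relate names that share no literal token, which this token-count stand-in cannot do.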

The entire pipeline runs in a few seconds for medium‑size projects, because it avoids expensive byte‑code analysis and relies only on lightweight text processing and vector operations.
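For the module inference step, turning one cluster of class names into a module-info.java descriptor is plain text processing. The sketch below is an assumption about how such a mapping could look, not the paper's procedure: it names the module after the longest common package prefix, exports every package found in the cluster, and omits `requires` directives because they cannot be derived from names alone. The function names and the fallback module name are hypothetical.

```python
def package_of(fqcn):
    """Return the package part of a fully-qualified class name."""
    return fqcn.rsplit(".", 1)[0]


def common_prefix(packages):
    """Longest dot-separated prefix shared by all package names."""
    prefix = []
    for parts in zip(*(p.split(".") for p in packages)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


def render_module_info(cluster):
    """Render a draft module-info.java for one cluster of class names."""
    packages = sorted({package_of(c) for c in cluster})
    # Naming rule is an assumption: common package prefix, else a placeholder.
    name = common_prefix(packages) or "recovered.module"
    lines = [f"module {name} {{"]
    lines += [f"    exports {p};" for p in packages]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical cluster, for illustration only.
descriptor = render_module_info([
    "com.shop.payment.PaymentService",
    "com.shop.payment.api.PaymentGateway",
])
print(descriptor)
```

A draft produced this way still needs review: exporting every package is the most permissive choice, and tightening the exports is where the encapsulation benefit of JPMS comes from.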

Results & Findings

Metric	ClassLAR	Best Prior Art
MoJoFM (higher = more similar)	85.2 %	71.4 %
NED (lower = less error)	0.12	0.27
Runtime (seconds)	12 s (avg)	48 s – 126 s

Higher architectural similarity: ClassLAR consistently produced module groupings that matched the developers’ intended architecture more closely than static‑dependency‑based recoveries.
Speed: Because it parses only class names, the approach is reported to scale linearly and stays well under a minute even for projects with more than 10 k classes.
Robustness: The language model captured nuanced intent (e.g., “crypto” vs. “security”) that pure package‑structure heuristics missed, leading to better functional cohesion within modules.
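The runtime row of the table can be cross-checked against the 3.99× to 10.5× speedup range quoted under Key Contributions. The short calculation below uses only the figures from the table. The small gap between 4.00 and 3.99 at the low end is presumably due to rounding of the reported averages, which is an assumption rather than something the summary states.

```python
def speedup(baseline_seconds, classlar_seconds):
    """Ratio of a baseline runtime to the ClassLAR runtime."""
    return baseline_seconds / classlar_seconds


classlar_avg = 12.0                           # seconds, from the table
prior_fastest, prior_slowest = 48.0, 126.0    # seconds, from the table

low = speedup(prior_fastest, classlar_avg)
high = speedup(prior_slowest, classlar_avg)
print(f"{low:.2f}x to {high:.2f}x")  # prints "4.00x to 10.50x"
```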

Practical Implications

Rapid JPMS migration: Teams can bootstrap a draft module-info.java for legacy monoliths instead of deriving module boundaries by hand, potentially cutting migration time from weeks to hours.
Continuous architecture monitoring: Integrate ClassLAR into CI pipelines to detect drift between code and declared modules, alerting developers before encapsulation bugs appear.
Tooling ecosystem: The lightweight nature makes it easy to embed in IDE plugins, build tools (Maven/Gradle), or code‑review bots that suggest module boundaries on the fly.
Improved maintainability: By aligning the codebase with explicit modules, developers gain stronger encapsulation guarantees, clearer dependency graphs, and better support for Java’s service‑loader mechanism.
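The continuous-monitoring idea can be illustrated with a simple drift check that a CI job might run. This is a sketch under stated assumptions, not a ClassLAR feature: it compares the packages of each declared module against the recovered clusters using Jaccard overlap and flags modules whose best match falls below a hypothetical `min_overlap` threshold. The module names, package names, and function names are all made up for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of package names."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def detect_drift(declared, recovered, min_overlap=0.8):
    """Return declared modules whose best-matching recovered cluster overlaps too little."""
    drifted = {}
    for module, packages in declared.items():
        best = max((jaccard(packages, c) for c in recovered), default=0.0)
        if best < min_overlap:
            drifted[module] = best
    return drifted


# Hypothetical declared layout and recovered clusters.
declared = {
    "shop.payment": {"com.shop.payment", "com.shop.payment.api"},
    "shop.inventory": {"com.shop.inventory"},
}
recovered = [
    {"com.shop.payment", "com.shop.payment.api"},
    {"com.shop.inventory", "com.shop.billing"},
]
drift = detect_drift(declared, recovered)
```

Here `shop.payment` matches a recovered cluster exactly, while `shop.inventory` overlaps only half of its best match and is reported. A CI step could fail the build or post a warning whenever `drift` is non-empty. The right threshold would have to be tuned per project.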

Limitations & Future Work

Name quality dependency: Projects with poorly chosen package/class names (e.g., generic util packages) can degrade embedding quality, leading to less accurate modules.
Language model scope: The current model is trained on general Java corpora; domain‑specific vocabularies (e.g., scientific computing) may require fine‑tuning.
Dynamic behavior ignored: ClassLAR does not analyze runtime reflection or dynamic class loading, which could affect module boundaries in highly dynamic systems.

Future research directions include incorporating lightweight static dependency graphs to complement name semantics, adapting the approach for other JVM languages (Kotlin, Scala), and exploring incremental recovery for continuously evolving codebases.

Authors

Yirui He
Yuqi Hu
Xingyu Chen
Joshua Garcia

Paper Information

arXiv ID: 2512.15980v1
Categories: cs.SE, cs.AI
Published: December 17, 2025
