[Paper] Embedding Software Intent: Lightweight Java Module Recovery

Published: December 17, 2025 at 04:24 PM EST

Source: arXiv - 2512.15980v1

Overview

The paper introduces ClassLAR, a lightweight technique that automatically recovers Java Platform Module System (JPMS, introduced in Java 9) module definitions from large, monolithic codebases. By treating fully‑qualified class names as “software intent” and feeding them to a language model, the authors recover modules that align closely with a system’s ground‑truth architecture, and do so several times faster than existing recovery tools.

Key Contributions

  • Intent‑driven recovery: Uses semantic cues from package and class names (via a language model) to infer architectural intent without needing heavyweight static analysis.
  • Class‑and‑Language model (ClassLAR): A novel hybrid that combines simple name‑based features with contextual embeddings, achieving high fidelity to the ground‑truth module layout.
  • Performance gains: Demonstrates 3.99× to 10.5× faster execution than state‑of‑the‑art recovery approaches on 20 real‑world Java projects.
  • Empirical validation: Comprehensive evaluation on popular open‑source repositories showing superior architectural‑level similarity scores across multiple metrics.

Methodology

  1. Data collection – For each Java project, the tool extracts every fully‑qualified class name (e.g., org.apache.commons.io.FileUtils).
  2. Semantic embedding – A pre‑trained language model (e.g., BERT‑based) converts each name into a dense vector that captures lexical meaning (“File”, “Utils”, “io”).
  3. Clustering – Vectors are grouped using a lightweight clustering algorithm (e.g., hierarchical agglomerative clustering) that respects the hierarchical nature of package naming.
  4. Module inference – The resulting clusters are mapped to JPMS module descriptors (module-info.java), producing a set of modules that reflect both structural (package hierarchy) and functional (semantic similarity) intent.
  5. Evaluation – The recovered modules are compared against manually curated module layouts using architectural similarity metrics such as MoJoFM, NED, and package‑level cohesion.
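
Steps 1 to 3 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: a bag-of-tokens vector stands in for the language-model embedding, average-linkage agglomerative clustering stands in for whatever clustering ClassLAR actually uses, and the class name NameClusteringSketch, the example class names, and the 0.5 similarity threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; not the ClassLAR implementation.
final class NameClusteringSketch {

    // Step 1 helper: split a fully-qualified name on dots and camel-case boundaries.
    static List<String> tokens(String fqcn) {
        List<String> out = new ArrayList<>();
        for (String part : fqcn.split("\\.")) {
            for (String t : part.split("(?<=[a-z0-9])(?=[A-Z])")) {
                if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Step 2 stand-in (assumption): token counts replace the language-model embedding.
    static Map<String, Integer> embed(String fqcn) {
        Map<String, Integer> vector = new HashMap<>();
        for (String t : tokens(fqcn)) vector.merge(t, 1, Integer::sum);
        return vector;
    }

    // Cosine similarity between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int x : b.values()) normB += x * x;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Mean pairwise similarity between the members of two clusters (average linkage).
    static double averageSimilarity(List<String> left, List<String> right) {
        double sum = 0;
        for (String l : left)
            for (String r : right) sum += cosine(embed(l), embed(r));
        return sum / (left.size() * right.size());
    }

    // Step 3: repeatedly merge the most similar pair of clusters while
    // their similarity stays strictly above the threshold.
    static List<List<String>> cluster(List<String> names, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String n : names) clusters.add(new ArrayList<>(List.of(n)));
        while (clusters.size() > 1) {
            int bestI = -1, bestJ = -1;
            double best = threshold;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = averageSimilarity(clusters.get(i), clusters.get(j));
                    if (s > best) { best = s; bestI = i; bestJ = j; }
                }
            }
            if (bestI < 0) break; // no pair is similar enough
            List<String> merged = clusters.remove(bestJ);
            clusters.get(bestI).addAll(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
            "org.example.io.FileReader", "org.example.io.FileWriter",
            "org.example.crypto.CipherUtil", "org.example.crypto.KeyStore");
        cluster(names, 0.5).forEach(System.out::println);
    }
}
```

On the four example names, the sketch groups the two io classes together and the two crypto classes together. Shared package tokens dominate the similarity here, which mirrors the summary's point that package hierarchy and lexical meaning both contribute; a real embedding model would also relate tokens that are not literally identical.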

The entire pipeline runs in a few seconds for medium‑size projects, because it avoids expensive byte‑code analysis and relies only on lightweight text processing and vector operations.
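
Step 4 of the methodology, turning a cluster into a module descriptor, might look like the following sketch. The class ModuleDescriptorSketch, the module name, and the choice to export every package are assumptions for illustration; the summary does not describe how ClassLAR names modules or decides which packages to export. The sketch also omits requires directives, since it works from class names alone.

```java
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch only; not the ClassLAR implementation.
final class ModuleDescriptorSketch {

    // Renders module-info.java text that exports every package seen in the cluster.
    // Exporting everything is an assumption; a real migration would narrow this.
    static String render(String moduleName, Collection<String> classNames) {
        SortedSet<String> packages = new TreeSet<>();
        for (String fqcn : classNames) {
            int lastDot = fqcn.lastIndexOf('.');
            // Classes in the unnamed package are skipped.
            if (lastDot > 0) packages.add(fqcn.substring(0, lastDot));
        }
        StringBuilder sb = new StringBuilder("module ").append(moduleName).append(" {\n");
        for (String p : packages) sb.append("    exports ").append(p).append(";\n");
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(render("org.example.io",
            List.of("org.example.io.FileReader", "org.example.io.FileWriter")));
    }
}
```

One design constraint worth keeping in mind: JPMS rejects the same package appearing in two modules on the module path, so clusters that divide a single package would need to be reconciled before the generated descriptors could be used.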

Results & Findings

| Metric                         | ClassLAR | Best Prior Art |
|--------------------------------|----------|----------------|
| MoJoFM (higher = more similar) | 85.2 %   | 71.4 %         |
| NED (lower = less error)       | 0.12     | 0.27           |
| Runtime (seconds)              | 12 (avg) | 48 to 126      |

  • Higher architectural similarity: ClassLAR consistently produced module groupings that matched the developers’ intended architecture more closely than static‑dependency‑based recoveries.
  • Speed: Because it only parses class names, the approach scales linearly and stays well under a minute even for projects with >10 k classes.
  • Robustness: The language model captured nuanced intent (e.g., “crypto” vs. “security”) that pure package‑structure heuristics missed, leading to better functional cohesion within modules.
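
For readers unfamiliar with the first metric, MoJoFM is commonly defined as shown below. This is the standard formulation from the architecture-recovery literature; the summary does not state whether the paper uses a variant. Here A is the recovered decomposition, B is the ground-truth decomposition, mno(A, B) is the minimum number of Move and Join operations needed to transform A into B, and the maximum in the denominator ranges over all decompositions A' of the same set of classes.

```latex
\mathrm{MoJoFM}(A, B) = \left(1 - \frac{\mathrm{mno}(A, B)}{\max_{A'} \mathrm{mno}(A', B)}\right) \times 100\%
```

A score of 100 % means the recovered decomposition already matches the ground truth, since no Move or Join operations are needed.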

Practical Implications

  • Rapid JPMS migration: Teams can bootstrap a module-info.java for legacy monoliths without manual refactoring, cutting migration time from weeks to hours.
  • Continuous architecture monitoring: Integrate ClassLAR into CI pipelines to detect drift between code and declared modules, alerting developers before encapsulation bugs appear.
  • Tooling ecosystem: The lightweight nature makes it easy to embed in IDE plugins, build tools (Maven/Gradle), or code‑review bots that suggest module boundaries on the fly.
  • Improved maintainability: By aligning the codebase with explicit modules, developers gain stronger encapsulation guarantees, clearer dependency graphs, and better support for Java’s service‑loader mechanism.
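
The CI drift check mentioned above could be approximated as follows. This is a hypothetical sketch, not a feature described in the paper: the class ModuleDriftSketch, the majority-vote rule, and the example module names are all assumptions. It takes each class's declared module and the recovered clusters, then flags classes whose declared module disagrees with the majority of their cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; the majority-vote rule is an assumption, not part of ClassLAR.
final class ModuleDriftSketch {

    // Flags classes whose declared module differs from the module that most of
    // their recovered cluster belongs to. Ties go to the alphabetically first module.
    static List<String> drifted(Map<String, String> declaredModuleOf,
                                List<List<String>> recoveredClusters) {
        List<String> flagged = new ArrayList<>();
        for (List<String> cluster : recoveredClusters) {
            Map<String, Integer> votes = new TreeMap<>();
            for (String c : cluster) {
                votes.merge(declaredModuleOf.getOrDefault(c, "<unassigned>"), 1, Integer::sum);
            }
            String majority = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : votes.entrySet()) {
                if (e.getValue() > bestCount) { bestCount = e.getValue(); majority = e.getKey(); }
            }
            for (String c : cluster) {
                if (!declaredModuleOf.getOrDefault(c, "<unassigned>").equals(majority)) flagged.add(c);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, String> declared = Map.of(
            "org.example.io.FileReader", "app.io",
            "org.example.io.FileWriter", "app.io",
            "org.example.hash.FileHasher", "app.crypto");
        List<List<String>> recovered = List.of(List.of(
            "org.example.io.FileReader", "org.example.io.FileWriter", "org.example.hash.FileHasher"));
        System.out.println(drifted(declared, recovered)); // prints [org.example.hash.FileHasher]
    }
}
```

A flagged class is only a prompt for review: the recovered cluster may be wrong, or the declared boundary may be deliberate.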

Limitations & Future Work

  • Name quality dependency: Projects with poorly chosen package/class names (e.g., generic util packages) can degrade embedding quality, leading to less accurate modules.
  • Language model scope: The current model is trained on general Java corpora; domain‑specific vocabularies (e.g., scientific computing) may require fine‑tuning.
  • Dynamic behavior ignored: ClassLAR does not analyze runtime reflection or dynamic class loading, which could affect module boundaries in highly dynamic systems.

Future research directions include incorporating lightweight static dependency graphs to complement name semantics, adapting the approach for other JVM languages (Kotlin, Scala), and exploring incremental recovery for continuously evolving codebases.

Authors

  • Yirui He
  • Yuqi Hu
  • Xingyu Chen
  • Joshua Garcia

Paper Information

  • arXiv ID: 2512.15980v1
  • Categories: cs.SE, cs.AI
  • Published: December 17, 2025