[Paper] From Monolith to Microservices: A Comparative Evaluation of Decomposition Frameworks
Source: arXiv - 2601.23141v1
Overview
The paper “From Monolith to Microservices: A Comparative Evaluation of Decomposition Frameworks” tackles one of the most painful steps in modernizing legacy systems: automatically carving a monolithic codebase into well‑defined microservices. By rigorously benchmarking a wide range of static, dynamic, and hybrid decomposition tools on common open‑source applications, the authors provide the first head‑to‑head comparison that developers can actually trust when choosing a migration strategy.
Key Contributions
- Unified evaluation pipeline – a reproducible metric‑computation framework that normalizes results across disparate studies.
- Comprehensive benchmark suite – four widely‑used reference applications (JPetStore, AcmeAir, DayTrader, Plants) covering different domains and code complexities.
- Multi‑dimensional quality metrics – Structural Modularity (SM), Interface Number (IFN), Inter‑partition Communication (ICP), Non‑Extreme Distribution (NED), plus derived indicators for balance and coupling.
- Empirical ranking of state‑of‑the‑art techniques – static analysis, runtime tracing, and hybrid approaches are all evaluated side‑by‑side.
- Practical recommendation – hierarchical clustering, especially the HDBSCAN algorithm, consistently yields the most balanced service partitions.
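To make two of the metrics concrete, here is a minimal sketch, assuming simplified definitions that are illustrative rather than the paper's exact formulas: ICP as the fraction of call edges crossing service boundaries, and IFN as the average number of classes per service that are invoked from outside it. The class names and call graph are invented toy data.

```python
from collections import defaultdict

def icp(calls, partition):
    """Fraction of call edges that cross partition boundaries (lower = better)."""
    if not calls:
        return 0.0
    cross = sum(1 for src, dst in calls if partition[src] != partition[dst])
    return cross / len(calls)

def ifn(calls, partition):
    """Average number of classes per service that are called from another service."""
    interfaces = defaultdict(set)  # service -> classes exposed to other services
    for src, dst in calls:
        if partition[src] != partition[dst]:
            interfaces[partition[dst]].add(dst)
    services = set(partition.values())
    return sum(len(interfaces[s]) for s in services) / len(services)

# Toy monolith: five classes, candidate split into two services (0 and 1)
partition = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1}
calls = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
print(icp(calls, partition))  # 0.4 — 2 of 5 calls cross the boundary
print(ifn(calls, partition))  # 0.5 — one interface class over two services
```

Even on this toy graph the trade-off is visible: merging C into service 0 would lower ICP for these calls but change the size balance that NED rewards.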
Methodology
1. Tool selection – The authors gathered all publicly available microservice decomposition frameworks, which fall into three categories:
   - Static: rely solely on source‑code structure (e.g., dependency graphs).
   - Dynamic: use runtime traces (e.g., method call logs).
   - Hybrid: combine static and dynamic information.
2. Benchmark preparation – Each of the four applications was containerized and instrumented to collect the necessary static and dynamic artefacts.
3. Metric pipeline – A custom script ingests the raw output of each framework and computes the five core quality metrics, eliminating the “apples‑to‑oranges” problem that has plagued prior comparisons.
4. Reproduction & augmentation – Where prior papers reported results, the tools were re‑run from their original replication packages to verify the published numbers and fill gaps.
5. Statistical analysis – Pairwise comparisons and effect‑size calculations identify which techniques are significantly better on each metric.
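The normalization step of such a pipeline might look like the sketch below. The input formats, the canonical `{class: service}` mapping, and the NED size band are assumptions for illustration, not the paper's actual replication package.

```python
from collections import Counter

def normalize(raw):
    """Accept either {service: [classes]} or {class: service} output from a
    decomposition tool and return a canonical {class: service} mapping."""
    if raw and isinstance(next(iter(raw.values())), list):
        return {cls: svc for svc, classes in raw.items() for cls in classes}
    return dict(raw)

def ned(partition, lo=5, hi=20):
    """Share of classes living in 'extreme' (too small or too large) services.
    The [lo, hi] size band is a placeholder convention."""
    sizes = Counter(partition.values())
    extreme = sum(n for n in sizes.values() if not lo <= n <= hi)
    return extreme / len(partition)

# Two tools emitting different raw formats for the same 12-class app
tool_outputs = {
    "toolA": {"s1": [f"C{i}" for i in range(6)],
              "s2": [f"C{i}" for i in range(6, 12)]},
    "toolB": {f"C{i}": (0 if i < 2 else 1) for i in range(12)},
}
for name, raw in tool_outputs.items():
    print(name, round(ned(normalize(raw)), 2))
# toolA 0.0  — two services of 6 classes each, none extreme
# toolB 0.17 — a 2-class "tiny" service pushes 2 of 12 classes into the extreme band
```

Running every tool's output through the same `normalize` step is what makes the downstream metric values directly comparable.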
Results & Findings
| Technique | SM (higher = better) | IFN (lower = better) | ICP (lower = better) | NED (closer to 0.5 = balanced) |
|---|---|---|---|---|
| HDBSCAN (hierarchical clustering) | ★★★★★ | ★★★★☆ | ★★★★☆ | ★★★★★ |
| Other hierarchical methods (e.g., Agglomerative) | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ |
| Pure static graph‑based | ★★★☆☆ | ★★☆☆☆ | ★★☆☆☆ | ★★☆☆☆ |
| Pure dynamic trace‑based | ★★★☆☆ | ★★☆☆☆ | ★★☆☆☆ | ★★☆☆☆ |
| Hybrid (simple fusion) | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★☆ |
- Balanced partitions: HDBSCAN consistently produced service groups with similar sizes, avoiding “tiny” or “monster” services.
- Modularity vs. communication trade‑off: While static‑only methods achieved decent modularity, they suffered from high inter‑service call volume (ICP).
- Interface overhead: Hierarchical clustering kept the number of exposed interfaces low, simplifying API contracts.
In short, the data show that hierarchical clustering—particularly density‑based HDBSCAN—delivers the best overall trade‑off across the evaluated benchmarks.
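To illustrate the hierarchical family the paper favors: HDBSCAN itself needs an external library, so this sketch uses plain single‑linkage agglomerative clustering over an assumed class‑to‑class distance matrix (toy classes and distances, not the paper's benchmarks).

```python
import itertools

def single_linkage(dist, k):
    """Repeatedly merge the two closest clusters until k clusters remain.
    `dist` maps sorted (class, class) tuples to a dissimilarity score."""
    clusters = [{c} for c in {x for pair in dist for x in pair}]

    def d(a, b):  # single linkage: distance of the closest pair across clusters
        return min(dist[tuple(sorted((x, y)))] for x in a for y in b)

    while len(clusters) > k:
        a, b = min(itertools.combinations(clusters, 2), key=lambda p: d(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters

# Toy distances: Auth/User and Cart/Order are tightly coupled; Report stands apart
dist = {
    ("Auth", "Cart"): 0.9, ("Auth", "Order"): 0.8, ("Auth", "Report"): 0.7,
    ("Auth", "User"): 0.1, ("Cart", "Order"): 0.2, ("Cart", "Report"): 0.8,
    ("Cart", "User"): 0.9, ("Order", "Report"): 0.9, ("Order", "User"): 0.8,
    ("Report", "User"): 0.7,
}
print(sorted(sorted(c) for c in single_linkage(dist, 3)))
# [['Auth', 'User'], ['Cart', 'Order'], ['Report']]
```

In practice the distance matrix would be derived from the static or dynamic artefacts (e.g., inverse co-call frequency), and HDBSCAN would additionally infer the number of services rather than taking `k` as input.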
Practical Implications
- Tool selection: Teams planning a monolith‑to‑microservice migration can prioritize frameworks that implement HDBSCAN or similar density‑based clustering, expecting fewer cross‑service calls and cleaner APIs.
- Cost estimation: Lower ICP and IFN translate directly into reduced network latency, fewer integration tests, and simpler DevOps pipelines.
- Incremental migration: Because HDBScan yields balanced service sizes, developers can adopt a phased rollout (e.g., “one service per sprint”) without hitting bottlenecks caused by oversized services.
- Automation confidence: The unified metric pipeline can be repurposed as an internal quality gate—run after each decomposition iteration to verify that modularity and communication metrics stay within target thresholds.
- Vendor evaluation: When evaluating commercial microservice extraction platforms, ask for evidence of hierarchical clustering under the hood; the paper provides a concrete benchmark to compare against.
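The quality-gate idea above can be sketched as a small CI check. The metric names and threshold values here are placeholders for a team's own targets, not numbers from the paper.

```python
# Hypothetical per-iteration gate: fail the build if a decomposition
# regresses past agreed metric thresholds (all values are placeholders).
THRESHOLDS = {"icp_max": 0.30, "ifn_max": 3.0, "ned_max": 0.20}

def quality_gate(metrics, thresholds=THRESHOLDS):
    """Return (passed, violations) for one decomposition iteration."""
    violations = []
    for key, limit in thresholds.items():
        name = key.rsplit("_", 1)[0]  # "icp_max" -> "icp"
        if metrics[name] > limit:
            violations.append(f"{name} = {metrics[name]:.2f} exceeds {limit:.2f}")
    return (not violations, violations)

ok, why = quality_gate({"icp": 0.25, "ifn": 4.5, "ned": 0.10})
print(ok)   # False
print(why)  # ['ifn = 4.50 exceeds 3.00']
```

Wired after the metric pipeline, such a gate turns the paper's evaluation criteria into a regression guard for each extraction iteration.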
Limitations & Future Work
- Benchmark scope: Only four open‑source applications were used; industrial codebases with millions of lines and heterogeneous tech stacks may exhibit different behavior.
- Metric completeness: The chosen metrics capture structural quality but not runtime performance (e.g., latency under load) or operational concerns like data consistency.
- Tool ecosystem: Some newer decomposition frameworks lacked publicly available replication packages, so they were omitted.
- Future directions: Extending the benchmark to include large‑scale enterprise systems, adding performance‑centric metrics (e.g., request latency, scaling cost), and exploring AI‑driven hybrid approaches are natural next steps.
Authors
- Mineth Weerasinghe
- Himindu Kularathne
- Methmini Madhushika
- Danuka Lakshan
- Nisansa de Silva
- Adeesha Wijayasiri
- Srinath Perera
Paper Information
- arXiv ID: 2601.23141v1
- Categories: cs.SE
- Published: January 30, 2026