[Paper] A Story About Cohesion and Separation: Label-Free Metric for Log Parser Evaluation
Source: arXiv - 2512.21811v1
Overview
Log parsing is the backbone of automated log analytics, turning raw, free‑form log strings into structured event templates that machines can reason about. The new paper introduces PMSS (Parser Medoid Silhouette Score), a label‑free metric that lets engineers evaluate and compare parsers without needing hand‑crafted ground‑truth templates—a common bottleneck in production environments.
Key Contributions
- Label‑free evaluation: PMSS measures parser quality without any pre‑labeled data, sidestepping the costly and error‑prone annotation process.
- Template‑level focus: Unlike token‑level metrics, PMSS assesses the cohesion of each parser's template set (how similar its templates are to one another) and its separation (how distinct they are from other parsers' templates).
- Near‑linear runtime: The metric leverages medoid silhouette analysis and Levenshtein distance, achieving practically linear time complexity even on large log corpora.
- Empirical validation: Experiments on the corrected Loghub 2.0 dataset show strong correlation (Spearman ρ ≈ 0.6) between PMSS and the established label‑based metrics FGA and FTA.
- Guidelines for practitioners: The authors provide concrete steps for using PMSS in parser selection pipelines and discuss how to interpret its scores alongside traditional metrics.
Methodology
- Parser clustering: Each log parser’s output (the set of extracted templates) is treated as a cluster.
- Medoid identification: For each cluster, the medoid—the template with the smallest average Levenshtein distance to all other templates in the same cluster—is selected.
- Silhouette computation:
- Cohesion (a): the Levenshtein distance between a template and its own cluster's medoid.
- Separation (b): the minimum Levenshtein distance between the template and any other parser's medoid.
- The silhouette score for a template is (b − a) / max(a, b), which lies in [−1, 1].
- PMSS aggregation: The final PMSS is the mean silhouette score across all templates from all parsers. A higher PMSS indicates that parsers produce internally consistent templates that are well‑separated from each other.
- Complexity: Because the silhouette is computed against medoids rather than via all pairwise template comparisons, the number of Levenshtein computations grows roughly linearly in the number of templates, making the approach scalable to millions of log lines.
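The steps above can be sketched end to end in Python. This is a minimal illustrative version, not the authors' implementation: function names are made up, the medoid is found by brute force (the paper relies on a faster medoid-silhouette procedure), and a plain dynamic-programming Levenshtein is used.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(s) < len(t):
        s, t = t, s
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def medoid(templates):
    """Template with the smallest total distance to the rest of its cluster.

    Brute force (quadratic in cluster size); illustrative only.
    """
    return min(templates,
               key=lambda t: sum(levenshtein(t, u) for u in templates))

def pmss(parser_outputs):
    """Mean medoid-silhouette score over all templates of all parsers.

    parser_outputs: one list of extracted templates per parser.
    Assumes at least two parsers, so every template has "other" medoids.
    """
    medoids = [medoid(ts) for ts in parser_outputs]
    scores = []
    for k, templates in enumerate(parser_outputs):
        others = [m for i, m in enumerate(medoids) if i != k]
        for t in templates:
            a = levenshtein(t, medoids[k])               # cohesion
            b = min(levenshtein(t, m) for m in others)   # separation
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Feeding in two parsers whose templates are internally similar but mutually distinct (e.g. connection templates vs. error templates) yields a score near 1; mixing near-identical template sets across parsers pushes it toward 0 or below.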
Results & Findings
| Parser (selected) | PMSS | FGA (label‑based) | FTA (label‑based) |
|---|---|---|---|
| Parser A (best PMSS) | 0.73 | 0.81 | 0.68 |
| Parser B (best FGA) | 0.71 | 0.83 | 0.70 |
| … | … | … | … |
- Correlation: PMSS correlates with FGA (ρ = 0.648) and FTA (ρ = 0.587), comparable to the correlation between FGA and FTA themselves (ρ = 0.670).
- Performance gap: The parser ranked first by PMSS scores within 2.1% of the best FGA score and within 9.8% of the best FTA score, indicating that PMSS can reliably surface the same high‑quality parsers.
- Statistical significance: The positive relationship between PMSS and the label‑based metrics is highly significant (p < 1e‑8).
Practical Implications
- Zero‑label deployment: Teams can now benchmark new or custom parsers on production logs where ground truth is unavailable, accelerating the evaluation loop.
- Robust parser selection: By focusing on template cohesion and separation, PMSS helps avoid “over‑fitting” to a particular labeled dataset, leading to parsers that generalize better across environments.
- Continuous monitoring: PMSS can be integrated into CI/CD pipelines to automatically flag regressions in parser quality after code changes or configuration tweaks.
- Cost savings: Eliminating the need for manual labeling reduces labor costs and mitigates the risk of inconsistent ground‑truth versions that have plagued prior studies.
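For the continuous-monitoring use case, a quality gate can be as simple as comparing the current PMSS against a recorded baseline. The sketch below is a hypothetical policy, not from the paper: the baseline value and tolerance are assumptions, and `pmss_gate` is an illustrative name.

```python
BASELINE_PMSS = 0.73   # assumed score recorded for the deployed parser
TOLERANCE = 0.05       # assumed allowed drop before the pipeline fails

def pmss_gate(current_score: float,
              baseline: float = BASELINE_PMSS,
              tolerance: float = TOLERANCE) -> bool:
    """Pass if the new parser's PMSS has not regressed beyond tolerance."""
    return current_score >= baseline - tolerance

# A small drop (0.73 -> 0.71) passes; a large one (0.73 -> 0.60) fails.
```

Because PMSS needs no labels, this check can run on fresh production logs at every deploy rather than on a frozen annotated benchmark.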
Limitations & Future Work
- Dependence on Levenshtein distance: While fast, Levenshtein may not capture semantic similarity for highly variable templates (e.g., timestamps, IDs).
- Assumes parsers produce comparable template sets: If a parser is extremely aggressive (producing many tiny templates) or overly conservative (few generic templates), silhouette scores can be biased.
- Scalability edge cases: Extremely large template vocabularies (tens of millions) may still challenge the near‑linear claim; optimized approximate distance measures could help.
- Future directions: The authors plan to explore alternative string similarity metrics, extend PMSS to multi‑modal logs (e.g., JSON + plain text), and validate the metric on real‑world incident‑response datasets.
Authors
- Qiaolin Qin
- Jianchen Zhao
- Heng Li
- Weiyi Shang
- Ettore Merlo
Paper Information
- arXiv ID: 2512.21811v1
- Categories: cs.SE
- Published: December 26, 2025