[Paper] A Story About Cohesion and Separation: Label-Free Metric for Log Parser Evaluation
Source: arXiv - 2512.21811v1
Overview
Log parsing is the backbone of automated log analytics, turning raw, free‑form log strings into structured event templates that machines can reason about. The new paper introduces PMSS (Parser Medoid Silhouette Score), a label‑free metric that lets engineers evaluate and compare parsers without needing hand‑crafted ground‑truth templates—a common bottleneck in production environments.
Key Contributions
- Label‑free evaluation: PMSS measures parser quality without any pre‑labeled data, sidestepping the costly and error‑prone annotation process.
- Template‑level focus: Unlike token‑level metrics, PMSS assesses the cohesion of each parser's template set (how similar its templates are to one another) and its separation (how distinct they are from other parsers' templates).
- Near‑linear runtime: The metric leverages medoid silhouette analysis and Levenshtein distance, achieving practically linear time complexity even on large log corpora.
- Empirical validation: Experiments on the corrected Loghub 2.0 dataset show strong correlation (Spearman ρ ≈ 0.6) between PMSS and the established label‑based metrics FGA and FTA.
- Guidelines for practitioners: The authors provide concrete steps for using PMSS in parser selection pipelines and discuss how to interpret its scores alongside traditional metrics.
Methodology
- Parser clustering: Each log parser’s output (the set of extracted templates) is treated as a cluster.
- Medoid identification: For each cluster, the medoid—the template with the smallest average Levenshtein distance to all other templates in the same cluster—is selected.
- Silhouette computation:
- Cohesion (a): the Levenshtein distance between a template and its own cluster's medoid.
- Separation (b): the minimum Levenshtein distance between the template and any other parser's medoid.
- The silhouette score for a template is (b − a) / max(a, b), which lies in [−1, 1].
- PMSS aggregation: The final PMSS is the mean silhouette score across all templates from all parsers. A higher PMSS indicates that parsers produce internally consistent templates that are well‑separated from each other.
- Complexity: Because the silhouette is computed against medoids rather than via all pairwise template comparisons, the number of Levenshtein computations grows roughly linearly in the number of templates, making the approach scalable to millions of log lines.
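The steps above can be sketched end to end in Python. This is a minimal illustrative version, not the authors' implementation: function names are made up, the medoid is found by brute force (the paper relies on a faster medoid-silhouette procedure), and a plain dynamic-programming Levenshtein is used.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(s) < len(t):
        s, t = t, s
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def medoid(templates):
    """Template with the smallest total distance to the rest of its cluster.

    Brute force (quadratic in cluster size); illustrative only.
    """
    return min(templates,
               key=lambda t: sum(levenshtein(t, u) for u in templates))

def pmss(parser_outputs):
    """Mean medoid-silhouette score over all templates of all parsers.

    parser_outputs: one list of extracted templates per parser.
    Assumes at least two parsers, so every template has "other" medoids.
    """
    medoids = [medoid(ts) for ts in parser_outputs]
    scores = []
    for k, templates in enumerate(parser_outputs):
        others = [m for i, m in enumerate(medoids) if i != k]
        for t in templates:
            a = levenshtein(t, medoids[k])               # cohesion
            b = min(levenshtein(t, m) for m in others)   # separation
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Feeding in two parsers whose templates are internally similar but mutually distinct (e.g. connection templates vs. error templates) yields a score near 1; mixing near-identical template sets across parsers pushes it toward 0 or below.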
Results & Findings
| Parser (selected) | PMSS | FGA (label‑based) | FTA (label‑based) |
|---|---|---|---|
| Parser A (best PMSS) | 0.73 | 0.81 | 0.68 |
| Parser B (best FGA) | 0.71 | 0.83 | 0.70 |
| … | … | … | … |
- Correlation: PMSS correlates with FGA (ρ = 0.648) and FTA (ρ = 0.587), comparable to the correlation between FGA and FTA themselves (ρ = 0.670).
- Performance gap: The parser ranked first by PMSS scores within 2.1% of the best FGA score and within 9.8% of the best FTA score, indicating that PMSS can reliably surface the same high‑quality parsers.
- Statistical significance: The positive relationship between PMSS and the label‑based metrics is highly significant (p < 1e‑8).
Practical Implications
- Zero‑label deployment: Teams can now benchmark new or custom parsers on production logs where ground truth is unavailable, accelerating the evaluation loop.
- Robust parser selection: By focusing on template cohesion and separation, PMSS helps avoid “over‑fitting” to a particular labeled dataset, leading to parsers that generalize better across environments.
- Continuous monitoring: PMSS can be integrated into CI/CD pipelines to automatically flag regressions in parser quality after code changes or configuration tweaks.
- Cost savings: Eliminating the need for manual labeling reduces labor costs and mitigates the risk of inconsistent ground‑truth versions that have plagued prior studies.
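For the continuous-monitoring use case, a quality gate can be as simple as comparing the current PMSS against a recorded baseline. The sketch below is a hypothetical policy, not from the paper: the baseline value and tolerance are assumptions, and `pmss_gate` is an illustrative name.

```python
BASELINE_PMSS = 0.73   # assumed score recorded for the deployed parser
TOLERANCE = 0.05       # assumed allowed drop before the pipeline fails

def pmss_gate(current_score: float,
              baseline: float = BASELINE_PMSS,
              tolerance: float = TOLERANCE) -> bool:
    """Pass if the new parser's PMSS has not regressed beyond tolerance."""
    return current_score >= baseline - tolerance

# A small drop (0.73 -> 0.71) passes; a large one (0.73 -> 0.60) fails.
```

Because PMSS needs no labels, this check can run on fresh production logs at every deploy rather than on a frozen annotated benchmark.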
Limitations & Future Work
- Dependence on Levenshtein distance: While fast, Levenshtein may not capture semantic similarity for highly variable templates (e.g., timestamps, IDs).
- Assumes parsers produce comparable template sets: If a parser is extremely aggressive (producing many tiny templates) or overly conservative (few generic templates), silhouette scores can be biased.
- Scalability edge cases: Extremely large template vocabularies (tens of millions) may still challenge the near‑linear claim; optimized approximate distance measures could help.
- Future directions: The authors plan to explore alternative string similarity metrics, extend PMSS to multi‑modal logs (e.g., JSON + plain text), and validate the metric on real‑world incident‑response datasets.
Authors
- Qiaolin Qin
- Jianchen Zhao
- Heng Li
- Weiyi Shang
- Ettore Merlo
Paper Information
- arXiv ID: 2512.21811v1
- Categories: cs.SE
- Published: December 26, 2025