[Paper] Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification
Source: arXiv - 2601.07790v1
Overview
System logs are the nervous system of modern infrastructure, but their sheer volume makes manual analysis impossible. This paper treats log‑severity classification not as a final product but as a benchmark to gauge how well small language models (SLMs) and small reasoning language models (SRLMs) actually understand runtime logs. Using real‑world journalctl data from production Linux servers, the authors compare nine compact models across zero‑shot, few‑shot, and retrieval‑augmented generation (RAG) prompting, revealing which architectures are truly ready for on‑device or edge deployment in digital‑twin (DT) and root‑cause‑analysis pipelines.
Key Contributions
- Benchmark Design: Introduces a realistic severity‑classification benchmark that isolates log comprehension ability rather than relying on metadata alone.
- Comprehensive Evaluation: Tests nine SLMs/SRLMs under three prompting regimes (zero‑shot, few‑shot, RAG) on a production‑grade log dataset.
- Performance Stratification: Shows that retrieval‑augmented prompting can dramatically boost tiny models (e.g., Qwen3‑0.6B reaches 88 % accuracy) while some SRLMs actually degrade with RAG.
- Efficiency Profiling: Provides per‑log inference latency measurements, highlighting models that meet real‑time constraints (< 1.2 s) versus those that do not (e.g., Phi‑4‑Mini‑Reasoning > 200 s).
- Design Insights: Links three factors—model architecture, training objective, and retrieval integration—to observed accuracy and speed, offering a roadmap for building deployable log‑analysis models.
Methodology
- Dataset: Collected ~200 k log entries from journalctl on live Linux servers, each labeled with its native severity level (e.g., INFO, WARN, ERROR).
- Models: Selected nine open‑source models ranging from 0.6 B to 4 B parameters, including Gemma, Llama, Qwen, DeepSeek, and Phi variants. Both “plain” SLMs and “reasoning‑enhanced” SRLMs were evaluated.
- Prompting Strategies:
- Zero‑shot: Model receives only the raw log line and a request to output the severity.
- Few‑shot: A handful of example log‑severity pairs are appended to the prompt.
- RAG: An external vector store of log embeddings is queried; the top‑k similar logs and their severities are injected into the prompt, giving the model additional context (a minimal sketch of this pipeline follows this list).
- Metrics: Accuracy (primary), inference latency (seconds per log), and memory footprint. Experiments were run on a single A100 GPU to keep hardware conditions consistent.
- Analysis: Compared accuracy gains across prompting regimes and correlated them with latency to assess real‑time suitability.
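The paper does not publish its prompting code, so the following is a minimal sketch of the zero‑shot and RAG regimes. It assumes a sentence‑transformers embedder, a FAISS inner‑product index, and a placeholder severity label set and prompt wording; none of these specifics come from the paper.

```python
# Minimal sketch of the zero-shot and RAG prompting regimes; the embedding
# model, index library, prompt wording, and label set are assumptions rather
# than the paper's exact setup.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

SEVERITIES = ["DEBUG", "INFO", "WARN", "ERROR", "CRIT"]  # placeholder label set

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedder choice

def build_index(labeled_logs):
    """labeled_logs: list of (log_line, severity) pairs forming the retrieval store."""
    vectors = embedder.encode([line for line, _ in labeled_logs],
                              normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # cosine similarity via inner product
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def zero_shot_prompt(log_line):
    return (f"Classify the severity of this log line as one of {SEVERITIES}.\n"
            f"Log: {log_line}\nSeverity:")

def rag_prompt(log_line, index, labeled_logs, k=3):
    """Inject the k most similar labeled logs into the prompt (RAG regime)."""
    query = embedder.encode([log_line], normalize_embeddings=True).astype("float32")
    _, ids = index.search(query, k)
    examples = "\n".join(f"Log: {labeled_logs[i][0]}\nSeverity: {labeled_logs[i][1]}"
                         for i in ids[0])
    return f"Similar labeled logs:\n{examples}\n\n{zero_shot_prompt(log_line)}"
```

The few‑shot regime is the same prompt with a fixed, hand‑picked set of example pairs in place of the retrieved neighbors.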
Results & Findings
| Model (Params) | Prompting | Accuracy | Avg. Latency (s) |
|---|---|---|---|
| Qwen3‑4B | RAG | 95.64 % | 1.08 |
| Gemma3‑1B | RAG | 85.28 % | 0.94 |
| Gemma3‑1B | Few‑shot | 20.25 % | 0.92 |
| Qwen3‑0.6B | RAG | 88.12 % | 0.87 |
| Qwen3‑0.6B | Zero‑shot | 45.03 % | 0.86 |
| Qwen3‑1.7B (SRLM) | RAG | 62.41 % | 1.15 |
| DeepSeek‑R1‑Distill‑Qwen‑1.5B (SRLM) | RAG | 58.77 % | 1.22 |
| Phi‑4‑Mini‑Reasoning | RAG | <10 % | 228.4 |
Takeaways
- RAG is a game‑changer for compact models: the 0.6 B Qwen jumps from ~45 % to >88 % accuracy.
- Reasoning‑oriented SRLMs don’t automatically benefit from retrieval; some even regress, suggesting a mismatch between their training objectives and the strict “single‑token” output format of severity labels.
- Latency matters: Most Gemma and Llama variants stay under 1.2 s per log, making them viable for real‑time DT pipelines, whereas Phi‑4‑Mini‑Reasoning is impractically slow.
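One plausible reason reasoning‑tuned models regress under the label‑only output format is that their answers arrive wrapped in explanatory text. The paper does not describe its answer‑parsing logic; the sketch below is an assumed post‑processing step that extracts the first valid label from free‑form output, including the <think> blocks some SRLMs emit.

```python
# Assumed post-processing step: the paper does not describe its answer-parsing
# logic, so this sketch simply extracts the first valid label from free-form
# model output, including <think> blocks emitted by some reasoning models.
import re

SEVERITIES = ["DEBUG", "INFO", "WARN", "ERROR", "CRIT"]  # placeholder label set

def extract_severity(raw_output: str) -> str | None:
    """Return the first valid severity mentioned in the output, or None."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL)
    for token in re.findall(r"[A-Za-z]+", cleaned.upper()):
        if token in SEVERITIES:
            return token
    return None
```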
Practical Implications
- Edge/On‑Device Monitoring: Tiny models like Qwen3‑0.6B can be shipped to low‑power appliances (e.g., IoT gateways) and still reach roughly 88 % severity‑classification accuracy when paired with a lightweight retrieval index.
- Digital Twin Integration: Real‑time severity classification can feed DT simulations with accurate failure signals, enabling proactive RCA and automated remediation.
- Cost‑Effective Ops: Organizations can replace heavyweight LLM APIs with open‑source SLMs, cutting cloud inference bills while maintaining >90 % classification quality.
- Tooling Blueprint: The RAG pipeline (vector store + prompt injection) demonstrated here can be repurposed for other log‑analysis tasks—anomaly detection, log summarization, or root‑cause suggestion—without retraining the base model.
- Model Selection Guidance: When choosing a model for log‑centric workloads, prioritize (1) small parameter count with strong retrieval support, (2) fast inference (< 1 s), and (3) a training objective aligned with constrained output formats.
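For the latency criterion above, a per‑log wall‑clock measurement is straightforward to reproduce. The sketch below loads one of the evaluated model families via Hugging Face transformers; the repository id, decoding settings, and hardware are assumptions, not the paper's exact configuration.

```python
# Hypothetical per-log latency check for a candidate SLM; the Hugging Face
# repository id, decoding settings, and hardware are assumptions, not the
# paper's exact configuration.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-0.6B"  # one of the model families evaluated in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def classify(prompt: str) -> tuple[str, float]:
    """Generate a short severity answer and return it with wall-clock latency (s)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    latency = time.perf_counter() - start
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip(), latency
```

Averaging `classify` over a held‑out sample of log lines yields a per‑log latency figure comparable to the ones reported in the results table.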
Limitations & Future Work
- Dataset Scope: The benchmark uses logs from a specific Linux distribution and workload; cross‑OS or cloud‑native log formats may exhibit different challenges.
- Strict Output Constraint: Severity labels are single tokens; extending to richer outputs (e.g., multi‑label tagging or natural‑language explanations) could change the relative performance of SRLMs.
- Retrieval Overhead Not Fully Accounted: Latency measurements exclude the time to query the vector store; in production, indexing and retrieval costs could affect end‑to‑end latency.
- Model Diversity: Only nine models were evaluated; newer open‑source SLMs (e.g., Mistral‑7B, LLaMA‑3) could shift the performance landscape.
- Future Directions: The authors suggest expanding the benchmark to multi‑modal logs (e.g., combining syslog with metrics), exploring fine‑tuning on domain‑specific log corpora, and developing adaptive retrieval strategies that balance relevance with latency.
Authors
- Yahya Masri
- Emily Ma
- Zifu Wang
- Joseph Rogers
- Chaowei Yang
Paper Information
- arXiv ID: 2601.07790v1
- Categories: cs.AI
- Published: January 12, 2026