[Paper] Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification
Source: arXiv - 2601.07790v1
Overview
System logs are the nervous system of modern infrastructure, but their sheer volume makes manual analysis impossible. This paper treats log‑severity classification not as a final product but as a benchmark to gauge how well small language models (SLMs) and small reasoning language models (SRLMs) actually understand runtime logs. Using real‑world journalctl data from production Linux servers, the authors compare nine compact models across zero‑shot, few‑shot, and retrieval‑augmented generation (RAG) prompting, revealing which architectures are truly ready for on‑device or edge deployment in digital‑twin (DT) and root‑cause‑analysis pipelines.
Key Contributions
- Benchmark Design: Introduces a realistic severity‑classification benchmark that isolates log comprehension ability rather than relying on metadata alone.
- Comprehensive Evaluation: Tests nine SLMs/SRLMs under three prompting regimes (zero‑shot, few‑shot, RAG) on a production‑grade log dataset.
- Performance Stratification: Shows that retrieval‑augmented prompting can dramatically boost tiny models (e.g., Qwen3‑0.6B reaches 88 % accuracy) while some SRLMs actually degrade with RAG.
- Efficiency Profiling: Provides per‑log inference latency measurements, highlighting models that meet real‑time constraints (< 1.2 s) versus those that do not (e.g., Phi‑4‑Mini‑Reasoning > 200 s).
- Design Insights: Links three factors—model architecture, training objective, and retrieval integration—to observed accuracy and speed, offering a roadmap for building deployable log‑analysis models.
Methodology
- Dataset: Collected ~200 k log entries from journalctl on live Linux servers, each labeled with its native severity level (e.g., INFO, WARN, ERROR).
- Models: Selected nine open‑source models ranging from 0.6 B to 4 B parameters, including Gemma, Llama, Qwen, DeepSeek, and Phi variants. Both “plain” SLMs and “reasoning‑enhanced” SRLMs were evaluated.
- Prompting Strategies:
- Zero‑shot: Model receives only the raw log line and a request to output the severity.
- Few‑shot: A handful of example log‑severity pairs are appended to the prompt.
- RAG: An external vector store of log embeddings is queried; the top‑k similar logs and their severities are injected into the prompt, giving the model additional context (a minimal sketch of this pipeline follows this list).
- Metrics: Accuracy (primary), inference latency (seconds per log), and memory footprint. Experiments were run on a single A100 GPU to keep hardware conditions consistent.
- Analysis: Compared accuracy gains across prompting regimes and correlated them with latency to assess real‑time suitability.
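The paper does not publish its prompting code, so the following is a minimal sketch of the zero‑shot and RAG regimes. It assumes a sentence‑transformers embedder, a FAISS inner‑product index, and a placeholder severity label set and prompt wording; none of these specifics come from the paper.

```python
# Minimal sketch of the zero-shot and RAG prompting regimes; the embedding
# model, index library, prompt wording, and label set are assumptions rather
# than the paper's exact setup.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

SEVERITIES = ["DEBUG", "INFO", "WARN", "ERROR", "CRIT"]  # placeholder label set

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedder choice

def build_index(labeled_logs):
    """labeled_logs: list of (log_line, severity) pairs forming the retrieval store."""
    vectors = embedder.encode([line for line, _ in labeled_logs],
                              normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # cosine similarity via inner product
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def zero_shot_prompt(log_line):
    return (f"Classify the severity of this log line as one of {SEVERITIES}.\n"
            f"Log: {log_line}\nSeverity:")

def rag_prompt(log_line, index, labeled_logs, k=3):
    """Inject the k most similar labeled logs into the prompt (RAG regime)."""
    query = embedder.encode([log_line], normalize_embeddings=True).astype("float32")
    _, ids = index.search(query, k)
    examples = "\n".join(f"Log: {labeled_logs[i][0]}\nSeverity: {labeled_logs[i][1]}"
                         for i in ids[0])
    return f"Similar labeled logs:\n{examples}\n\n{zero_shot_prompt(log_line)}"
```

The few‑shot regime is the same prompt with a fixed, hand‑picked set of example pairs in place of the retrieved neighbors.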
Results & Findings
| Model (Params) | Prompting | Accuracy | Avg. Latency (s) |
|---|---|---|---|
| Qwen3‑4B | RAG | 95.64 % | 1.08 |
| Gemma3‑1B | RAG | 85.28 % | 0.94 |
| Gemma3‑1B | Few‑shot | 20.25 % | 0.92 |
| Qwen3‑0.6B | RAG | 88.12 % | 0.87 |
| Qwen3‑0.6B | Zero‑shot | 45.03 % | 0.86 |
| Qwen3‑1.7B (SRLM) | RAG | 62.41 % | 1.15 |
| DeepSeek‑R1‑Distill‑Qwen‑1.5B (SRLM) | RAG | 58.77 % | 1.22 |
| Phi‑4‑Mini‑Reasoning | RAG | <10 % | 228.4 |
Takeaways
- RAG is a game‑changer for compact models: the 0.6 B Qwen jumps from ~45 % to >88 % accuracy.
- Reasoning‑oriented SRLMs don’t automatically benefit from retrieval; some even regress, suggesting a mismatch between their training objectives and the strict “single‑token” output format of severity labels.
- Latency matters: Most Gemma and Llama variants stay under 1.2 s per log, making them viable for real‑time DT pipelines, whereas Phi‑4‑Mini‑Reasoning is impractically slow.
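One plausible reason reasoning‑tuned models regress under the label‑only output format is that their answers arrive wrapped in explanatory text. The paper does not describe its answer‑parsing logic; the sketch below is an assumed post‑processing step that extracts the first valid label from free‑form output, including the <think> blocks some SRLMs emit.

```python
# Assumed post-processing step: the paper does not describe its answer-parsing
# logic, so this sketch simply extracts the first valid label from free-form
# model output, including <think> blocks emitted by some reasoning models.
import re

SEVERITIES = ["DEBUG", "INFO", "WARN", "ERROR", "CRIT"]  # placeholder label set

def extract_severity(raw_output: str) -> str | None:
    """Return the first valid severity mentioned in the output, or None."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL)
    for token in re.findall(r"[A-Za-z]+", cleaned.upper()):
        if token in SEVERITIES:
            return token
    return None
```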
Practical Implications
- Edge/On‑Device Monitoring: Tiny models like Qwen3‑0.6B can be shipped to low‑power appliances (e.g., IoT gateways) and still reach roughly 88 % severity‑classification accuracy when paired with a lightweight retrieval index.
- Digital Twin Integration: Real‑time severity classification can feed DT simulations with accurate failure signals, enabling proactive RCA and automated remediation.
- Cost‑Effective Ops: Organizations can replace heavyweight LLM APIs with open‑source SLMs, cutting cloud inference bills while maintaining >90 % classification quality.
- Tooling Blueprint: The RAG pipeline (vector store + prompt injection) demonstrated here can be repurposed for other log‑analysis tasks—anomaly detection, log summarization, or root‑cause suggestion—without retraining the base model.
- Model Selection Guidance: When choosing a model for log‑centric workloads, prioritize (1) small parameter count with strong retrieval support, (2) fast inference (< 1 s), and (3) a training objective aligned with constrained output formats.
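For the latency criterion above, a per‑log wall‑clock measurement is straightforward to reproduce. The sketch below loads one of the evaluated model families via Hugging Face transformers; the repository id, decoding settings, and hardware are assumptions, not the paper's exact configuration.

```python
# Hypothetical per-log latency check for a candidate SLM; the Hugging Face
# repository id, decoding settings, and hardware are assumptions, not the
# paper's exact configuration.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-0.6B"  # one of the model families evaluated in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def classify(prompt: str) -> tuple[str, float]:
    """Generate a short severity answer and return it with wall-clock latency (s)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    latency = time.perf_counter() - start
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip(), latency
```

Averaging `classify` over a held‑out sample of log lines yields a per‑log latency figure comparable to the ones reported in the results table.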
Limitations & Future Work
- Dataset Scope: The benchmark uses logs from a specific Linux distribution and workload; cross‑OS or cloud‑native log formats may exhibit different challenges.
- Strict Output Constraint: Severity labels are single tokens; extending to richer outputs (e.g., multi‑label tagging or natural‑language explanations) could change the relative performance of SRLMs.
- Retrieval Overhead Not Fully Accounted: Latency measurements exclude the time to query the vector store; in production, indexing and retrieval costs could affect end‑to‑end latency.
- Model Diversity: Only nine models were evaluated; newer open‑source SLMs (e.g., Mistral‑7B, LLaMA‑3) could shift the performance landscape.
- Future Directions: The authors suggest expanding the benchmark to multi‑modal logs (e.g., combining syslog with metrics), exploring fine‑tuning on domain‑specific log corpora, and developing adaptive retrieval strategies that balance relevance with latency.
Authors
- Yahya Masri
- Emily Ma
- Zifu Wang
- Joseph Rogers
- Chaowei Yang
Paper Information
- arXiv ID: 2601.07790v1
- Categories: cs.AI
- Published: January 12, 2026