[Paper] TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
Source: arXiv - 2604.21889v1
Overview
The paper introduces TingIS, a production‑grade system that turns noisy, high‑volume customer incident reports into real‑time risk alerts for large cloud‑native services. By marrying fast indexing with Large Language Models (LLMs) and a layered noise‑filtering pipeline, TingIS can surface actionable incidents within minutes—crucial for preventing costly outages.
Key Contributions
- Hybrid event‑linking engine: Combines traditional similarity indexing with LLM‑driven semantic reasoning to decide when disparate incident messages belong to the same underlying risk event.
- Cascaded business‑routing architecture: Dynamically attributes incidents to the correct product line or service domain, improving downstream triage.
- Multi‑dimensional noise‑reduction pipeline: Leverages domain ontologies, statistical outlier detection, and user‑behavioral signals to suppress irrelevant chatter while preserving rare, high‑impact reports.
- Scalable production deployment: Sustains peak rates above 2 k messages/min on a stream of roughly 300 k messages/day, with a 90th‑percentile alert latency of 3.5 min and a 95 % discovery rate for high‑priority incidents.
- Empirical validation: Benchmarks on real‑world incident streams show superior routing accuracy, clustering quality, and signal‑to‑noise ratio compared with baseline clustering or rule‑based systems.
Methodology
- Ingestion & Indexing – Incoming incident tickets are first tokenized and stored in an approximate nearest‑neighbor (ANN) index (e.g., HNSW). This provides sub‑millisecond candidate retrieval for any new message.
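To make the retrieval step concrete, here is a minimal sketch of candidate lookup over embedded tickets. It uses a brute-force cosine-similarity scan for clarity; in a production setting the linear scan would be replaced by an ANN index such as HNSW (e.g., via a library like hnswlib) to get the sub‑millisecond lookups the paper describes. The function names and shapes are illustrative, not the authors' API.

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize ticket embeddings so cosine similarity is a dot product."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def top_k_candidates(index: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored tickets most similar to a new message."""
    q = query / max(np.linalg.norm(query), 1e-12)
    sims = index @ q                      # cosine similarity against every ticket
    return np.argsort(-sims)[:k]          # highest-similarity candidates first
```

Only the retrieved top‑k candidates are passed to the (more expensive) LLM scoring stage, which is what keeps per-message cost bounded.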
- LLM‑augmented similarity scoring – For each candidate pair, a lightweight LLM (e.g., a distilled transformer) generates a semantic similarity score that captures nuanced business terminology, abbreviations, and context that pure lexical metrics miss.
- Event linking decision – A calibrated threshold (learned from historical labeled incidents) determines whether two messages should be merged into a single “risk event”. The system operates in a streaming fashion, updating clusters incrementally.
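The streaming merge step above can be sketched with a union-find structure: each new message either starts its own risk event or is merged with any retrieved candidate whose similarity clears the calibrated threshold. The `score` callable below stands in for the LLM similarity scorer; the class and its interface are an illustration, not the paper's implementation.

```python
class EventLinker:
    """Streaming event linking: merge a new message into an existing risk
    event when its similarity to a candidate exceeds a learned threshold."""

    def __init__(self, score, threshold: float = 0.8):
        self.score = score                 # stand-in for the LLM similarity scorer
        self.threshold = threshold
        self.parent: dict[int, int] = {}   # union-find over message ids

    def _find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, msg_id: int, msg: str, candidates: list) -> int:
        """Add a message; merge with candidates scoring above threshold.
        Returns the id of the risk event the message now belongs to."""
        self.parent[msg_id] = msg_id
        for cand_id, cand_msg in candidates:
            if self.score(msg, cand_msg) >= self.threshold:
                self.parent[self._find(msg_id)] = self._find(cand_id)
        return self._find(msg_id)
```

Because clusters only ever merge, the structure updates incrementally as messages stream in, matching the paper's one-pass setting.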
- Cascaded routing – Once an event is formed, a hierarchy of classifiers (rule‑based filters → shallow ML models → LLM‑based intent recognizer) routes the event to the appropriate service team or escalation path.
- Noise reduction – Three orthogonal filters prune spurious data:
- Domain knowledge filter – Uses a curated ontology of known error codes, service names, and de‑duplication patterns.
- Statistical filter – Flags outliers based on frequency, temporal burstiness, and historical severity distributions.
- Behavioral filter – Discounts reports from users with low trust scores or repetitive low‑severity submissions.
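One possible composition of the three filters is a conjunctive gate: a report survives only if it passes all of them. The field names, cutoffs, and the specific statistics below are illustrative assumptions, not the authors' definitions.

```python
def is_signal(report: dict, known_vocab: set,
              spam_zscore_cut: float = 3.0, trust_cut: float = 0.3) -> bool:
    """Keep a report only if it passes all three noise filters."""
    # Domain-knowledge filter: must mention a known error code or
    # service name from the curated ontology.
    if not (set(report["tokens"]) & known_vocab):
        return False
    # Statistical filter: drop spam-like submission bursts (z-score of the
    # sender's recent rate against their historical distribution).
    if report["freq_zscore"] > spam_zscore_cut:
        return False
    # Behavioral filter: discount low-trust or repetitive submitters.
    if report["user_trust"] < trust_cut:
        return False
    return True
```

Because the filters are orthogonal, each can be tuned or replaced independently without retraining the others.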
- Alert generation – Cleaned, routed events trigger alerts via existing incident‑management APIs, respecting SLA latency budgets.
Results & Findings
| Metric | TingIS | Baseline (rule‑based clustering) |
|---|---|---|
| Routing accuracy | 92 % | 71 % |
| Clustering F1 | 0.84 | 0.61 |
| Signal‑to‑Noise Ratio | 4.7× improvement | – |
| P90 alert latency | 3.5 min | 9.2 min |
| High‑priority discovery rate | 95 % | 68 % |
The authors also report that the LLM‑enhanced similarity step adds only ~15 ms per candidate pair, keeping the end‑to‑end pipeline well within the required latency budget. Real‑world A/B tests showed a measurable reduction in mean time to resolution (MTTR) for critical incidents.
Practical Implications
- Faster incident response – Developers can rely on TingIS to surface emerging problems before they manifest as full‑blown outages, shaving minutes off MTTR.
- Reduced alert fatigue – By aggressively filtering noise, on‑call engineers receive fewer false positives, allowing them to focus on truly risky events.
- Cross‑service visibility – The routing layer automatically maps incidents to the correct product team, eliminating manual triage steps that often delay remediation.
- Plug‑and‑play architecture – The system is built on open‑source ANN libraries and LLM inference servers, making it adaptable to any organization that already collects customer‑facing tickets (e.g., Slack, Jira, email).
- Cost savings – Early detection of high‑impact anomalies can prevent costly downtime, translating into direct financial ROI for cloud providers and SaaS platforms.
Limitations & Future Work
- LLM dependence – While the distilled model keeps latency low, the approach still requires GPU/accelerator resources; smaller teams may need to trade off accuracy for cheaper hardware.
- Domain‑specific tuning – The ontology and threshold calibration are handcrafted for the authors’ enterprise; porting TingIS to a new vertical will involve a non‑trivial onboarding effort.
- Handling concept drift – As services evolve, the semantic landscape shifts; the authors suggest periodic re‑training of the LLM scorer and updating the ontology, but an automated drift‑detection mechanism remains an open challenge.
- Explainability – The LLM‑driven similarity scores are not easily interpretable, which can hinder root‑cause analysis; future work could integrate attention‑based explanations or hybrid symbolic‑neural models.
Overall, TingIS demonstrates that a thoughtfully engineered blend of classic IR techniques and modern LLMs can deliver enterprise‑scale, real‑time risk discovery from noisy customer data—a blueprint many dev‑ops and reliability teams can adapt to their own incident pipelines.
Authors
- Jun Wang
- Ziyin Zhang
- Rui Wang
- Hang Yu
- Peng Di
- Rui Wang
Paper Information
- arXiv ID: 2604.21889v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: April 23, 2026