[Paper] TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
Source: arXiv - 2604.21889v1
Overview
The paper introduces TingIS, a production‑grade system that turns noisy, high‑volume customer incident reports into real‑time risk alerts for large cloud‑native services. By marrying fast indexing with Large Language Models (LLMs) and a layered noise‑filtering pipeline, TingIS can surface actionable incidents within minutes—crucial for preventing costly outages.
Key Contributions
- Hybrid event‑linking engine: Combines traditional similarity indexing with LLM‑driven semantic reasoning to decide when disparate incident messages belong to the same underlying risk event.
- Cascaded business‑routing architecture: Dynamically attributes incidents to the correct product line or service domain, improving downstream triage.
- Multi‑dimensional noise‑reduction pipeline: Leverages domain ontologies, statistical outlier detection, and user‑behavioral signals to suppress irrelevant chatter while preserving rare, high‑impact reports.
- Scalable production deployment: Sustains peak rates above 2 k messages/min on a stream of roughly 300 k messages/day, with a 90th‑percentile alert latency of 3.5 min and a 95 % discovery rate for high‑priority incidents.
- Empirical validation: Benchmarks on real‑world incident streams show superior routing accuracy, clustering quality, and signal‑to‑noise ratio compared with baseline clustering or rule‑based systems.
Methodology
- Ingestion & Indexing – Incoming incident tickets are first tokenized and stored in an approximate nearest‑neighbor (ANN) index (e.g., HNSW). This provides sub‑millisecond candidate retrieval for any new message.
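To make the retrieval step concrete, here is a minimal sketch of candidate lookup over embedded tickets. It uses a brute-force cosine-similarity scan for clarity; in a production setting the linear scan would be replaced by an ANN index such as HNSW (e.g., via a library like hnswlib) to get the sub‑millisecond lookups the paper describes. The function names and shapes are illustrative, not the authors' API.

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize ticket embeddings so cosine similarity is a dot product."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def top_k_candidates(index: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored tickets most similar to a new message."""
    q = query / max(np.linalg.norm(query), 1e-12)
    sims = index @ q                      # cosine similarity against every ticket
    return np.argsort(-sims)[:k]          # highest-similarity candidates first
```

Only the retrieved top‑k candidates are passed to the (more expensive) LLM scoring stage, which is what keeps per-message cost bounded.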
- LLM‑augmented similarity scoring – For each candidate pair, a lightweight LLM (e.g., a distilled transformer) generates a semantic similarity score that captures nuanced business terminology, abbreviations, and context that pure lexical metrics miss.
- Event linking decision – A calibrated threshold (learned from historical labeled incidents) determines whether two messages should be merged into a single “risk event”. The system operates in a streaming fashion, updating clusters incrementally.
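The streaming merge step above can be sketched with a union-find structure: each new message either starts its own risk event or is merged with any retrieved candidate whose similarity clears the calibrated threshold. The `score` callable below stands in for the LLM similarity scorer; the class and its interface are an illustration, not the paper's implementation.

```python
class EventLinker:
    """Streaming event linking: merge a new message into an existing risk
    event when its similarity to a candidate exceeds a learned threshold."""

    def __init__(self, score, threshold: float = 0.8):
        self.score = score                 # stand-in for the LLM similarity scorer
        self.threshold = threshold
        self.parent: dict[int, int] = {}   # union-find over message ids

    def _find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, msg_id: int, msg: str, candidates: list) -> int:
        """Add a message; merge with candidates scoring above threshold.
        Returns the id of the risk event the message now belongs to."""
        self.parent[msg_id] = msg_id
        for cand_id, cand_msg in candidates:
            if self.score(msg, cand_msg) >= self.threshold:
                self.parent[self._find(msg_id)] = self._find(cand_id)
        return self._find(msg_id)
```

Because clusters only ever merge, the structure updates incrementally as messages stream in, matching the paper's one-pass setting.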
- Cascaded routing – Once an event is formed, a hierarchy of classifiers (rule‑based filters → shallow ML models → LLM‑based intent recognizer) routes the event to the appropriate service team or escalation path.
- Noise reduction – Three orthogonal filters prune spurious data:
- Domain knowledge filter – Uses a curated ontology of known error codes, service names, and de‑duplication patterns.
- Statistical filter – Flags outliers based on frequency, temporal burstiness, and historical severity distributions.
- Behavioral filter – Discounts reports from users with low trust scores or repetitive low‑severity submissions.
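One possible composition of the three filters is a conjunctive gate: a report survives only if it passes all of them. The field names, cutoffs, and the specific statistics below are illustrative assumptions, not the authors' definitions.

```python
def is_signal(report: dict, known_vocab: set,
              spam_zscore_cut: float = 3.0, trust_cut: float = 0.3) -> bool:
    """Keep a report only if it passes all three noise filters."""
    # Domain-knowledge filter: must mention a known error code or
    # service name from the curated ontology.
    if not (set(report["tokens"]) & known_vocab):
        return False
    # Statistical filter: drop spam-like submission bursts (z-score of the
    # sender's recent rate against their historical distribution).
    if report["freq_zscore"] > spam_zscore_cut:
        return False
    # Behavioral filter: discount low-trust or repetitive submitters.
    if report["user_trust"] < trust_cut:
        return False
    return True
```

Because the filters are orthogonal, each can be tuned or replaced independently without retraining the others.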
- Alert generation – Cleaned, routed events trigger alerts via existing incident‑management APIs, respecting SLA latency budgets.
Results & Findings
| Metric | TingIS | Baseline (rule‑based clustering) |
|---|---|---|
| Routing accuracy | 92 % | 71 % |
| Clustering F1 | 0.84 | 0.61 |
| Signal‑to‑Noise Ratio | 4.7× improvement | – |
| P90 alert latency | 3.5 min | 9.2 min |
| High‑priority discovery rate | 95 % | 68 % |
The authors also report that the LLM‑enhanced similarity step adds only ~15 ms per candidate pair, keeping the end‑to‑end pipeline well within the required latency budget. Real‑world A/B tests showed a measurable reduction in mean time to resolution (MTTR) for critical incidents.
Practical Implications
- Faster incident response – Developers can rely on TingIS to surface emerging problems before they manifest as full‑blown outages, shaving minutes off MTTR.
- Reduced alert fatigue – By aggressively filtering noise, on‑call engineers receive fewer false positives, allowing them to focus on truly risky events.
- Cross‑service visibility – The routing layer automatically maps incidents to the correct product team, eliminating manual triage steps that often delay remediation.
- Plug‑and‑play architecture – The system is built on open‑source ANN libraries and LLM inference servers, making it adaptable to any organization that already collects customer‑facing tickets (e.g., Slack, Jira, email).
- Cost savings – Early detection of high‑impact anomalies can prevent costly downtime, translating into direct financial ROI for cloud providers and SaaS platforms.
Limitations & Future Work
- LLM dependence – While the distilled model keeps latency low, the approach still requires GPU/accelerator resources; smaller teams may need to trade off accuracy for cheaper hardware.
- Domain‑specific tuning – The ontology and threshold calibration are handcrafted for the authors’ enterprise; porting TingIS to a new vertical will involve a non‑trivial onboarding effort.
- Handling concept drift – As services evolve, the semantic landscape shifts; the authors suggest periodic re‑training of the LLM scorer and updating the ontology, but an automated drift‑detection mechanism remains an open challenge.
- Explainability – The LLM‑driven similarity scores are not easily interpretable, which can hinder root‑cause analysis; future work could integrate attention‑based explanations or hybrid symbolic‑neural models.
Overall, TingIS demonstrates that a thoughtfully engineered blend of classic IR techniques and modern LLMs can deliver enterprise‑scale, real‑time risk discovery from noisy customer data—a blueprint many dev‑ops and reliability teams can adapt to their own incident pipelines.
Authors
- Jun Wang
- Ziyin Zhang
- Rui Wang
- Hang Yu
- Peng Di
- Rui Wang
Paper Information
- arXiv ID: 2604.21889v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: April 23, 2026