[Paper] TAAF: A Trace Abstraction and Analysis Framework Synergizing Knowledge Graphs and LLMs

Published: January 5, 2026 at 08:04 PM EST
4 min read
Source: arXiv - 2601.02632v1

Overview

The paper presents TAAF (Trace Abstraction and Analysis Framework), a new way to turn massive, low‑level execution traces—think kernel logs from Chrome, MySQL, or an OS scheduler—into concise, queryable insights. By marrying time‑indexed knowledge graphs with large language models (LLMs), TAAF lets developers ask natural‑language questions about a trace and receive accurate answers without writing custom scripts.

Key Contributions

  • Time‑indexed Knowledge Graph (KG) construction that captures temporal and relational information among threads, CPUs, I/O devices, and other system entities directly from raw trace events.
  • LLM‑driven query engine that extracts the relevant sub‑graph for a user’s natural‑language question and generates a precise answer, handling multi‑hop and causal reasoning.
  • TraceQA‑100 benchmark: a curated set of 100 realistic questions grounded in real kernel traces, enabling systematic evaluation of trace‑analysis tools.
  • Empirical gains: across three LLM back‑ends and several temporal slicing strategies, TAAF improves answer accuracy by up to 31.2 percentage points over a script‑based baseline, with the largest gains on complex, multi‑step queries.
  • Error‑analysis framework that pinpoints when graph‑grounded reasoning helps versus when LLM hallucination or graph incompleteness hurts performance.

Methodology

  1. Trace Ingestion & Normalization – Raw logs are parsed into atomic events (e.g., “thread T1 scheduled on CPU C2 at t=1234”).
  2. Temporal Indexing – Events are bucketed into sliding windows (e.g., 1 ms, 10 ms) to preserve ordering while keeping the graph size manageable.
  3. KG Construction – Nodes represent entities (threads, processes, resources) and edges encode relationships (“runs‑on”, “locks”, “writes‑to”) together with timestamps.
  4. Query Processing
    • The user writes a natural‑language question (e.g., “Which thread caused the CPU stall at 5 s?”).
    • A lightweight retriever selects the time window(s) most likely relevant.
    • The corresponding sub‑graph is serialized (node/edge list + timestamps) and fed to an LLM prompt that includes a short “graph‑to‑text” schema.
    • The LLM produces a concise answer, optionally with a justification trace.
  5. Evaluation – Answers are compared against ground‑truth in TraceQA‑100 using exact‑match and F1 metrics. Experiments vary the LLM (GPT‑4, Claude‑2, Llama‑2) and the temporal granularity of the KG.
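The pipeline above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the comma‑separated event format, the window size, and all function names here are assumptions made for the example.

```python
from collections import defaultdict

def parse_event(line):
    """Parse a raw log line like 'sched,T1,C2,1234' into an atomic event."""
    kind, src, dst, ts = line.strip().split(",")
    return {"kind": kind, "src": src, "dst": dst, "ts": int(ts)}

def build_kg(events, window_us=1000):
    """Time-indexed KG: window id -> list of (src, relation, dst, ts) edges."""
    kg = defaultdict(list)
    relation = {"sched": "runs-on", "lock": "locks", "write": "writes-to"}
    for ev in events:
        window = ev["ts"] // window_us          # temporal bucket
        kg[window].append((ev["src"], relation[ev["kind"]], ev["dst"], ev["ts"]))
    return kg

def serialize_subgraph(kg, t_start, t_end, window_us=1000):
    """Serialize edges with t_start <= ts < t_end as text for an LLM prompt."""
    lines = []
    for w in range(t_start // window_us, -(-t_end // window_us)):
        for src, rel, dst, ts in kg.get(w, []):
            if t_start <= ts < t_end:
                lines.append(f"{src} --{rel}--> {dst} @ t={ts}")
    return "\n".join(lines)

raw = ["sched,T1,C2,1234", "lock,T1,M7,1500", "sched,T2,C2,2600"]
kg = build_kg(map(parse_event, raw))
print(serialize_subgraph(kg, 1000, 2000))
# T1 --runs-on--> C2 @ t=1234
# T1 --locks--> M7 @ t=1500
```

The key design point mirrored here is that retrieval touches only the buckets overlapping the query window, so the text handed to the LLM stays small regardless of total trace length.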

Results & Findings

| Setting | Baseline (script‑only) | TAAF (best LLM) | Δ Accuracy |
|---|---|---|---|
| Single‑hop factual Qs | 78.4 % | 85.9 % | +7.5 % |
| Multi‑hop reasoning | 62.1 % | 84.3 % | +22.2 % |
| Causal chain (e.g., “What triggered X?”) | 55.0 % | 86.2 % | +31.2 % |

  • Varying window size (10 ms vs. 1 s): small windows improve precision for fine‑grained bugs, while large windows help high‑level performance queries.
  • Graph grounding shines on questions that require stitching together several events (e.g., “Did thread A pre‑empt thread B before the deadlock?”).
  • LLM choice matters: GPT‑4 consistently outperformed the open‑source Llama‑2, but Claude‑2 showed better resistance to hallucinations on noisy sub‑graphs.
  • Failure modes: When the KG omitted rare system calls or when timestamps were coarse, the LLM sometimes fabricated plausible‑looking but incorrect answers.
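The accuracy numbers above come from the exact‑match and F1 scoring described in the evaluation step. A minimal sketch of those metrics, assuming the common SQuAD‑style token‑level F1 variant (the paper may normalize answers differently):

```python
def exact_match(pred, gold):
    """Case- and whitespace-insensitive exact match."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred, gold):
    """Token-level F1: harmonic mean of token precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    gold_counts = {}
    for t in g:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in p:                      # count overlapping tokens, with multiplicity
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("thread T1 on CPU C2", "thread T1"))
```

F1 gives partial credit when an answer contains the right entity plus extra context, which is why it is commonly reported alongside exact match for free‑form QA.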

Practical Implications

  • Reduced debugging toil – Engineers can ask “Why did request #42 take 200 ms?” and get a trace‑backed answer without writing a custom parser.
  • Accelerated performance tuning – Performance teams can query “Which CPU core experienced the highest cache‑miss rate during the load test?” and instantly receive a ranked list.
  • Cross‑team knowledge sharing – Ops, security, and development can use a common natural‑language interface to explore the same trace data, lowering the barrier for non‑kernel experts.
  • Tool integration – TAAF’s KG can be exported to Neo4j or GraphQL endpoints, enabling existing observability stacks (Grafana, Elastic) to embed LLM‑driven insights.
  • Cost‑effective analysis – By limiting LLM calls to focused sub‑graphs rather than the whole trace, the framework keeps API usage (and thus cloud cost) modest.
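The Neo4j export mentioned above could, under one simple mapping, amount to emitting Cypher `MERGE` statements from the KG's edge list. The `Entity` label and property names here are illustrative assumptions, not TAAF's actual schema:

```python
def edges_to_cypher(edges):
    """Turn (src, relation, dst, ts) tuples into Cypher MERGE statements.

    Relation names are upper-cased and hyphens replaced with underscores,
    since Cypher relationship types cannot contain '-'.
    """
    stmts = []
    for src, rel, dst, ts in edges:
        rel_type = rel.upper().replace("-", "_")
        stmts.append(
            f"MERGE (a:Entity {{name: '{src}'}}) "
            f"MERGE (b:Entity {{name: '{dst}'}}) "
            f"MERGE (a)-[:{rel_type} {{ts: {ts}}}]->(b);"
        )
    return "\n".join(stmts)

print(edges_to_cypher([("T1", "runs-on", "C2", 1234)]))
```

Using `MERGE` rather than `CREATE` keeps the export idempotent, so re‑running it over overlapping trace windows does not duplicate nodes.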

Limitations & Future Work

  • Scalability of KG – Extremely long‑running traces (hours of kernel activity) still produce graphs that strain memory; incremental pruning or summarization is needed.
  • LLM hallucination risk – When the graph is incomplete, the model may “fill in gaps” with plausible but false statements; tighter grounding checks are an open research direction.
  • Domain‑specific vocabularies – The current prompt templates assume generic OS concepts; extending to specialized domains (e.g., GPU drivers, distributed databases) will require custom schema definitions.
  • Benchmark breadth – TraceQA‑100 focuses on kernel traces; future benchmarks should cover user‑space logs, cloud‑native microservice traces, and security‑oriented events.

Bottom line: TAAF demonstrates that coupling structured, time‑aware graphs with LLM reasoning can dramatically improve the accessibility and accuracy of trace analysis, opening the door for smarter, developer‑friendly observability tools.

Authors

  • Alireza Ezaz
  • Ghazal Khodabandeh
  • Majid Babaei
  • Naser Ezzati-Jivan

Paper Information

  • arXiv ID: 2601.02632v1
  • Categories: cs.SE, cs.AI
  • Published: January 6, 2026
  • PDF: Download PDF