[Paper] In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach
Source: arXiv - 2602.13156v1
Overview
The paper presents an end‑to‑end incident‑response agent built on a 14‑billion‑parameter large language model (LLM). By harnessing the LLM’s pre‑trained security knowledge and in‑context learning, the system can read raw network logs, infer the current attack state, plan mitigation steps, and execute responses—all without hand‑crafted simulators. The authors demonstrate that this lightweight approach can run on ordinary hardware and recover from incidents up to 23 % faster than existing LLM‑based baselines.
Key Contributions
- Agentic architecture that unifies perception, reasoning, planning, and action within a single LLM.
- In‑context adaptation loop: the model continuously refines its attack hypothesis by comparing simulated outcomes with real observations.
- Fine‑tuning + chain‑of‑thought prompting to enable the LLM to parse unstructured logs and generate structured network‑state representations.
- Hardware‑friendly design: the 14B model fits on commodity GPUs, removing the need for massive compute clusters.
- Empirical evaluation on publicly available incident logs showing a 23 % speed‑up in recovery compared with state‑of‑the‑art LLM agents.
Methodology
- Perception – The LLM receives raw system logs and alerts as a text prompt. Using chain‑of‑thought reasoning, it extracts key entities (IP addresses, timestamps, error codes) and builds a concise “network state” snapshot.
- Reasoning – The model updates an internal attack‑model conjecture (e.g., “lateral movement via SMB exploit”) by matching observed artifacts against its pre‑trained security knowledge base.
- Planning – It simulates the impact of alternative response actions (isolate host, block port, reset credentials) by prompting itself to “run a mental simulation” of the network state after each action.
- Action – The LLM outputs concrete remediation commands (firewall rules, service restarts, forensic collection scripts).
- Feedback Loop – After the actions are executed, new logs are fed back into the model. Discrepancies between the simulated outcome and the observed outcome trigger a revision of the attack hypothesis, and the cycle repeats until the incident is contained.
The entire pipeline is driven by a single LLM that has been lightly fine‑tuned on a curated corpus of incident‑response narratives, enabling it to follow the perceive-reason-plan-act-feedback workflow without external orchestration components.
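The loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not the paper's implementation: `query_llm` and `execute` are stand-ins for a call to the fine-tuned 14B model and for the command executor, and the prompt tags (`PERCEIVE:`, `REASON:`, etc.) are hypothetical, not the paper's actual prompts.

```python
from typing import Callable

def respond_to_incident(query_llm: Callable[[str], str],
                        execute: Callable[[str], str],
                        raw_logs: str,
                        max_cycles: int = 5) -> list[str]:
    """Run the perceive-reason-plan-act-feedback cycle until the model
    judges the incident contained or the cycle budget runs out."""
    executed: list[str] = []
    for _ in range(max_cycles):
        # Perception: condense raw logs into a structured state snapshot.
        state = query_llm("PERCEIVE:\n" + raw_logs)
        # Reasoning: form or revise the attack hypothesis.
        hypothesis = query_llm("REASON:\n" + state)
        # Planning: "mentally simulate" candidate responses and pick one.
        action = query_llm("PLAN:\n" + hypothesis)
        # Action: execute the remediation and collect fresh logs.
        executed.append(action)
        raw_logs = execute(action)
        # Feedback: a mismatch with the expected outcome keeps the loop going.
        verdict = query_llm("VERIFY:\n" + raw_logs)
        if verdict.strip().lower().startswith("yes"):
            break
    return executed
```

Injecting the model and executor as callables keeps the loop testable with stubs, which mirrors the paper's point that no external orchestration component sits between the stages.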
Results & Findings
| Metric | Proposed 14B LLM Agent | Prior LLM Baselines |
|---|---|---|
| Mean time to recovery (MTTR) | 23 % faster | Baseline |
| Interaction cycles to converge on the correct attack model | 2.1 ± 0.4 | 3.4 ± 0.7 |
| Hardware footprint (GPU memory) | ~12 GB (single GPU) | 24 GB+ (multi‑GPU) |
| Success rate on benchmark incident logs (10 cases) | 9/10 resolved | 7/10 resolved |
The agent consistently identified the correct attack vector within two reasoning cycles and generated remediation steps that halted the breach earlier than competing approaches. Importantly, the system required no hand‑crafted simulation environment, relying solely on the LLM’s internal knowledge.
Practical Implications
- Rapid deployment: Security teams can spin up a responsive incident‑response bot on a standard workstation or cloud VM, avoiding the long setup times of RL‑based simulators.
- Reduced engineering overhead: No need to maintain a separate attack‑simulation engine; the LLM handles both inference and “what‑if” analysis.
- Scalable to heterogeneous environments: Because the model works directly on raw logs, it can ingest data from cloud services, container orchestrators, or on‑premise firewalls without custom parsers.
- Augmented SOC workflows: The agent can act as a “first‑line analyst,” surfacing a concise attack hypothesis and recommended actions for human analysts to review, cutting down triage time.
- Cost‑effective: Running a 14B model on a single GPU is far cheaper than maintaining large RL training clusters, making autonomous response accessible to midsize enterprises.
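As a minimal illustration of working directly on raw logs, the entity extraction the perception step performs (IP addresses, timestamps, error codes) can be approximated with plain regular expressions. The log format and the `E###` error-code convention below are assumptions for the sketch, not details from the paper:

```python
import re

# Patterns for the entity types the perception step extracts.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
ERR_RE = re.compile(r"\bE\d{3,4}\b")  # e.g. "E401"; convention assumed

def extract_entities(line: str) -> dict:
    """Return the IPs, timestamps, and error codes found in one log line."""
    return {
        "ips": IP_RE.findall(line),
        "timestamps": TS_RE.findall(line),
        "error_codes": ERR_RE.findall(line),
    }
```

For example, `extract_entities("2026-02-13T09:14:02 E401 denied 10.0.0.5 -> 10.0.0.9")` pulls out both IPs, the timestamp, and the error code, which is the kind of structured snapshot that would feed the model's "network state" prompt.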
Limitations & Future Work
- Reliance on prompt quality: The agent’s performance degrades if logs are heavily obfuscated or missing critical fields; robust preprocessing pipelines are still needed.
- Explainability: While chain‑of‑thought outputs provide some transparency, the underlying reasoning remains a black‑box LLM, which may hinder auditability in regulated sectors.
- Domain adaptation: The fine‑tuning dataset covers common enterprise attacks; novel or highly targeted threats may require additional domain‑specific data.
- Scalability to massive networks: The current design processes logs sequentially; future work could explore hierarchical prompting or retrieval‑augmented models to handle petabyte‑scale telemetry.
The authors suggest extending the framework with retrieval‑augmented generation (RAG) to incorporate up‑to‑date threat intelligence feeds and integrating human‑in‑the‑loop verification to balance autonomy with compliance requirements.
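A minimal sketch of the proposed RAG extension, assuming threat-intel feeds arrive as short text notes. The word-overlap scoring here is a deliberately simple stand-in for the embedding-based retrieval a production system would use; all names are illustrative:

```python
def retrieve(state: str, intel_notes: list[str], k: int = 2) -> list[str]:
    """Return the k intel notes sharing the most words with the state summary."""
    state_words = set(state.lower().split())
    scored = sorted(intel_notes,
                    key=lambda note: len(state_words & set(note.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(state: str, intel_notes: list[str], k: int = 2) -> str:
    """Prepend the retrieved threat intelligence to the reasoning prompt."""
    context = "\n".join(retrieve(state, intel_notes, k))
    return (f"Threat intelligence:\n{context}\n\n"
            f"Network state:\n{state}\n\n"
            f"What attack is most likely?")
```

Slotting `build_prompt` in front of the reasoning stage would let the agent condition its attack hypothesis on up-to-date feeds without retraining the model.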
Authors
- Yiran Gao
- Kim Hammar
- Tao Li
Paper Information
- arXiv ID: 2602.13156v1
- Categories: cs.CR, cs.AI
- Published: February 13, 2026