Building a RAG-Based AWS VPC Flow Log Analyzer
Source: Dev.to

Introduction
Understanding network traffic inside a Virtual Private Cloud (VPC) directly impacts your security posture, performance visibility, and compliance readiness. Yet most teams still sift through raw flow logs manually, reacting to incidents instead of proactively investigating them.
Rather than grepping through thousands of log lines or exporting data to spreadsheets, we can turn VPC Flow Logs into an interactive, queryable layer.
What if you could simply ask your logs questions like this?
- Was that SSH connection rejected?
- Which IP keeps hitting port 443?
- Is this traffic normal or a problem?
In this article we’ll build a Retrieval‑Augmented Generation (RAG) powered VPC Flow Log Analyzer that turns static network telemetry into an interactive security assistant.
The Challenge of Manual Log Analysis
AWS VPC Flow Logs capture essential information about network traffic. However, analysing these raw logs to detect threats (e.g., SQL‑injection attempts or unauthorised access) presents significant challenges:
- Information overload – The sheer volume of logs is overwhelming. Finding specific patterns or anomalies is like searching for a needle in a haystack.
- Context fragmentation – Raw logs lack context. Identifying related packets across different components and time frames is labour‑intensive and error‑prone.
The RAG‑based VPC Flow Log Analyzer addresses these problems with:
- Streamlit – interactive UI
- LangChain – RAG orchestration
- Chroma – vector database
- OpenAI GPT‑4o – reasoning engine
At the end you’ll have a conversational security assistant capable of answering questions such as:
- “Which IPs were rejected?”
- “Was there unusual traffic to port 22?”
- “Which destinations received the most packets?”

Functional Components
| Component | Role | Implementation |
|---|---|---|
| Data Ingestion & Transformation (“Translator”) | Turns raw VPC Flow Log strings (e.g., 2 123... 443 6 ACCEPT) into human‑readable sentences such as “Source 10.0.1.5 sent 1000 bytes to port 443 and was ACCEPTED.” | Custom Python parser |
| Embedding Model (“Encoder”) | Converts each log sentence into a numerical fingerprint (vector) for semantic search | text‑embedding‑3‑small (OpenAI) |
| Vector Database (“Memory”) | Stores the vectors and enables fast similarity search | ChromaDB (local) |
| RAG Orchestration & LLM (“Brain”) | Retrieves relevant vectors, feeds them to the LLM with a prompt, and returns a natural‑language answer | LangChain + GPT‑4o |
| Streamlit Frontend (“Cockpit”) | UI for uploading logs, managing API keys, and chatting with the assistant | Streamlit web framework |
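To make the "Translator" row concrete, here is a minimal sketch of that step — the repository's actual parser may differ. Field positions assume the default version-2 flow log format, and `flow_log_to_sentence` is an illustrative name, not a function from the repo:

```python
# Minimal sketch of the "Translator" step (assumes the default
# version-2 flow log field order).
PROTOCOLS = {"6": "TCP", "17": "UDP", "1": "ICMP"}  # protocol number -> name

def flow_log_to_sentence(line: str) -> str:
    """Turn one raw flow-log record into a human-readable sentence."""
    f = line.split()
    if len(f) < 14:
        raise ValueError(f"unexpected flow-log record: {line!r}")
    proto = PROTOCOLS.get(f[7], f"protocol {f[7]}")
    action = {"ACCEPT": "ACCEPTED", "REJECT": "REJECTED"}.get(f[12], f[12])
    return (
        f"Source {f[3]} sent {f[9]} bytes ({f[8]} packets) over {proto} "
        f"to {f[4]} on port {f[6]} and was {action}."
    )

sample = ("2 123456789010 eni-abc123 10.0.1.5 10.0.2.9 49761 443 6 "
          "10 1000 1620000000 1620000060 ACCEPT OK")
print(flow_log_to_sentence(sample))
# → Source 10.0.1.5 sent 1000 bytes (10 packets) over TCP to 10.0.2.9 on port 443 and was ACCEPTED.
```

Sentences like this embed far better than raw records, because the embedding model can latch onto words such as "port", "REJECTED", and "bytes".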
Implementation Steps
1️⃣ Clone the Repository & Set Up a Virtual Environment
```bash
git clone https://github.com/Damdev-95/rag_aws_flow_logs
cd rag_aws_flow_logs
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
```

2️⃣ Configure Environment Variables
Create a `.env` file (or export the variable) containing your OpenAI API key:

```
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXX
```

The code accesses it via:

```python
import os

ENV_API_KEY = os.getenv("OPENAI_API_KEY")
```
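Note that `os.getenv` only sees variables already in the process environment — `.env` files are not loaded automatically. Projects typically use `python-dotenv` for this; as a dependency-free illustration, a minimal loader could look like the sketch below (`load_env_file` is a hypothetical helper, not part of the repo):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): copy KEY=VALUE
    lines from a .env file into os.environ (existing values take priority)."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call it once at startup, before any `os.getenv("OPENAI_API_KEY")` lookup.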
3️⃣ Run the Streamlit Application
```bash
streamlit run app.py
```

Once you click Browse files, you can upload a VPC Flow Log (.txt) and start asking questions.
What to Expect
- Upload a log file → the parser translates each line into a readable sentence.
- Embedding step creates a vector for every sentence and stores it in ChromaDB.
- Chat: type a natural‑language query; LangChain fetches the most relevant vectors and sends them, together with a system prompt, to GPT‑4o.
- Response: the LLM returns a concise answer, optionally highlighting the relevant log entries.
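The retrieval step above can be illustrated with a dependency-free toy. Real embeddings from `text-embedding-3-small` have ~1,500 dimensions; the three-dimensional vectors here are hand-made stand-ins, and the `store` list plays the role of ChromaDB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in for ChromaDB: (embedding, translated log sentence) pairs.
store = [
    ([0.9, 0.1, 0.0], "Source 10.0.1.5 sent 1000 bytes to port 443 and was ACCEPTED."),
    ([0.1, 0.9, 0.0], "Source 203.0.113.7 tried port 22 and was REJECTED."),
    ([0.2, 0.2, 0.9], "Source 10.0.2.9 sent 200 bytes to port 53 and was ACCEPTED."),
]

def retrieve(query_vec, k=2):
    """Return the k log sentences most similar to the query embedding."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [sentence for _, sentence in ranked[:k]]

# A question about rejected SSH traffic embeds closest to the second entry;
# the retrieved sentences plus the question form the prompt sent to GPT-4o.
context = retrieve([0.15, 0.95, 0.05])
prompt = "Answer using only these log entries:\n" + "\n".join(context)
```

In the actual app, LangChain and Chroma perform exactly this ranking — embed the question, rank stored vectors by similarity, and hand the top matches to the LLM as context.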
Further Reading & Resources
- GitHub repository: https://github.com/Damdev-95/rag_aws_flow_logs
- Streamlit documentation
- LangChain documentation
- ChromaDB documentation
- OpenAI embeddings guide
Log File Format
The analyzer expects VPC Flow Logs exported as a plain-text (.txt) file, one record per line.
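With the default (version 2) format, each record carries the fields version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, and log-status — for example (adapted from the AWS documentation):

```
2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK
```

This record shows 20 TCP packets (4,249 bytes) to destination port 22 (SSH) that were accepted.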
Steps to Build the Knowledge Base
1. Select “Build Knowledge Base” – the raw log lines are converted into vectors and stored in the vector database.
2. Vector data creation – after embedding, a vector is generated for every log sentence.
3. Index creation – the index is built once the embedding process completes.
Sample Query
“What is the summary of the flow logs, broken down by accepted and rejected traffic?”

Additional Example Queries with Interaction

Final Result

Stay tuned for more RAG and generative AI projects in cloud networking. I look forward to your comments.
