Building a RAG-Based AWS VPC Flow Log Analyzer
Source: Dev.to

Introduction
Understanding network traffic inside a Virtual Private Cloud (VPC) directly impacts your security posture, performance visibility, and compliance readiness. Yet most teams still sift through raw flow logs manually, reacting to incidents instead of proactively investigating them.
Rather than grepping through thousands of log lines or exporting data to spreadsheets, we can turn VPC Flow Logs into an interactive, queryable layer.
What if you could simply ask your logs questions like this?
- Was that SSH connection rejected?
- Which IP keeps hitting port 443?
- Is this traffic normal or a problem?
In this article we’ll build a Retrieval‑Augmented Generation (RAG) powered VPC Flow Log Analyzer that turns static network telemetry into an interactive security assistant.
The Challenge of Manual Log Analysis
AWS VPC Flow Logs capture essential information about network traffic. However, analysing these raw logs to detect threats (e.g., SQL‑injection attempts or unauthorised access) presents significant challenges:
- Information overload – The sheer volume of logs is overwhelming. Finding specific patterns or anomalies is like searching for a needle in a haystack.
- Context fragmentation – Raw logs lack context. Identifying related packets across different components and time frames is labour‑intensive and error‑prone.
The RAG‑based VPC Flow Log Analyzer addresses these problems with:
- Streamlit – interactive UI
- LangChain – RAG orchestration
- Chroma – vector database
- OpenAI GPT‑4o – reasoning engine
At the end you’ll have a conversational security assistant capable of answering questions such as:
- “Which IPs were rejected?”
- “Was there unusual traffic to port 22?”
- “Which destinations received the most packets?”

Functional Components
| Component | Role | Implementation |
|---|---|---|
| Data Ingestion & Transformation (“Translator”) | Turns raw VPC Flow Log strings (e.g., 2 123... 443 6 ACCEPT) into human‑readable sentences such as “Source 10.0.1.5 sent 1000 bytes to port 443 and was ACCEPTED.” | Custom Python parser |
| Embedding Model (“Encoder”) | Converts each log sentence into a numerical fingerprint (vector) for semantic search | text‑embedding‑3‑small (OpenAI) |
| Vector Database (“Memory”) | Stores the vectors and enables fast similarity search | ChromaDB (local) |
| RAG Orchestration & LLM (“Brain”) | Retrieves relevant vectors, feeds them to the LLM with a prompt, and returns a natural‑language answer | LangChain + GPT‑4o |
| Streamlit Frontend (“Cockpit”) | UI for uploading logs, managing API keys, and chatting with the assistant | Streamlit web framework |
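To make the "Translator" row concrete, here is a minimal sketch of that step — the repository's actual parser may differ. Field positions assume the default version-2 flow log format, and `flow_log_to_sentence` is an illustrative name, not a function from the repo:

```python
# Minimal sketch of the "Translator" step (assumes the default
# version-2 flow log field order).
PROTOCOLS = {"6": "TCP", "17": "UDP", "1": "ICMP"}  # protocol number -> name

def flow_log_to_sentence(line: str) -> str:
    """Turn one raw flow-log record into a human-readable sentence."""
    f = line.split()
    if len(f) < 14:
        raise ValueError(f"unexpected flow-log record: {line!r}")
    proto = PROTOCOLS.get(f[7], f"protocol {f[7]}")
    action = {"ACCEPT": "ACCEPTED", "REJECT": "REJECTED"}.get(f[12], f[12])
    return (
        f"Source {f[3]} sent {f[9]} bytes ({f[8]} packets) over {proto} "
        f"to {f[4]} on port {f[6]} and was {action}."
    )

sample = ("2 123456789010 eni-abc123 10.0.1.5 10.0.2.9 49761 443 6 "
          "10 1000 1620000000 1620000060 ACCEPT OK")
print(flow_log_to_sentence(sample))
# → Source 10.0.1.5 sent 1000 bytes (10 packets) over TCP to 10.0.2.9 on port 443 and was ACCEPTED.
```

Sentences like this embed far better than raw records, because the embedding model can latch onto words such as "port", "REJECTED", and "bytes".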
Implementation Steps
1️⃣ Clone the Repository & Set Up a Virtual Environment
```bash
git clone https://github.com/Damdev-95/rag_aws_flow_logs
cd rag_aws_flow_logs
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
```

2️⃣ Configure Environment Variables
Create a `.env` file (or export the variable) containing your OpenAI API key:

```
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXX
```

The code accesses it via:

```python
import os

ENV_API_KEY = os.getenv("OPENAI_API_KEY")
```
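Note that `os.getenv` only sees variables already in the process environment — `.env` files are not loaded automatically. Projects typically use `python-dotenv` for this; as a dependency-free illustration, a minimal loader could look like the sketch below (`load_env_file` is a hypothetical helper, not part of the repo):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): copy KEY=VALUE
    lines from a .env file into os.environ (existing values take priority)."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call it once at startup, before any `os.getenv("OPENAI_API_KEY")` lookup.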
3️⃣ Run the Streamlit Application
```bash
streamlit run app.py
```

Once you click Browse files, you can upload a VPC Flow Log (.txt) and start asking questions.
What to Expect
- Upload a log file → the parser translates each line into a readable sentence.
- Embedding step creates a vector for every sentence and stores it in ChromaDB.
- Chat: type a natural‑language query; LangChain fetches the most relevant vectors and sends them, together with a system prompt, to GPT‑4o.
- Response: the LLM returns a concise answer, optionally highlighting the relevant log entries.
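The retrieval step above can be illustrated with a dependency-free toy. Real embeddings from `text-embedding-3-small` have ~1,500 dimensions; the three-dimensional vectors here are hand-made stand-ins, and the `store` list plays the role of ChromaDB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in for ChromaDB: (embedding, translated log sentence) pairs.
store = [
    ([0.9, 0.1, 0.0], "Source 10.0.1.5 sent 1000 bytes to port 443 and was ACCEPTED."),
    ([0.1, 0.9, 0.0], "Source 203.0.113.7 tried port 22 and was REJECTED."),
    ([0.2, 0.2, 0.9], "Source 10.0.2.9 sent 200 bytes to port 53 and was ACCEPTED."),
]

def retrieve(query_vec, k=2):
    """Return the k log sentences most similar to the query embedding."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [sentence for _, sentence in ranked[:k]]

# A question about rejected SSH traffic embeds closest to the second entry;
# the retrieved sentences plus the question form the prompt sent to GPT-4o.
context = retrieve([0.15, 0.95, 0.05])
prompt = "Answer using only these log entries:\n" + "\n".join(context)
```

In the actual app, LangChain and Chroma perform exactly this ranking — embed the question, rank stored vectors by similarity, and hand the top matches to the LLM as context.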
Further Reading & Resources
- GitHub repository: https://github.com/Damdev-95/rag_aws_flow_logs
- Streamlit documentation
- LangChain documentation
- ChromaDB documentation
- OpenAI embeddings guide
Log File Format
The analyzer expects VPC Flow Logs exported as a plain-text (.txt) file, one record per line.
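With the default (version 2) format, each record carries the fields version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, and log-status — for example (adapted from the AWS documentation):

```
2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK
```

This record shows 20 TCP packets (4,249 bytes) to destination port 22 (SSH) that were accepted.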
Steps to Build the Knowledge Base
1. Select “Build Knowledge Base” – the raw log lines are converted into vectors and stored in the vector database.
2. Vector data creation – after embedding, a vector is generated for every log sentence.
3. Index creation – the index is built once the embedding process completes.
Sample Query
“What is the summary of the flow logs, broken down by accepted and rejected traffic?”

Additional Example Queries with Interaction

Final Result

Stay tuned for more RAG and generative AI projects in cloud networking. I look forward to your comments.
