Building a RAG-Based AWS VPC Flow Log Analyzer

Published: February 28, 2026 at 10:06 AM EST
4 min read
Source: Dev.to

Sulaiman Olubiyi

Introduction

Understanding network traffic inside a Virtual Private Cloud (VPC) directly impacts your security posture, performance visibility, and compliance readiness. Yet most teams still sift through raw flow logs manually, reacting to incidents instead of proactively investigating them.

Rather than grepping through thousands of log lines or exporting data to spreadsheets, we can turn VPC Flow Logs into an interactive, queryable layer.

What if you could simply ask your logs questions like this?

  • Was that SSH connection rejected?
  • Which IP keeps hitting port 443?
  • Is this traffic normal or a problem?

In this article we’ll build a Retrieval‑Augmented Generation (RAG) powered VPC Flow Log Analyzer that turns static network telemetry into an interactive security assistant.

The Challenge of Manual Log Analysis

AWS VPC Flow Logs capture essential information about network traffic. However, analysing these raw logs to detect threats (e.g., SQL‑injection attempts or unauthorised access) presents significant challenges:

  • Information overload – The sheer volume of logs is overwhelming. Finding specific patterns or anomalies is like searching for a needle in a haystack.
  • Context fragmentation – Raw logs lack context. Identifying related packets across different components and time frames is labour‑intensive and error‑prone.

The RAG‑based VPC Flow Log Analyzer addresses these problems with:

  • Streamlit – interactive UI
  • LangChain – RAG orchestration
  • Chroma – vector database
  • OpenAI GPT‑4o – reasoning engine

At the end you’ll have a conversational security assistant capable of answering questions such as:

  • “Which IPs were rejected?”
  • “Was there unusual traffic to port 22?”
  • “Which destinations received the most packets?”

RAG Workflow

Functional Components

| Component | Role | Implementation |
|---|---|---|
| Data Ingestion & Transformation (“Translator”) | Turns raw VPC Flow Log strings (e.g., `2 123... 443 6 ACCEPT`) into human‑readable sentences such as “Source 10.0.1.5 sent 1000 bytes to port 443 and was ACCEPTED.” | Custom Python parser |
| Embedding Model (“Encoder”) | Converts each log sentence into a numerical fingerprint (vector) for semantic search | `text-embedding-3-small` (OpenAI) |
| Vector Database (“Memory”) | Stores the vectors and enables fast similarity search | ChromaDB (local) |
| RAG Orchestration & LLM (“Brain”) | Retrieves relevant vectors, feeds them to the LLM with a prompt, and returns a natural‑language answer | LangChain + GPT‑4o |
| Streamlit Frontend (“Cockpit”) | UI for uploading logs, managing API keys, and chatting with the assistant | Streamlit web framework |
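The “Translator” component can be sketched in a few lines of Python. The parser below is illustrative — it assumes AWS’s default version‑2 flow‑log field order rather than mirroring the repository’s exact implementation:

```python
# Protocol numbers → names for the common cases (IANA assignments).
PROTOCOLS = {"6": "TCP", "17": "UDP", "1": "ICMP"}

def flow_log_to_sentence(line: str) -> str:
    """Turn one default-format (version 2) VPC Flow Log record into a sentence.

    Field order: version account-id interface-id srcaddr dstaddr
    srcport dstport protocol packets bytes start end action log-status
    """
    f = line.split()
    proto = PROTOCOLS.get(f[7], f"protocol {f[7]}")
    return (f"Source {f[3]} sent {f[9]} bytes ({f[8]} packets) over {proto} "
            f"to {f[4]} on port {f[6]} and was {f[12]}ED.")

sample = ("2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.7 "
          "49152 443 6 10 1000 1620000000 1620000060 ACCEPT OK")
print(flow_log_to_sentence(sample))
# → Source 10.0.1.5 sent 1000 bytes (10 packets) over TCP to 10.0.2.7 on port 443 and was ACCEPTED.
```

Sentences like this embed far better than the raw space‑separated records, because the embedding model was trained on natural language.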

Implementation Steps

1️⃣ Clone the Repository & Set Up a Virtual Environment

git clone https://github.com/Damdev-95/rag_aws_flow_logs
cd rag_aws_flow_logs

python -m venv venv
source venv/bin/activate   # On Windows use `venv\Scripts\activate`

pip install -r requirements.txt

Workspace Code

2️⃣ Configure Environment Variables

Create a .env file (or export variables) containing your OpenAI API key:

OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXX

The code accesses it via:

import os
ENV_API_KEY = os.getenv("OPENAI_API_KEY")
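Note that `os.getenv` only sees variables already exported in the shell — a `.env` file is not read automatically. The repository may rely on `python-dotenv` for this; if you prefer zero extra dependencies, a minimal stdlib‑only loader (illustrative, assuming simple `KEY=VALUE` lines) looks like:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: one KEY=VALUE per line; blanks and '#' comments skipped."""
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: a variable already exported in the shell takes precedence.
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env_file()
```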

3️⃣ Run the Streamlit Application

streamlit run app.py

Web Application

Once you click Browse files, you can upload a VPC Flow Log (.txt) and start asking questions.

What to Expect

  • Upload a log file → the parser translates each line into a readable sentence.
  • Embedding step creates a vector for every sentence and stores it in ChromaDB.
  • Chat: type a natural‑language query; LangChain fetches the most relevant vectors and sends them, together with a system prompt, to GPT‑4o.
  • Response: the LLM returns a concise answer, optionally highlighting the relevant log entries.
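Under the hood, “fetches the most relevant vectors” is a nearest‑neighbour search by cosine similarity. A toy illustration with made‑up 3‑dimensional vectors (in the real app, the embeddings come from `text-embedding-3-small` and the search is done by ChromaDB):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, store, k=2):
    """Return the k stored sentences most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Toy "embeddings" -- illustrative only, not real model output.
store = [
    ("Source 10.0.1.5 was REJECTED on port 22",  [0.9, 0.1, 0.0]),
    ("Source 10.0.2.7 was ACCEPTED on port 443", [0.1, 0.9, 0.0]),
    ("Source 10.0.3.9 was REJECTED on port 22",  [0.8, 0.2, 0.1]),
]
query = [0.95, 0.05, 0.0]  # pretend embedding of "Which IPs were rejected?"
print(top_k(query, store))  # the two REJECTED sentences rank highest
```

Because similar sentences land near each other in embedding space, a question about rejections retrieves the REJECTED records even though the query never contains the exact log text.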

Further Reading & Resources

  • GitHub repository
  • Streamlit documentation
  • LangChain docs
  • ChromaDB
  • OpenAI embeddings

Happy hacking! 🎯

Log File Format

The analyzer expects the flow log as a plain‑text (.txt) file, one record per line.
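For reference, a record in AWS’s default (version 2) flow‑log format is space‑separated with this field order:

```
version account-id interface-id srcaddr dstaddr srcport dstport
protocol packets bytes start end action log-status

2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.7 49152 443 6 10 1000 1620000000 1620000060 ACCEPT OK
```

(The sample line is made up; custom flow‑log formats with different fields would need a matching parser.)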

Steps to Build the Knowledge Base

  1. Select “Build Knowledge Base” – the parsed log sentences are converted into vectors and stored in the vector database.

    Build Knowledge Base screen

  2. Vector Data Creation – the embedding step produces one vector per log sentence.

    Vector data view

  3. Index Creation – once embedding completes, the vector index is built and ready for querying.

    Index creation screen
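With the index built, each chat turn boils down to: retrieve the top‑matching log sentences, then stitch them into a prompt for GPT‑4o. A sketch of that assembly step (the actual system prompt in the repository likely differs):

```python
def build_prompt(question, retrieved):
    """Combine retrieved log sentences and the user question into one LLM prompt."""
    context = "\n".join(f"- {s}" for s in retrieved)
    return (
        "You are a network security assistant. Answer using only the "
        "flow-log context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Which IPs were rejected?",
    ["Source 10.0.1.5 was REJECTED on port 22",
     "Source 10.0.3.9 was REJECTED on port 22"],
)
print(prompt)
```

Grounding the model in retrieved context like this is what keeps answers tied to your actual logs instead of the model’s general knowledge.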

Sample Query

“What is the summary of the flow logs based on traffic accept and reject?”

Demo of query execution

Additional Example Queries with Interaction

Nice examples of queries

Final Result

Final output screenshot

Stay tuned for more RAG and generative‑AI projects in cloud networking in my upcoming articles. I look forward to your comments.
