Building a RAG System from Scratch: Turning Aviation Disruption Data into an AI-Powered Q&A App

Published: March 9, 2026 at 12:37 AM EDT

11 min read

Source: Dev.to


Retrieval‑Augmented Generation (RAG) for the 2026 Iran‑US Conflict’s Impact on Global Civil Aviation

Live demo:
Source code:

In this article I’ll walk through the architecture, the decisions I made, and what I learned along the way.

The Problem

The Global Civil Aviation Disruption 2026 dataset on Kaggle contains six CSV files (218 records total) covering:

Airline financial losses
Airport disruptions
Airspace closures
Flight cancellations
Reroutes
Timeline of conflict events

Raw CSV data isn’t user‑friendly. To answer questions like “Which airline suffered the most?” or “What airports in Iran were closed?” you’d have to manually dig through spreadsheets. I wanted to make this data conversational—ask a question, get a clear answer with sources. That’s exactly what RAG does.

What is RAG?

RAG (Retrieval‑Augmented Generation) combines two steps:

Step	Description
Retrieval	Find the most relevant pieces of information from your data
Generation	Feed those pieces to an LLM to produce a human‑readable answer

The key insight: instead of fine‑tuning a model on your data (expensive, slow), you give the LLM the right context at query time. The model doesn’t need to “know” your data—it just needs to read it.
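
To make the "context at query time" idea concrete, here is a minimal sketch of how retrieved chunks might be spliced into a prompt. The function name, the template wording, and the sample chunk are illustrative assumptions, not the exact prompt used in this project.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Splice retrieved chunks into a prompt (hypothetical template)."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# One sample chunk, shortened for illustration.
chunks = ["Emirates (UAE) faces an estimated daily financial loss of $4,200,000 USD."]
prompt = build_prompt("Which airline suffered the most?", chunks)
print(prompt)
```

The model never sees the full dataset, only the handful of chunks that retrieval selected for this question.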

Architecture

CSV Files (6 tables, 218 records)
   → Python ingestion script converts each row to natural language
   → HuggingFace sentence‑transformers embeds each chunk (all‑MiniLM‑L6‑v2)
   → ChromaDB stores the vectors locally
   → FastAPI serves the /query endpoint
   → Angular frontend provides the chat UI
   → Deployed on Hugging Face Spaces (Docker)

Tech Stack

Layer	Tool	Why
Orchestration	LangChain	Mature RAG framework, pluggable components
Embeddings	HuggingFace all‑MiniLM‑L6‑v2	Fast, runs on CPU, no GPU needed
Vector Store	ChromaDB	Zero‑config, file‑based, perfect for small‑medium datasets
LLM	OpenAI GPT‑4o	Best answer quality for generation
API	FastAPI	Async, auto‑generates Swagger docs, production‑ready
Frontend	Angular	Integrated into my existing portfolio site
Deployment	Hugging Face Spaces (Docker)	Free tier, auto‑scaling, git‑based deploys

The Interesting Part: Structured Data + RAG

Most RAG tutorials use PDFs or plain‑text documents. My dataset is structured CSV data—rows and columns, not paragraphs. This required an extra step: converting each row into a natural‑language sentence before embedding.

Example row from airline_losses_estimate.csv

Emirates, UAE, 4200000, 18, 62, 2835200, 9180

Converted to:

“Emirates (UAE) faces an estimated daily financial loss of $4,200,000 USD due to the Iran‑US conflict. 18 flights were cancelled and 62 were rerouted, incurring $2,835,200 in additional fuel costs. Approximately 9,180 passengers were impacted.”

Embedding models understand natural language, not raw CSV columns. Each of the six CSV files has its own conversion function that produces a descriptive sentence with all the context needed for retrieval.
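
As a sketch of what one such conversion function could look like for the example row above: only `airline`, `country`, and `estimated_daily_loss_usd` are column names that appear in this article, so the remaining keys (`cancelled_flights`, `rerouted_flights`, `additional_fuel_cost_usd`, `passengers_impacted`) are guesses and the real CSV headers may differ.

```python
def row_to_text_airline_losses(row: dict) -> str:
    """Render one airline-loss row as a descriptive sentence."""
    return (
        f"{row['airline']} ({row['country']}) faces an estimated daily "
        f"financial loss of ${row['estimated_daily_loss_usd']:,.0f} USD "
        "due to the Iran-US conflict. "
        # The column names below are assumed, not taken from the dataset.
        f"{row['cancelled_flights']} flights were cancelled and "
        f"{row['rerouted_flights']} were rerouted, incurring "
        f"${row['additional_fuel_cost_usd']:,.0f} in additional fuel costs. "
        f"Approximately {row['passengers_impacted']:,} passengers were impacted."
    )


row = {
    "airline": "Emirates",
    "country": "UAE",
    "estimated_daily_loss_usd": 4200000,
    "cancelled_flights": 18,
    "rerouted_flights": 62,
    "additional_fuel_cost_usd": 2835200,
    "passengers_impacted": 9180,
}
sentence = row_to_text_airline_losses(row)
print(sentence)
```

The number formatting expects numeric values. Rows read with the standard `csv` module arrive as strings and would need casting first, whereas pandas typically infers numeric columns on load.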

Building It: Step‑by‑Step

1. Ingestion

The ingestion script reads all six CSVs, converts each row to a natural‑language chunk, and stores it in ChromaDB with metadata (source file, category, original field values).

# Each CSV file has a dedicated row‑to‑text converter
def row_to_text_airline_losses(row):
    return (
        f"{row['airline']} ({row['country']}) faces an estimated daily "
        f"financial loss of ${row['estimated_daily_loss_usd']:,.0f} USD..."
    )

Result: 218 documents across six categories — small enough for a single ChromaDB collection, large enough to need proper retrieval.
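
The pairing of each text chunk with its metadata can be sketched without any vector-store dependency. The helper name `build_documents`, the dictionary layout, and the two-column sample CSV are all hypothetical and only illustrate the shape of the data before it is stored.

```python
import csv
import io


def build_documents(csv_text: str, source: str, category: str, to_text) -> list[dict]:
    """Turn every CSV row into a text chunk plus metadata for the vector store."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        documents.append(
            {
                "text": to_text(row),
                # Keep the raw fields so answers can cite exact values later.
                "metadata": {**row, "source": source, "category": category},
            }
        )
    return documents


# Hypothetical two-column sample; the real files have more columns.
sample_csv = "airline,country\nEmirates,UAE\nExample Air,XX\n"
docs = build_documents(
    sample_csv,
    source="airline_losses_estimate.csv",
    category="airline_losses",
    to_text=lambda r: f"{r['airline']} ({r['country']})",
)
print(docs[0])
```

Parallel lists of texts and metadata dictionaries like these are the kind of input that LangChain's `Chroma.from_texts(texts, embedding, metadatas=...)` accepts, though the exact ingestion call used in this project may differ.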

2. Embedding

I used all‑MiniLM‑L6‑v2 from HuggingFace’s sentence‑transformers. It produces 384‑dimensional vectors and runs comfortably on CPU (no GPU, no cloud embedding API, no cost).

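# Assumed import path; older LangChain versions expose this class
# from langchain_community.embeddings instead.
from langchain_huggingface import HuggingFaceEmbeddings

# Load the sentence-transformers model on CPU
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)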
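
To show what comparing these vectors amounts to, here is a dependency-free sketch of a brute-force similarity search. The three-dimensional vectors are invented for illustration (real vectors from this model have 384 dimensions), and using cosine similarity is an assumption: a vector store may be configured with a different distance metric and will normally use an index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors, invented for illustration.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query_vec = [2.0, 0.0, 0.0]  # same direction as the first document, different length
print(top_k(query_vec, doc_vecs, k=2))  # → [0, 2]
```

Cosine similarity ignores vector length, which is why the query matches the first document perfectly despite being twice as long.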

3. Retrieval + Generation

At query time, the user’s question is embedded with the same model, and ChromaDB returns the top‑k most similar chunks. Those chunks are injected into a prompt template and sent to GPT‑4o:

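from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# retriever, format_docs, prompt, and llm are defined elsewhere in the app
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)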

The prompt instructs the model to act as an aviation intelligence analyst and to answer using only the provided context, which reduces (but does not eliminate) the risk of hallucination.
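
The `format_docs` helper used in the chain is not shown here. A minimal version might look like the following sketch. The numbering and source tags are assumptions, and `Doc` is a stand-in class that only mirrors the `page_content` and `metadata` attributes of LangChain's `Document` so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (mirrors LangChain's Document fields)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc]) -> str:
    """Join retrieved chunks into one context string, tagging each with its source."""
    return "\n\n".join(
        f"[{i}] ({doc.metadata.get('source', 'unknown')}) {doc.page_content}"
        for i, doc in enumerate(docs, start=1)
    )


docs = [
    Doc(
        "Emirates (UAE) faces an estimated daily financial loss of $4,200,000 USD.",
        {"source": "airline_losses_estimate.csv"},
    ),
    Doc("A chunk with no source metadata."),
]
context = format_docs(docs)
print(context)
```

Tagging each chunk with an index and its source file gives the model something concrete to cite when the prompt asks for sources.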

4. API

FastAPI wraps the RAG pipeline into a clean REST endpoint:

POST /query
{
  "question": "Which airline had the highest financial loss?",
  "k": 5
}

The response includes the answer and the source documents used to generate it, providing full transparency.
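
The request and response shapes can be sketched with the standard library alone. In the real service these would more likely be Pydantic models behind a FastAPI route. The field names `answer` and `sources`, the bounds on `k`, and the `fake_pipeline` stub are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QueryRequest:
    question: str
    k: int = 5

    def __post_init__(self) -> None:
        # Bounds are illustrative; pick limits that suit your dataset size.
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if not 1 <= self.k <= 20:
            raise ValueError("k must be between 1 and 20")


@dataclass
class QueryResponse:
    answer: str
    sources: list[dict]


def handle_query(payload: str, answer_fn) -> str:
    """Parse a JSON request, run the pipeline, and serialise the response."""
    request = QueryRequest(**json.loads(payload))
    answer, sources = answer_fn(request.question, request.k)
    return json.dumps(asdict(QueryResponse(answer, sources)))


def fake_pipeline(question: str, k: int):
    """Stub standing in for the real retrieval + LLM call."""
    return "stub answer", [{"source": "airline_losses_estimate.csv"}][:k]


body = '{"question": "Which airline had the highest financial loss?", "k": 5}'
print(handle_query(body, fake_pipeline))
```

Validating `k` at the boundary keeps a careless client from requesting more chunks than the prompt can hold.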

5. Deployment

The entire system is containerized with Docker and deployed on Hugging Face Spaces (free tier). The vector store is built during the Docker build phase, so it’s baked into the image—no cold‑start database initialization.

What I Learned

Structured data needs extra love in RAG.
You can’t just throw CSVs at an embedding model. Converting rows to natural‑language sentences dramatically improves retrieval quality.
Embedding choice matters.
A lightweight, CPU‑friendly model (all‑MiniLM‑L6‑v2) is sufficient for small tabular datasets and keeps costs near zero.
Metadata is a lifesaver.
Storing the original fields as metadata lets you surface the raw numbers alongside the LLM’s answer, giving users confidence in the result.
Prompt engineering prevents hallucination.
Explicitly telling the model to answer only from the supplied context (and to cite sources) reduces the risk of fabricated facts.
Docker‑time vector store simplifies deployment.
Pre‑building the ChromaDB collection during image creation eliminates the need for a separate data‑loading step on startup, which is crucial for free‑tier hosting where cold starts are penalized.

Final Thoughts

RAG isn't just for PDFs or web pages: any structured dataset can become a conversational knowledge base with a little preprocessing. By turning each CSV row into a concise, natural-language description, I was able to pair a free, CPU-friendly embedding model with a powerful LLM (GPT-4o) to answer real-world aviation-intelligence questions in seconds. Apart from the OpenAI API calls, the rest of the stack is either open-source or runs on a free tier.

Feel free to explore the live demo, fork the source code, or adapt the pipeline to your own tabular data!


Tips for Building a Small‑Scale RAG System

You don’t need a GPU for embeddings.
all‑MiniLM‑L6‑v2 runs in milliseconds on a CPU for modest datasets. Don’t over‑engineer the infrastructure.
ChromaDB is perfect for prototyping.
Zero‑config, runs embedded in your Python process, and persists to disk. For 218 documents it’s essentially instant.
Hugging Face Spaces is underrated for API hosting.
Free Docker‑based deployment with auto‑generated URLs. The only trade‑off is the cold‑start after inactivity (≈ 30‑60 s).
Context‑stuffing beats RAG for small data.
I also built a portfolio‑chatbot endpoint on the same API – it simply stuffs the entire markdown file into the system prompt. No embeddings, no vector store. When your data fits in the context window, keep it simple.
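
The last tip can be turned into a rough decision rule. The sketch below assumes about four characters per token, a crude heuristic for English text rather than a real tokenizer, and uses a 128,000-token window and a 2,000-token reserve purely as example values.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token for English."""
    return max(1, len(text) // 4)


def choose_strategy(corpus: str, context_window: int, reserved: int = 2000) -> str:
    """Return 'stuff' if the whole corpus fits in the prompt budget, else 'retrieve'."""
    budget = context_window - reserved  # leave room for the question and the answer
    return "stuff" if estimate_tokens(corpus) <= budget else "retrieve"


# Synthetic corpora, invented to show both outcomes.
small_corpus = "A short portfolio summary. " * 100
large_corpus = "One disruption record per line.\n" * 50_000

print(choose_strategy(small_corpus, context_window=128_000))  # → stuff
print(choose_strategy(large_corpus, context_window=128_000))  # → retrieve
```

For a production decision, count tokens with the tokenizer that matches your model instead of relying on the character heuristic.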

Try It Yourself

Live Demo: (link placeholder)

Example questions to try:

“Which airline suffered the highest daily financial loss?”
“What airports in Iran were closed?”
“How many flights were cancelled from Dubai on March 1st?”
“What was the aviation impact of the Natanz airstrike?”
“Which countries closed their airspace and for how long?”

Source Code: (link placeholder)

API Docs: (link placeholder)

What’s Next

Adding hybrid search (vector + keyword) via Azure AI Search for better retrieval.
Exploring streaming responses for a more interactive chat experience.
Evaluating retrieval quality with metrics like precision@k and MRR.
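
For the evaluation item, both metrics are short enough to write by hand. This is a sketch using hypothetical document IDs and relevance judgments. A real evaluation would need a labelled set of questions with known relevant chunks.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical document IDs and relevance judgments.
runs = [
    (["a", "b", "c"], {"a"}),       # first relevant hit at rank 1
    (["d", "e", "f"], {"e", "f"}),  # first relevant hit at rank 2
]
p_at_3 = precision_at_k(runs[1][0], runs[1][1], k=3)
mrr = mean_reciprocal_rank(runs)
print(p_at_3, mrr)
```

Precision@k rewards filling the top-k with relevant chunks, while MRR only cares how early the first relevant chunk appears, so the two can disagree about which retriever configuration is better.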

If you’re building your first RAG system, start small—a few CSVs, a local vector store, and a cloud LLM. Get the pipeline working end‑to‑end, then optimise. The fundamentals transfer directly to production‑scale systems.

Built with Python, LangChain, ChromaDB, Hugging Face, OpenAI GPT‑4o, FastAPI, Angular, and Hugging Face Spaces.

Connect with me on LinkedIn or check out more projects on GitHub.