I Built a Python CLI Tool for RAG Over Any Document Folder
Source: Dev.to
rag‑cli‑tool – Zero‑config CLI for Retrieval‑Augmented Generation
A zero‑config command‑line tool for retrieval‑augmented generation — index a folder, ask questions, get cited answers. Works locally with Ollama or with cloud APIs.
The Problem
Every time I wanted to ask questions about a set of documents I wrote the same ~100 lines of boilerplate:
- Load docs
- Chunk them
- Embed them
- Store in a vector DB
- Retrieve
- Generate
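The six steps above fit in a short, dependency-free sketch. This is a toy stand-in, not the tool's code: a bag-of-words `Counter` plays the embedding model, the "vector store" is a list, and retrieval is a linear cosine scan. All names here are mine, chosen for illustration.

```python
import math
import re
from collections import Counter

def load_docs(texts: dict[str, str]) -> list[tuple[str, str]]:
    """Step 1: 'load' documents (here, from an in-memory dict)."""
    return list(texts.items())

def chunk(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Step 2: split into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Step 3: bag-of-words 'embedding' (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(store: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Steps 4-5: the 'store' is a list; retrieval is a cosine scan."""
    q = embed(question)
    return [c for c, _ in sorted(store, key=lambda cv: -cosine(q, cv[1]))[:k]]

docs = {
    "policy.md": "Our refund policy: refunds are issued within 30 days of purchase.",
    "shipping.md": "Shipping is free for orders over $50 and takes 3-5 business days.",
}
store = [(c, embed(c)) for _, text in load_docs(docs) for c in chunk(text)]
context = retrieve(store, "What is the refund policy?")
# Step 6 would feed `context` plus the question into an LLM prompt.
```

Swap in a real embedding model and an LLM call and you have roughly the boilerplate the post is complaining about.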
I got tired of it, so I built a CLI tool that does it in two commands.
RAG prototyping has too much ceremony.
You have a folder of PDFs, Markdown files, maybe some text notes. You want to ask questions about them. Simple enough in theory.
In practice, you’re wiring up document loaders, picking a chunking strategy, initializing an embedding provider, setting up a vector store, writing retrieval logic, and finally getting to the part you actually care about: generating an answer. And you do this every single time you start a new project or want to test a new document set.
Existing solutions sit at the extremes:
- Full frameworks (LangChain, LlamaIndex) are powerful but heavy – you pull in dozens of abstractions just to ask a question about a folder.
- Tutorial notebooks are disposable – they work once, for one demo, and you throw them away.
I wanted something in the middle: a CLI that’s zero‑config for the common case, configurable when you need it, and built from reusable pieces. No framework dependencies. No notebook rot. Just a tool that does one thing well.
What rag‑cli‑tool Gives You
rag-cli index ./my-docs/
rag-cli ask "What is the refund policy?"
Point it at a folder → it indexes everything. Ask a question → it answers from your documents.
Supported formats include PDF, Markdown, plain text, and DOCX.
Under the Hood
- Index – loads documents from the directory, splits them into overlapping chunks using a recursive text splitter, generates embeddings, and stores everything in a local ChromaDB instance.
- Ask – embeds your question, retrieves the most similar chunks, and generates an answer using only the retrieved context (strict RAG, no hallucination from external knowledge).
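The post doesn't show the splitter itself; a minimal recursive splitter in that spirit (names, defaults, and the overlap-stitching strategy are mine, not the tool's) tries coarse separators first and only hard-cuts characters as a last resort:

```python
def recursive_split(text: str, chunk_size: int = 200, overlap: int = 40,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that fits, recursing on finer
    separators for oversized pieces, then stitch neighbours together
    with up to `overlap` characters of shared context."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, chunk_size, overlap, separators))
            chunks = []
            for i, piece in enumerate(pieces):
                # prepend the tail of the previous piece as overlap
                prefix = pieces[i - 1][-overlap:] if i else ""
                chunks.append((prefix + " " + piece).strip() if prefix else piece)
            return chunks
    # no separator present: hard-cut by characters with overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]
```

Because separators are tried in order, paragraph boundaries survive whenever possible, which is what makes the retrieved chunks coherent.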
Tech Stack (Deliberately Boring)
| Component | Reason |
|---|---|
| ChromaDB | Runs locally with zero setup – no Docker, no server, just a directory. |
| Typer | CLI framework that gives type‑checked arguments and auto‑generated help for free. |
| Rich | Pretty terminal output – progress bars and formatted answers. |
| Pydantic Settings | Reads config from environment variables and .env files with validation and defaults. |
You can run it fully locally with Ollama (no API keys) or use cloud providers.
Local (no API keys)
RAG_CLI_MODEL=ollama:llama3.2 \
RAG_CLI_EMBEDDING_MODEL=ollama:nomic-embed-text \
rag-cli ask "What are the payment terms?"
Cloud (Anthropic + OpenAI)
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
rag-cli ask "What are the payment terms?"
Repository Layout
src/
├── rag_cli/ # CLI interface (Typer + Rich)
├── llm_core/ # LLM abstraction layer (providers, config, retry)
└── rag_core/ # RAG pipeline (loaders, chunking, embeddings, retrieval)
- llm_core – Handles everything related to calling language models. It defines a provider interface, implements Anthropic and Ollama adapters, and includes retry logic with exponential backoff. It knows nothing about RAG, documents, or CLI output.
- rag_core – Handles the RAG pipeline: loading documents, chunking text, generating embeddings, storing vectors, and retrieving results. It depends on llm_core for embedding providers but has no opinion about how you present results to users.
- rag_cli – The thin layer that wires everything together. It handles argument parsing, progress bars, and formatted output. The actual logic is a few lines of glue code.
The separation is practical, not academic. Future projects might be a web app, a Slack bot, or an API service. When that happens I can simply import rag_core (or llm_core) without extracting logic from a monolithic CLI.
Extensible Architecture
Every major component has an abstract base class:
- BaseLLMProvider
- BaseEmbedder
- BaseChunker
- BaseRetriever
- BaseVectorStore
Today there is one concrete implementation of each. Tomorrow I can add a GraphRAG retriever or a Pinecone vector store without touching existing code. The abstractions are the minimum interface each component needs to be swappable.
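The pattern is plain `abc`. Here is a minimal sketch using `BaseVectorStore` as the example; the method names and signatures are illustrative, not necessarily the tool's:

```python
from abc import ABC, abstractmethod

class BaseVectorStore(ABC):
    """The minimum interface a store needs to be swappable."""

    @abstractmethod
    def add(self, ids: list[str], vectors: list[list[float]], texts: list[str]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], k: int) -> list[str]: ...

class InMemoryStore(BaseVectorStore):
    """Toy concrete implementation: exact dot-product search over a list."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], str]] = []

    def add(self, ids, vectors, texts):
        self._rows.extend(zip(vectors, texts))

    def query(self, vector, k):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        return [t for _, t in sorted(self._rows, key=lambda r: -dot(vector, r[0]))[:k]]

store = InMemoryStore()
store.add(["a", "b"], [[1.0, 0.0], [0.0, 1.0]], ["about cats", "about dogs"])
```

A ChromaDB-backed or Pinecone-backed class would implement the same two methods, and nothing upstream would change.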
The project has full test coverage across all three packages – 37 tests covering providers, configuration, chunking, embeddings, retrieval, and vector‑store operations.
Four Key Decisions & Their Rationale
ChromaDB over FAISS or Pinecone
FAISS requires NumPy gymnastics for persistence and doesn't store metadata natively. Pinecone needs an account and network access. ChromaDB gives a local, persistent vector store with metadata filtering in one line: ChromaStore(persist_dir=path). For an offline CLI tool this was the only real choice.
Typer over Click
Click is battle-tested, but Typer lets you define arguments with normal Python type hints: no per-option decorators, no callbacks. Help text writes itself.
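A small illustration of the Typer style, assuming Typer is installed. The `ask` command here is a mock that just echoes its inputs, not the tool's real command:

```python
import typer

app = typer.Typer()

@app.command()
def ask(question: str, top_k: int = 4, show_sources: bool = False) -> None:
    """Plain type hints become the CLI: `question` is a required
    positional, `--top-k` an int option, `--show-sources` a flag."""
    typer.echo(f"{question} (k={top_k}, sources={show_sources})")

if __name__ == "__main__":
    app()
```

`--help` output, type coercion, and error messages all come from the signature alone.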
Pydantic Settings for configuration
CLI tools need to read config from environment variables and .env files. Pydantic Settings does both, with validation, defaults, and type coercion. One class definition replaces a dozen os.getenv() calls.
Provider routing via model-string prefix
Instead of separate config fields for provider selection, the model string itself encodes the provider (e.g., ollama:llama3.2, anthropic:claude-2). This keeps the CLI surface tiny while remaining extensible.
TL;DR
rag-cli-tool gives you a zero‑config, two‑command workflow for Retrieval‑Augmented Generation:
rag-cli index ./my-docs/
rag-cli ask "What is the refund policy?"
It’s built on a clean, test‑covered, three‑package architecture that can be reused in any future AI project. Enjoy fast, local RAG without the ceremony!
Overview
The model string does double duty:
- claude-3-5-sonnet-latest routes to Anthropic.
- ollama:llama3.2 routes to Ollama.
One config field, zero ambiguity. This pattern scales to any number of providers without config proliferation.
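The routing rule fits in a few lines. This sketch (function name mine) mirrors the behavior described above, with un-prefixed names falling through to the Anthropic default:

```python
def resolve_provider(model: str, default: str = "anthropic") -> tuple[str, str]:
    """Split 'provider:model' strings; bare model names route to the
    default provider, matching the post's examples."""
    provider, sep, name = model.partition(":")
    return (provider, name) if sep else (default, model)
```

Adding a new provider means adding an adapter keyed by a new prefix; the config surface stays one string.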
Lessons Learned
- RAG tooling is 80/20 – I expected the infrastructure (vector stores, embedding APIs, retrieval logic) to dominate development time. Instead, chunking decisions took the most effort:
- How big should chunks be?
- How much overlap?
- Which separators produce coherent boundaries?
The pipeline code was straightforward; the tuning was where the real work happened.
- CLI-first development forces good API design. When the first consumer is a command-line interface, you can't hide behind web-framework magic. Every input is explicit, every output is visible. This discipline produced cleaner interfaces in llm_core and rag_core than I would have gotten starting with a web app.
- Intentional feature omissions for v0.1:
- Chat mode with conversation history
- Benchmarking against different chunking strategies
- A web UI
- Support for more vector stores
These are reasonable features, but they would have been scope creep. The foundation is solid, the abstractions are in place, and each of those features is an afternoon of work because the architecture supports extension.
- The best developer tools solve your own problems first. rag-cli-tool started as "I'm tired of writing this boilerplate" and turned into reusable building blocks for my entire AI project portfolio. If you work with documents and want a fast way to prototype RAG pipelines, give it a try.
Install
From PyPI
pip install rag-cli-tool
From source
git clone https://github.com/LukaszGrochal/rag-cli-tool
cd rag-cli-tool
pip install -e .
Quick Start (with Ollama – free, local)
# Pull models
ollama pull llama3.2
ollama pull nomic-embed-text
# Index sample documents
rag-cli index ./sample-docs/
# Ask a question
rag-cli ask "What is the refund policy?"
Links
- PyPI: https://pypi.org/project/rag-cli-tool/
- GitHub: https://github.com/LukaszGrochal/rag-cli-tool
Tags: python, cli, rag, ai, developer-tools