I Built a Python CLI Tool for RAG Over Any Document Folder
Source: Dev.to
rag‑cli‑tool – Zero‑config CLI for Retrieval‑Augmented Generation
A zero‑config command‑line tool for retrieval‑augmented generation — index a folder, ask questions, get cited answers. Works locally with Ollama or with cloud APIs.
The Problem
Every time I wanted to ask questions about a set of documents I wrote the same ~100 lines of boilerplate:
- Load docs
- Chunk them
- Embed them
- Store in a vector DB
- Retrieve
- Generate
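The six steps above fit in a short, dependency-free sketch. This is a toy stand-in, not the tool's code: a bag-of-words `Counter` plays the embedding model, the "vector store" is a list, and retrieval is a linear cosine scan. All names here are mine, chosen for illustration.

```python
import math
import re
from collections import Counter

def load_docs(texts: dict[str, str]) -> list[tuple[str, str]]:
    """Step 1: 'load' documents (here, from an in-memory dict)."""
    return list(texts.items())

def chunk(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Step 2: split into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Step 3: bag-of-words 'embedding' (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(store: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Steps 4-5: the 'store' is a list; retrieval is a cosine scan."""
    q = embed(question)
    return [c for c, _ in sorted(store, key=lambda cv: -cosine(q, cv[1]))[:k]]

docs = {
    "policy.md": "Our refund policy: refunds are issued within 30 days of purchase.",
    "shipping.md": "Shipping is free for orders over $50 and takes 3-5 business days.",
}
store = [(c, embed(c)) for _, text in load_docs(docs) for c in chunk(text)]
context = retrieve(store, "What is the refund policy?")
# Step 6 would feed `context` plus the question into an LLM prompt.
```

Swap in a real embedding model and an LLM call and you have roughly the boilerplate the post is complaining about.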
I got tired of it, so I built a CLI tool that does it in two commands.
RAG prototyping has too much ceremony.
You have a folder of PDFs, Markdown files, maybe some text notes. You want to ask questions about them. Simple enough in theory.
In practice, you’re wiring up document loaders, picking a chunking strategy, initializing an embedding provider, setting up a vector store, writing retrieval logic, and finally getting to the part you actually care about: generating an answer. And you do this every single time you start a new project or want to test a new document set.
Existing solutions sit at the extremes:
- Full frameworks (LangChain, LlamaIndex) are powerful but heavy – you pull in dozens of abstractions just to ask a question about a folder.
- Tutorial notebooks are disposable – they work once, for one demo, and you throw them away.
I wanted something in the middle: a CLI that’s zero‑config for the common case, configurable when you need it, and built from reusable pieces. No framework dependencies. No notebook rot. Just a tool that does one thing well.
What rag‑cli‑tool Gives You
rag-cli index ./my-docs/
rag-cli ask "What is the refund policy?"
Point it at a folder → it indexes everything. Ask a question → it answers from your documents.
Supported formats include PDF, Markdown, plain text, and DOCX.
Under the Hood
- Index – loads documents from the directory, splits them into overlapping chunks using a recursive text splitter, generates embeddings, and stores everything in a local ChromaDB instance.
- Ask – embeds your question, retrieves the most similar chunks, and generates an answer using only the retrieved context (strict RAG, no hallucination from external knowledge).
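The post doesn't show the splitter itself; a minimal recursive splitter in that spirit (names, defaults, and the overlap-stitching strategy are mine, not the tool's) tries coarse separators first and only hard-cuts characters as a last resort:

```python
def recursive_split(text: str, chunk_size: int = 200, overlap: int = 40,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that fits, recursing on finer
    separators for oversized pieces, then stitch neighbours together
    with up to `overlap` characters of shared context."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, chunk_size, overlap, separators))
            chunks = []
            for i, piece in enumerate(pieces):
                # prepend the tail of the previous piece as overlap
                prefix = pieces[i - 1][-overlap:] if i else ""
                chunks.append((prefix + " " + piece).strip() if prefix else piece)
            return chunks
    # no separator present: hard-cut by characters with overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]
```

Because separators are tried in order, paragraph boundaries survive whenever possible, which is what makes the retrieved chunks coherent.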
Tech Stack (Deliberately Boring)
| Component | Reason |
|---|---|
| ChromaDB | Runs locally with zero setup – no Docker, no server, just a directory. |
| Typer | CLI framework that gives type‑checked arguments and auto‑generated help for free. |
| Rich | Pretty terminal output – progress bars and formatted answers. |
| Pydantic Settings | Reads config from environment variables and .env files with validation and defaults. |
You can run it fully locally with Ollama (no API keys) or use cloud providers.
Local (no API keys)
RAG_CLI_MODEL=ollama:llama3.2 \
RAG_CLI_EMBEDDING_MODEL=ollama:nomic-embed-text \
rag-cli ask "What are the payment terms?"
Cloud (Anthropic + OpenAI)
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
rag-cli ask "What are the payment terms?"
Repository Layout
src/
├── rag_cli/ # CLI interface (Typer + Rich)
├── llm_core/ # LLM abstraction layer (providers, config, retry)
└── rag_core/ # RAG pipeline (loaders, chunking, embeddings, retrieval)
- llm_core – Handles everything related to calling language models. It defines a provider interface, implements Anthropic and Ollama adapters, and includes retry logic with exponential backoff. It knows nothing about RAG, documents, or CLI output.
- rag_core – Handles the RAG pipeline: loading documents, chunking text, generating embeddings, storing vectors, and retrieving results. It depends on llm_core for embedding providers but has no opinion about how you present results to users.
- rag_cli – The thin layer that wires everything together. It handles argument parsing, progress bars, and formatted output. The actual logic is a few lines of glue code.
The separation is practical, not academic. Future projects might be a web app, a Slack bot, or an API service. When that happens I can simply import rag_core (or llm_core) without extracting logic from a monolithic CLI.
Extensible Architecture
Every major component has an abstract base class:
- BaseLLMProvider
- BaseEmbedder
- BaseChunker
- BaseRetriever
- BaseVectorStore
Today there is one concrete implementation of each. Tomorrow I can add a GraphRAG retriever or a Pinecone vector store without touching existing code. The abstractions are the minimum interface each component needs to be swappable.
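The pattern is plain `abc`. Here is a minimal sketch using `BaseVectorStore` as the example; the method names and signatures are illustrative, not necessarily the tool's:

```python
from abc import ABC, abstractmethod

class BaseVectorStore(ABC):
    """The minimum interface a store needs to be swappable."""

    @abstractmethod
    def add(self, ids: list[str], vectors: list[list[float]], texts: list[str]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], k: int) -> list[str]: ...

class InMemoryStore(BaseVectorStore):
    """Toy concrete implementation: exact dot-product search over a list."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], str]] = []

    def add(self, ids, vectors, texts):
        self._rows.extend(zip(vectors, texts))

    def query(self, vector, k):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        return [t for _, t in sorted(self._rows, key=lambda r: -dot(vector, r[0]))[:k]]

store = InMemoryStore()
store.add(["a", "b"], [[1.0, 0.0], [0.0, 1.0]], ["about cats", "about dogs"])
```

A ChromaDB-backed or Pinecone-backed class would implement the same two methods, and nothing upstream would change.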
The project has full test coverage across all three packages – 37 tests covering providers, configuration, chunking, embeddings, retrieval, and vector‑store operations.
Four Key Decisions & Their Rationale
ChromaDB over FAISS or Pinecone
FAISS requires NumPy gymnastics for persistence and doesn't store metadata natively. Pinecone needs an account and network access. ChromaDB gives a local, persistent vector store with metadata filtering in one line: ChromaStore(persist_dir=path). For an offline CLI tool this was the only real choice.
Typer over Click
Click is battle-tested, but Typer lets you define arguments with normal Python type hints: no per-option decorators, no callbacks. Help text writes itself.
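A small illustration of the Typer style, assuming Typer is installed. The `ask` command here is a mock that just echoes its inputs, not the tool's real command:

```python
import typer

app = typer.Typer()

@app.command()
def ask(question: str, top_k: int = 4, show_sources: bool = False) -> None:
    """Plain type hints become the CLI: `question` is a required
    positional, `--top-k` an int option, `--show-sources` a flag."""
    typer.echo(f"{question} (k={top_k}, sources={show_sources})")

if __name__ == "__main__":
    app()
```

`--help` output, type coercion, and error messages all come from the signature alone.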
Pydantic Settings for configuration
CLI tools need to read config from environment variables and .env files. Pydantic Settings does both, with validation, defaults, and type coercion. One class definition replaces a dozen os.getenv() calls.
Provider routing via model-string prefix
Instead of separate config fields for provider selection, the model string itself encodes the provider (e.g., ollama:llama3.2, anthropic:claude-2). This keeps the CLI surface tiny while remaining extensible.
TL;DR
rag-cli-tool gives you a zero‑config, two‑command workflow for Retrieval‑Augmented Generation:
rag-cli index ./my-docs/
rag-cli ask "What is the refund policy?"
It’s built on a clean, test‑covered, three‑package architecture that can be reused in any future AI project. Enjoy fast, local RAG without the ceremony!
Overview
The model string does double duty:
- claude-3-5-sonnet-latest routes to Anthropic.
- ollama:llama3.2 routes to Ollama.
One config field, zero ambiguity. This pattern scales to any number of providers without config proliferation.
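The routing rule fits in a few lines. This sketch (function name mine) mirrors the behavior described above, with un-prefixed names falling through to the Anthropic default:

```python
def resolve_provider(model: str, default: str = "anthropic") -> tuple[str, str]:
    """Split 'provider:model' strings; bare model names route to the
    default provider, matching the post's examples."""
    provider, sep, name = model.partition(":")
    return (provider, name) if sep else (default, model)
```

Adding a new provider means adding an adapter keyed by a new prefix; the config surface stays one string.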
Lessons Learned
- RAG tooling is 80/20 – I expected the infrastructure (vector stores, embedding APIs, retrieval logic) to dominate development time. Instead, chunking decisions took the most effort:
- How big should chunks be?
- How much overlap?
- Which separators produce coherent boundaries?
The pipeline code was straightforward; the tuning was where the real work happened.
- CLI-first development forces good API design. When the first consumer is a command-line interface, you can't hide behind web-framework magic. Every input is explicit, every output is visible. This discipline produced cleaner interfaces in llm_core and rag_core than I would have gotten starting with a web app.
- Intentional feature omissions for v0.1:
- Chat mode with conversation history
- Benchmarking against different chunking strategies
- A web UI
- Support for more vector stores
These are reasonable features, but they would have been scope creep. The foundation is solid, the abstractions are in place, and each of those features is an afternoon of work because the architecture supports extension.
- The best developer tools solve your own problems first. rag-cli-tool started as "I'm tired of writing this boilerplate" and turned into reusable building blocks for my entire AI project portfolio. If you work with documents and want a fast way to prototype RAG pipelines, give it a try.
Install
From PyPI
pip install rag-cli-tool
From source
git clone https://github.com/LukaszGrochal/rag-cli-tool
cd rag-cli-tool
pip install -e .
Quick Start (with Ollama – free, local)
# Pull models
ollama pull llama3.2
ollama pull nomic-embed-text
# Index sample documents
rag-cli index ./sample-docs/
# Ask a question
rag-cli ask "What is the refund policy?"
Links
- PyPI: https://pypi.org/project/rag-cli-tool/
- GitHub: https://github.com/LukaszGrochal/rag-cli-tool
Tags: python, cli, rag, ai, developer-tools