Building Production RAG Pipelines on AWS with Bedrock and OpenSearch

Published: March 8, 2026 at 02:54 PM EDT
2 min read
Source: Dev.to

RAG (Retrieval‑Augmented Generation) is how enterprises deploy LLMs without fine‑tuning: the model is grounded in documents retrieved at query time instead of knowledge baked in during training. Most tutorials stop at the demo stage, but production RAG requires additional considerations.

RAG vs Fine‑Tuning vs Prompt Engineering

| Approach | Cost | Data Freshness | Accuracy | Complexity |
|---|---|---|---|---|
| RAG | Medium | Real‑time | High (with good retrieval) | Medium |
| Fine‑Tuning | High | Static (retraining needed) | High | High |
| Prompt Engineering | Low | Static | Variable | Low |

Architecture

The pipeline follows this flow:

Documents → Chunking → Embeddings → Vector Store → Query → Retrieval → LLM → Response
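
The chunking step of this flow can be sketched as a fixed‑size splitter with overlap. This is an illustrative helper, not code from the article, and it approximates tokens by whitespace words; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, approximating tokens by words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks
```

Each chunk would then be embedded (e.g., with Titan Embeddings, as in the query code below) and written to the vector store.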

Python Implementation

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
# Control-plane client for managing collections; data-plane vector queries
# are sent to the collection endpoint itself (see search_vectors below).
opensearch = boto3.client("opensearchserverless")

def query_knowledge_base(question: str, collection_id: str) -> str:
    # Generate embedding for the question
    embed_response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question})
    )
    query_embedding = json.loads(embed_response["body"].read())["embedding"]

    # Search the OpenSearch vector store (search_vectors is a k-NN query
    # helper against the collection endpoint, defined separately)
    results = search_vectors(query_embedding, collection_id, k=5)
    context = "\n".join([r["text"] for r in results])

    # Generate answer with context
    prompt = f"""Based on the following context, answer the question.

Context: {context}

Question: {question}

Answer:"""

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024
        })
    )
    return json.loads(response["body"].read())["content"][0]["text"]
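
The search_vectors helper used above is left undefined in the article. A minimal sketch against an OpenSearch k‑NN index might look like the following; the index name, field names, endpoint URL shape, and the opensearch-py client are all assumptions, not part of the original code:

```python
def build_knn_query(embedding: list[float], k: int) -> dict:
    """Build an OpenSearch k-NN query body for a vector field named 'embedding'."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": embedding, "k": k}}},
        "_source": ["text"],
    }

def search_vectors(embedding: list[float], collection_id: str, k: int = 5) -> list[dict]:
    # Deferred import: opensearch-py is an assumed dependency for the data plane.
    from opensearchpy import OpenSearch

    # Hypothetical endpoint shape for an OpenSearch Serverless collection.
    client = OpenSearch(hosts=[f"https://{collection_id}.us-east-1.aoss.amazonaws.com"])
    response = client.search(index="documents", body=build_knn_query(embedding, k))
    return [
        {"text": hit["_source"]["text"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]
```

In production, requests to a serverless collection also need SigV4 signing, which is omitted here for brevity.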

Hallucination Mitigation

  • Chunk size matters – 512 tokens with a 50‑token overlap is a common starting point; tune for your corpus.
  • Hybrid search – combine semantic and keyword (BM25) search.
  • Citation grounding – force the model to cite source chunks.
  • Confidence scoring – filter low‑relevance retrievals, e.g., drop chunks below a cosine‑similarity threshold.
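
The confidence‑scoring step can be sketched as a post‑retrieval filter. This is an illustrative example, and the 0.7 threshold is an assumption for demonstration, not a recommendation from the article:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_by_relevance(query_emb: list[float], results: list[dict],
                        threshold: float = 0.7) -> list[dict]:
    """Keep only retrieved chunks whose embedding clears the similarity threshold."""
    return [r for r in results
            if cosine_similarity(query_emb, r["embedding"]) >= threshold]
```

Chunks that survive the filter go into the prompt context; if nothing survives, the system can decline to answer rather than hallucinate.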