Building Production RAG Pipelines on AWS with Bedrock and OpenSearch

Published: March 8, 2026 at 02:54 PM EDT
2 min read
Source: Dev.to

RAG (Retrieval‑Augmented Generation) is how enterprises deploy LLMs without fine‑tuning: the model is grounded in documents retrieved at query time instead of knowledge baked in during training. Most tutorials stop at the demo stage, but production RAG requires additional considerations.

RAG vs Fine‑Tuning vs Prompt Engineering

| Approach | Cost | Data Freshness | Accuracy | Complexity |
|---|---|---|---|---|
| RAG | Medium | Real‑time | High (with good retrieval) | Medium |
| Fine‑Tuning | High | Static (retraining needed) | High | High |
| Prompt Engineering | Low | Static | Variable | Low |

Architecture

The pipeline follows this flow:

Documents → Chunking → Embeddings → Vector Store → Query → Retrieval → LLM → Response
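
The chunking step of this flow can be sketched as a fixed‑size splitter with overlap. This is an illustrative helper, not code from the article, and it approximates tokens by whitespace words; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, approximating tokens by words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks
```

Each chunk would then be embedded (e.g., with Titan Embeddings, as in the query code below) and written to the vector store.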

Python Implementation

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
# Control-plane client for managing collections; data-plane vector queries
# are sent to the collection endpoint itself (see search_vectors below).
opensearch = boto3.client("opensearchserverless")

def query_knowledge_base(question: str, collection_id: str) -> str:
    # Generate embedding for the question
    embed_response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question})
    )
    query_embedding = json.loads(embed_response["body"].read())["embedding"]

    # Search the OpenSearch vector store (search_vectors is a k-NN query
    # helper against the collection endpoint, defined separately)
    results = search_vectors(query_embedding, collection_id, k=5)
    context = "\n".join([r["text"] for r in results])

    # Generate answer with context
    prompt = f"""Based on the following context, answer the question.

Context: {context}

Question: {question}

Answer:"""

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024
        })
    )
    return json.loads(response["body"].read())["content"][0]["text"]
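
The search_vectors helper used above is left undefined in the article. A minimal sketch against an OpenSearch k‑NN index might look like the following; the index name, field names, endpoint URL shape, and the opensearch-py client are all assumptions, not part of the original code:

```python
def build_knn_query(embedding: list[float], k: int) -> dict:
    """Build an OpenSearch k-NN query body for a vector field named 'embedding'."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": embedding, "k": k}}},
        "_source": ["text"],
    }

def search_vectors(embedding: list[float], collection_id: str, k: int = 5) -> list[dict]:
    # Deferred import: opensearch-py is an assumed dependency for the data plane.
    from opensearchpy import OpenSearch

    # Hypothetical endpoint shape for an OpenSearch Serverless collection.
    client = OpenSearch(hosts=[f"https://{collection_id}.us-east-1.aoss.amazonaws.com"])
    response = client.search(index="documents", body=build_knn_query(embedding, k))
    return [
        {"text": hit["_source"]["text"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]
```

In production, requests to a serverless collection also need SigV4 signing, which is omitted here for brevity.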

Hallucination Mitigation

  • Chunk size matters – 512 tokens with a 50‑token overlap is a common starting point; tune for your corpus.
  • Hybrid search – combine semantic and keyword (BM25) search.
  • Citation grounding – force the model to cite source chunks.
  • Confidence scoring – filter low‑relevance retrievals, e.g., drop chunks below a cosine‑similarity threshold.
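
The confidence‑scoring step can be sketched as a post‑retrieval filter. This is an illustrative example, and the 0.7 threshold is an assumption for demonstration, not a recommendation from the article:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_by_relevance(query_emb: list[float], results: list[dict],
                        threshold: float = 0.7) -> list[dict]:
    """Keep only retrieved chunks whose embedding clears the similarity threshold."""
    return [r for r in results
            if cosine_similarity(query_emb, r["embedding"]) >= threshold]
```

Chunks that survive the filter go into the prompt context; if nothing survives, the system can decline to answer rather than hallucinate.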