Advanced RAG with Vertex AI RAG Engine and Terraform: Chunking, Hybrid Search, and Reranking 🧠

Published: March 1, 2026 at 03:00 AM EST
8 min read
Source: Dev.to

Overview

Basic chunking gets you a demo. Hybrid search, reranking with the Vertex AI Ranking API, metadata filtering, and tuned retrieval configs turn a RAG Engine corpus into a production system. All wired through Terraform and the Python SDK.

In RAG Post 1 we deployed a Vertex AI RAG Engine corpus with basic fixed‑size chunking. It works, but retrieval quality is mediocre. Your users ask nuanced questions and get incomplete or irrelevant answers back.

The fix isn’t a better generation model – it’s better retrieval. RAG Engine supports:

  • Chunking tuning
  • Hybrid search with configurable α weighting
  • Reranking via the Vertex AI Ranking API
  • Metadata filtering
  • Vector‑distance thresholds

The infrastructure layer (Terraform) and the operational layer (Python SDK) each handle different parts. This post covers the production patterns that make the difference. 🎯

Chunking

RAG Engine uses fixed‑size token chunking configured at file import time. Unlike AWS Bedrock (which offers semantic and hierarchical strategies as native options), GCP keeps chunking straightforward but gives you fine‑grained control over size and overlap.

Key insight: Chunking configuration is set per import operation, not per corpus. You can re‑import the same files with different chunking to test what works best.

from vertexai import rag

# Production chunking config
rag.import_files(
    corpus_name=corpus.name,
    paths=["gs://company-docs-prod/policies/"],
    transformation_config=rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(
            chunk_size=512,
            chunk_overlap=100,
        )
    ),
    max_embedding_requests_per_min=900,
)

| Document Type | Chunk Size | Overlap | Why |
|---|---|---|---|
| Short FAQs, Q&A pairs | 256 | 30 | Small chunks = precise matching |
| General docs, guides | 512 | 100 | Balanced precision and context |
| Long legal/technical docs | 1024 | 200 | Preserves cross-reference context |
| Pre-processed content (already split) | 0 (use as-is) | 0 | Already split at natural boundaries |

Tuning approach

  1. Start with chunk_size=512 / overlap=100.
  2. If answers lack context → increase to 1024/200.
  3. If retrieval returns irrelevant chunks → decrease to 256/50.

Re‑import and compare – the corpus supports multiple imports with different configs against the same files.
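These presets can be captured in a small lookup helper for import scripts. The mapping below just encodes the guidance from this post; it is not an official API:

```python
# Chunking presets from the tuning guidance in this post (illustrative,
# not an official API). Values are (chunk_size, chunk_overlap).
CHUNKING_PRESETS = {
    "faq": (256, 30),          # short Q&A pairs: precise matching
    "general": (512, 100),     # balanced default
    "long_form": (1024, 200),  # legal/technical: keep cross-reference context
    "pre_split": (0, 0),       # content already split at natural boundaries
}

def chunking_for(doc_type: str) -> tuple[int, int]:
    """Fall back to the balanced default for unknown document types."""
    return CHUNKING_PRESETS.get(doc_type, CHUNKING_PRESETS["general"])
```

Feed the result into `rag.ChunkingConfig(chunk_size=..., chunk_overlap=...)` when importing each document set.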

Embedding Rate

The max_embedding_requests_per_min parameter is critical in production. Without it, large imports can exhaust your embedding‑model quota and fail partway through. Set it below your project’s QPM limit for the embedding model.
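As a back-of-envelope check before a large import (the 1,000 QPM quota and 10% headroom figures are assumptions; verify your project's actual embedding-model quota):

```python
# Size the import rate with headroom below the project quota, and estimate
# how long a large import will take at that rate. Figures are illustrative.

def safe_embedding_qpm(project_quota_qpm: int, headroom: float = 0.1) -> int:
    """Leave headroom below the quota so imports don't fail partway."""
    return int(project_quota_qpm * (1 - headroom))

def estimated_import_minutes(num_chunks: int, qpm: int) -> float:
    """Rough duration assuming one embedding request per chunk."""
    return num_chunks / qpm
```

With a 1,000 QPM quota this yields the 900 used in the import example above, and a 9,000-chunk corpus would import in roughly 10 minutes.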

# Terraform outputs feed into SDK config
# environments/prod.tfvars sets the quota boundary
embedding_qpm_rate = 900  # Leave headroom below 1000 QPM limit

Hybrid Search

By default, RAG Engine uses pure vector (dense) search. Hybrid search combines vector similarity with keyword (sparse/token‑based) matching using Reciprocal Rank Fusion (RRF). The α parameter controls the balance.

| α Value | Behavior |
|---|---|
| 1.0 | Pure vector / semantic search |
| 0.5 | Equal weight (default) |
| 0.0 | Pure keyword search |

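To build intuition for how the α weighting interacts with RRF, here is an illustrative fusion in plain Python. This is a conceptual sketch only; RAG Engine performs the fusion server-side, and `k=60` is just the conventional RRF smoothing constant:

```python
# Alpha-weighted Reciprocal Rank Fusion over two rankings (best first).
# Purely illustrative; the managed service does not expose this internals.

def rrf_fuse(dense_ranking, sparse_ranking, alpha=0.5, k=60):
    """Fuse a dense and a sparse ranking into one combined ordering."""
    scores = {}
    for rank, doc in enumerate(dense_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(sparse_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

At `alpha=1.0` the fused order reproduces the dense ranking, at `alpha=0.0` the sparse one. In the SDK you only set `alpha`; the fusion itself is handled by the service: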
from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool

rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    filter=rag.Filter(
        vector_distance_threshold=0.3,
    ),
    hybrid_search=rag.HybridSearch(
        alpha=0.6  # Slightly favor semantic, but include keyword matching
    ),
)

# Retrieve‑only (no generation)
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What is policy ABC-123 regarding overtime?",
    rag_retrieval_config=rag_retrieval_config,
)

When to Adjust α

  • Specific codes, IDs, or exact terminology (e.g., policy numbers, product SKUs, error codes) → lower α toward 0.3–0.4.
  • Natural‑language questions about concepts → keep α in the 0.6–0.8 range.
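One way to act on this is a small routing heuristic that inspects the query before setting α. The regex and the 0.35/0.7 cutoffs are assumptions for illustration, not part of the SDK:

```python
import re

# Queries containing exact identifiers (policy numbers, SKUs, error codes)
# get a lower alpha so keyword matching carries more weight.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-?\d+\b")  # e.g. ABC-123, SKU4421

def pick_alpha(query: str) -> float:
    """Keyword-leaning alpha for ID-style queries, semantic otherwise."""
    return 0.35 if ID_PATTERN.search(query) else 0.7
```

The chosen value then goes straight into `rag.HybridSearch(alpha=...)`.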

Reranking

Similarity isn’t the same as relevance. Reranking, which rescores the top‑K chunks with a deeper understanding of how each document relates to the query, can dramatically improve results. RAG Engine integrates with two reranking approaches.

1️⃣ Google‑Hosted Ranking Service

Uses Google’s pre‑trained ranking models via the Discovery Engine API. Requires enabling the service:

# rag/apis.tf
resource "google_project_service" "discovery_engine" {
  project = var.project_id
  service = "discoveryengine.googleapis.com"

  disable_dependent_services = false
  disable_on_destroy         = false
}

Configure at retrieval time:

rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=15,  # Retrieve wide
    ranking=rag.Ranking(
        rank_service=rag.RankService(
            model_name="semantic-ranker-default@latest"
        )
    ),
    hybrid_search=rag.HybridSearch(alpha=0.6),
)

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What are the penalties for late contract delivery?",
    rag_retrieval_config=rag_retrieval_config,
)

Pattern: Retrieve 15 chunks with hybrid search, let the rank service re‑score and return the most relevant. This “retrieve wide, rerank narrow” approach consistently outperforms retrieving only 5 chunks directly.

2️⃣ LLM‑Based Ranker

Uses an LLM to re‑rank results. Higher latency but can handle nuanced relevance judgments.

rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    ranking=rag.Ranking(
        llm_ranker=rag.LlmRanker(
            model_name="gemini-2.0-flash"
        )
    ),
)

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What are the penalties for late contract delivery?",
    rag_retrieval_config=rag_retrieval_config,
)

Trade‑off

  • Rank Service – faster & cheaper.
  • LLM Ranker – better for complex, ambiguous queries.

Start with the Rank Service and switch to the LLM Ranker only for specific query patterns where relevance is poor.

Metadata Filtering

Scope retrieval to specific document categories using metadata filters. Metadata is applied at query time as a filter string.

rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    filter=rag.Filter(
        vector_distance_threshold=0.3,
        metadata_filter="department = 'legal' AND year >= 2024"
    ),
    hybrid_search=rag.HybridSearch(alpha=0.6),
)

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What changed in the",
    rag_retrieval_config=rag_retrieval_config,
)

(The example query is truncated in the source material; replace with the full question as needed.)

Metadata & Import

Metadata is attached during file import. For GCS‑sourced files, metadata comes from the file’s properties or can be set programmatically during import operations.

Vector‑Distance Threshold

The vector_distance_threshold parameter filters out low‑relevance chunks before they reach the model. Only chunks with a vector distance below the threshold are returned.

# Strict filtering – only highly relevant chunks
filter = rag.Filter(vector_distance_threshold=0.3)

# Relaxed filtering – cast a wider net
filter = rag.Filter(vector_distance_threshold=0.7)
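Conceptually, the threshold is a pre-generation gate over retrieved candidates. A minimal sketch of the behavior (field names are illustrative; RAG Engine applies this filtering server-side):

```python
# Chunks at or above the distance cutoff never reach the model.
def apply_distance_threshold(chunks, threshold):
    return [c for c in chunks if c["distance"] < threshold]

candidates = [
    {"text": "overtime policy", "distance": 0.21},
    {"text": "cafeteria menu", "distance": 0.55},
]
# threshold=0.3 keeps only the close match; 0.7 lets both through
```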

Tuning Guide

| Starting point | When to tighten | When to relax |
|---|---|---|
| 0.5 | Irrelevant chunks appear → lower to 0.3 | Too few results → raise to 0.7 |

Tip: When using reranking, set a relaxed threshold (e.g., 0.7) to let more candidates through, then let the reranker sort by relevance.

Infrastructure Layer (Terraform)

The Terraform configuration provisions the required APIs, GCS bucket, IAM, and the Vertex AI RAG Engine.

# rag/main.tf
resource "google_project_service" "required_apis" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "discoveryengine.googleapis.com",
    "storage.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  disable_dependent_services = false
  disable_on_destroy         = false
}

resource "google_vertex_ai_rag_engine_config" "this" {
  region = var.region

  rag_managed_db {
    type = var.rag_db_tier
  }

  depends_on = [google_project_service.required_apis]
}

resource "google_storage_bucket" "rag_docs" {
  name     = "${var.project_id}-${var.environment}-rag-docs"
  location = var.region

  uniform_bucket_level_access = true

  lifecycle_rule {
    condition { age = var.doc_retention_days }
    action    { type = "Delete" }
  }
}

Environment‑Specific Variables

dev.tfvars

rag_db_tier         = "BASIC"
doc_retention_days  = 90
embedding_qpm_rate  = 500

# Retrieval config (passed to SDK)
chunk_size          = 300
chunk_overlap       = 50
retrieval_top_k     = 5
alpha               = 0.5
distance_threshold  = 0.5
reranker            = "none"

prod.tfvars

rag_db_tier         = "SCALED"
doc_retention_days  = 2555
embedding_qpm_rate  = 900

# Retrieval config (passed to SDK)
chunk_size          = 512
chunk_overlap       = 100
retrieval_top_k     = 15
alpha               = 0.6
distance_threshold  = 0.3
reranker            = "semantic-ranker-default@latest"
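To keep the SDK in sync with these tfvars, one option is exporting the values (for example via `terraform output`) into environment variables and reading them at startup. The variable names and defaults below are assumptions, not an established convention:

```python
import os

# Read Terraform-managed retrieval settings from the environment, falling
# back to the prod values shown above.
def retrieval_settings() -> dict:
    return {
        "top_k": int(os.environ.get("RETRIEVAL_TOP_K", "15")),
        "alpha": float(os.environ.get("ALPHA", "0.6")),
        "distance_threshold": float(os.environ.get("DISTANCE_THRESHOLD", "0.3")),
        "reranker": os.environ.get("RERANKER", "semantic-ranker-default@latest"),
    }
```

These values then feed `rag.RagRetrievalConfig` exactly as in the retrieval examples earlier.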

Feature Comparison

| Feature | Azure AI Search | AWS Bedrock KB | GCP RAG Engine |
|---|---|---|---|
| Chunking | Fixed-size + Document Layout skill | Fixed, hierarchical, semantic, Lambda | Fixed-size only |
| Hybrid search | BM25 + vector via RRF (built-in) | Supported on OpenSearch | Alpha-weighted dense/sparse |
| Semantic reranking | Built-in transformer ranker (L2) | Cohere Rerank | Rank Service + LLM Ranker |
| Query decomposition | Agentic retrieval (native) | Native API parameter | Not built-in |
| Metadata filtering | Filterable index fields + OData | JSON metadata files in S3 | Filter string at query time |
| Strictness control | 1-5 scale on data source | Not built-in | Vector distance threshold |
| Reranker score range | 0-4 (calibrated, cross-query consistent) | Model-dependent | Model-dependent |

Takeaway: GCP’s advantage is operational simplicity – the managed vector DB and per‑import chunking make experimentation faster. AWS offers more built‑in chunking strategies and native query decomposition.

Your Situation Matrix

| Use case | Chunk size / overlap | Alpha | Reranker | Distance threshold |
|---|---|---|---|---|
| Getting started, mixed docs | 512 / 100 | 0.5 | None | 0.5 |
| Users search by codes/IDs | 256 / 50 | 0.3 | Rank Service | 0.5 |
| Long technical documents | 1024 / 200 | 0.7 | Rank Service | 0.3 |
| High-precision production | 512 / 100 | 0.6 | Rank Service | 0.3 |
| Complex, ambiguous queries | 512 / 100 | 0.6 | LLM Ranker | 0.5 |

Recommendation: Start with the “high‑precision production” row as your default configuration. Enable the Discovery Engine API, use Rank Service reranking, and fine‑tune from there.

Series Context

  • Post 1: Vertex AI RAG Engine – Basic Setup 🔍
  • Post 2 (you’re here): Advanced RAG – Chunking, Hybrid Search, Reranking 🧠

Your RAG pipeline just leveled up: hybrid search for precision, Rank Service reranking for relevance, metadata filtering for scope, and vector‑distance thresholds for noise control – all driven by Terraform variables per environment.

Found this helpful? Follow for the full RAG Pipeline with Terraform series! 💬
