Building a dbt Incremental Model for Parsing and Chunking PDFs for Snowflake Cortex Search Service

Published: February 10, 2026 at 06:11 PM EST
6 min read
Source: Dev.to

Building an Incremental dbt Model for Cortex Search Service (RAG)

*This is an English translation of the original article.*

To build Retrieval‑Augmented Generation (RAG) using Cortex Search Service, you first need a table that contains documents (e.g., PDFs) parsed with functions such as AI_PARSE_DOCUMENT.
Below is a complete walkthrough of how to create that table using dbt.


The dbt Model Implementation

The model parses and chunks PDF files stored in an S3 bucket (Snowflake customer case‑study PDFs).
Key points:

  • Incremental model with unique_key='relative_path' and incremental_strategy='delete+insert'.
    Records with the same path are deleted and re‑inserted on each run.
  • Skips processing when a record with the same path and an equal or newer last_modified timestamp already exists, saving on AI_PARSE_DOCUMENT costs.
  • Chunking uses SNOWFLAKE.CORTEX.SPLIT_TEXT_MARKDOWN_HEADER.
    The approach follows a blog post by Takada‑san from Snowflake.
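The skip condition in the second bullet can be sketched in plain Python (the file names and timestamps below are illustrative, not the actual S3 contents):

```python
from datetime import datetime

def files_to_process(staged, existing):
    """Return staged files that need (re)parsing.

    A staged file is skipped when the target table already holds a row
    with the same path and an equal or newer last_modified timestamp,
    mirroring the NOT EXISTS filter in the model below.
    """
    newest = {}  # path -> newest last_modified already in the table
    for path, last_modified in existing:
        if path not in newest or last_modified > newest[path]:
            newest[path] = last_modified
    return [
        (path, lm) for path, lm in staged
        if path not in newest or lm > newest[path]
    ]

staged = [
    ("snowflake-case-studies/a.pdf", datetime(2026, 2, 1)),  # unchanged
    ("snowflake-case-studies/b.pdf", datetime(2026, 2, 5)),  # re-uploaded, newer
    ("snowflake-case-studies/c.pdf", datetime(2026, 2, 6)),  # brand new
]
existing = [
    ("snowflake-case-studies/a.pdf", datetime(2026, 2, 1)),
    ("snowflake-case-studies/b.pdf", datetime(2026, 2, 2)),
]
# Only b.pdf (newer in the stage) and c.pdf (not yet in the table) are returned.
print(files_to_process(staged, existing))
```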
{{ 
    config(
        materialized='incremental',
        unique_key='relative_path',
        incremental_strategy='delete+insert'
    ) 
}}

with directory_files as (

    select
        relative_path,
        last_modified,
        TO_FILE('@RAW_DB.SAMPLE_SOURCE_SNOWPIPE.STAGE_CORPUS_SAMPLE', relative_path) as doc_file
    from
        DIRECTORY(@RAW_DB.SAMPLE_SOURCE_SNOWPIPE.STAGE_CORPUS_SAMPLE) as d
    where
        relative_path like 'snowflake-case-studies/%.pdf'
        {% if is_incremental() %}
        and not exists (
            select 1
            from {{ this }} as t
            where t.relative_path = d.relative_path
              and t.last_modified >= d.last_modified
        )
        {% endif %}

),

parsed as (

    select
        relative_path,
        last_modified,
        SNOWFLAKE.CORTEX.AI_PARSE_DOCUMENT(
            doc_file,
            {'mode': 'LAYOUT', 'page_split': TRUE}
        ) as parsed_result
    from
        directory_files

),

concatenated as (

    select
        p.relative_path,
        p.last_modified,
        LISTAGG(f.value:content::varchar, '\n\n')
            within group (order by f.index) as full_content
    from
        parsed as p,
        lateral flatten(input => p.parsed_result:pages) as f
    group by
        p.relative_path, p.last_modified

),

chunked as (

    select
        c.relative_path,
        c.last_modified,
        ch.value:headers:header_1::varchar as header_1,
        ch.value:headers:header_2::varchar as header_2,
        ch.value:headers:header_3::varchar as header_3,
        ch.value:chunk::varchar          as chunk,
        ch.index                         as chunk_index
    from
        concatenated as c,
        lateral flatten(
            input => SNOWFLAKE.CORTEX.SPLIT_TEXT_MARKDOWN_HEADER(
                c.full_content,
                OBJECT_CONSTRUCT('#', 'header_1', '##', 'header_2', '###', 'header_3'),
                10000
            )
        ) as ch

)

select
    SHA2(relative_path || ':' || chunk_index, 256) as doc_chunk_id,
    relative_path,
    SPLIT_PART(relative_path, '/', -1)                     as file_name,
    REPLACE(REPLACE(SPLIT_PART(relative_path, '/', -1),
                   'Case_Study_', ''), '.pdf', '')          as case_study_name,
    header_1,
    header_2,
    header_3,
    chunk_index,
    chunk,
    last_modified,
    CURRENT_TIMESTAMP() as _parsed_at
from
    chunked
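The doc_chunk_id and case_study_name derivations in the final SELECT are easy to mirror in Python: Snowflake's SHA2(expr, 256) returns a hex string, which matches hashlib's hexdigest (the sample path is illustrative):

```python
import hashlib

def doc_chunk_id(relative_path: str, chunk_index: int) -> str:
    """Equivalent of SHA2(relative_path || ':' || chunk_index, 256)."""
    return hashlib.sha256(f"{relative_path}:{chunk_index}".encode()).hexdigest()

def case_study_name(relative_path: str) -> str:
    """Equivalent of the SPLIT_PART/REPLACE chain: take the file name,
    strip the 'Case_Study_' prefix and the '.pdf' extension."""
    file_name = relative_path.rsplit("/", 1)[-1]
    return file_name.replace("Case_Study_", "").replace(".pdf", "")

path = "snowflake-case-studies/Case_Study_Example.pdf"  # illustrative path
print(case_study_name(path))      # Example
print(len(doc_chunk_id(path, 0))) # 64 (hex-encoded SHA-256, stable per path+index)
```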

dbt_project.yml – Hooks for Stage Refresh & Change Tracking

name: 'sample_dbt_project'
version: '1.9.0'
config-version: 2

profile: 'sample_project'

models:
  sample_dbt_project:
    corpus:
      sample:
        +schema: corpus_sample
        +pre-hook:
          - "ALTER STAGE RAW_DB.SAMPLE_SOURCE_SNOWPIPE.STAGE_CORPUS_SAMPLE REFRESH"
        +post-hook:
          - "ALTER TABLE {{ this }} SET CHANGE_TRACKING = TRUE"
  • Pre‑hook – Refreshes the external stage that points to the S3 bucket, ensuring the directory table reflects the latest files before the model runs.
  • Post‑hook – Enables CHANGE_TRACKING on the resulting table, a requirement for Cortex Search Service.

First dbt Model Run & Cortex Search Service Setup

I started with only two PDFs in S3 (Bourbon and Aisan Takaoka case studies).
(Stage‑creation statements are omitted for brevity.)

Directory view in S3

Resulting table after dbt build

Running dbt build creates a Snowflake table containing the parsed and chunked PDFs:

Resulting table preview


Next Steps

  1. Create a Cortex Search Service index that points to the table generated above.
  2. Ingest the data (the doc_chunk_id column will be used as the document identifier).
  3. Query the index from your RAG application to retrieve relevant chunks based on user prompts.

With the incremental dbt model in place, any new or updated PDFs added to the S3 bucket will be automatically parsed, chunked, and made searchable with minimal manual effort.
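To illustrate step 3, here is a toy retrieval loop. This is not how Cortex Search ranks results (the service uses embedding similarity); the keyword-overlap scorer is only a stand-in showing the retrieve-top-k shape a RAG application relies on, over made-up chunk rows:

```python
def retrieve(chunks, query, k=3):
    """Toy stand-in for the search service: rank chunks by word overlap
    with the query and return the top k non-zero matches."""
    q = set(query.lower().split())
    scored = [(len(q & set(c["chunk"].lower().split())), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

chunks = [  # shaped like rows from the dbt model (contents invented)
    {"doc_chunk_id": "a0", "chunk": "Snowflake reduced warehouse costs by 40 percent"},
    {"doc_chunk_id": "b0", "chunk": "The team migrated ETL pipelines to dbt"},
]
hits = retrieve(chunks, "how did dbt help the migration?", k=1)
print([h["doc_chunk_id"] for h in hits])  # ['b0']
```

In a real application you would call the deployed service instead of this scorer and pass the returned chunks into the LLM prompt, keyed by doc_chunk_id.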

Creating a Cortex Search Service and Agent for the dbt Model

Below are the SQL statements used to create the Cortex Search Service and Cortex Agent for the dbt model.

-- Create Cortex Search Service
CREATE OR REPLACE CORTEX SEARCH SERVICE PROD_DB.AIML_SAMPLE.CSS_SNOWFLAKE_CASE_STUDIES
    ON chunk
    ATTRIBUTES file_name,
               case_study_name,
               header_1,
               header_2,
               header_3,
               chunk_index,
               last_modified
    WAREHOUSE = CORTEX_SEARCH_WH
    TARGET_LAG = '365 day'
    EMBEDDING_MODEL = 'voyage-multilingual-2'
    INITIALIZE = ON_CREATE
    COMMENT = 'Cortex Search Service for chunked Snowflake customer case study PDFs'
AS
SELECT
    file_name,
    case_study_name,
    header_1,
    header_2,
    header_3,
    chunk_index,
    chunk,
    last_modified
FROM PROD_DB.CORPUS_SAMPLE.COR_SNOWFLAKE_CASE_STUDIES_PARSED;

-- Create Agent
CREATE OR REPLACE AGENT PROD_DB.AIML_SAMPLE.AGENT_SNOWFLAKE_CASE_STUDIES
    COMMENT = 'Cortex Agent for searching and answering questions about Snowflake customer case study PDFs'
    PROFILE = '{"display_name": "Snowflake Case Study Search", "avatar": "search", "color": "#29B5E8"}'
    FROM SPECIFICATION
    $$
    instructions:
      system: |
        You are an assistant that answers questions about the content of Snowflake customer case study PDFs.
        Use the Cortex Search Service to search for relevant case study information
        and provide answers based on accurate information.
        If the information is not found in the search results, do not speculate — inform the user accordingly.
      response: |
        Please respond concisely.
        Include the referenced case study name (case_study_name) in your answer.

    tools:
      - tool_spec:
          type: "cortex_search"
          name: "SearchCaseStudies"
          description: "Search Snowflake customer case study PDFs"

    tool_resources:
      SearchCaseStudies:
        name: "PROD_DB.AIML_SAMPLE.CSS_SNOWFLAKE_CASE_STUDIES"
        max_results: "5"
    $$;

-- Add to Snowflake Intelligence
ALTER SNOWFLAKE INTELLIGENCE SNOWFLAKE_INTELLIGENCE_OBJECT_DEFAULT
    ADD AGENT PROD_DB.AIML_SAMPLE.AGENT_SNOWFLAKE_CASE_STUDIES;

Results from Snowflake Intelligence

After running a query through Snowflake Intelligence, the response looked like this:

Snowflake Intelligence Result


Updating S3 Data and Running the dbt Model a Second Time

To verify that incremental updates work correctly, two changes were made to the S3 bucket:

  1. Deleted the 2nd page of the Aisan Takaoka case‑study PDF and re‑uploaded it with the same file name.
  2. Added a new Chiba Bank case‑study PDF to the bucket.

S3 Bucket Changes

dbt Build Output

Running dbt build again produced the following:

  • The chunk count for Aisan Takaoka decreased (page removed).
  • New records for Chiba Bank were added.
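The delete+insert behavior behind those two observations can be simulated in a few lines (the row counts below are illustrative, not the actual chunk counts):

```python
def delete_insert(table, new_rows, key="relative_path"):
    """Mimic dbt's delete+insert strategy: drop every existing row whose
    key appears in the new batch, then append the new rows."""
    keys = {r[key] for r in new_rows}
    return [r for r in table if r[key] not in keys] + new_rows

# State after the first run (made-up chunk counts).
table = (
    [{"relative_path": "aisan_takaoka.pdf", "chunk_index": i} for i in range(5)]
    + [{"relative_path": "bourbon.pdf", "chunk_index": i} for i in range(4)]
)
# Second run: Aisan Takaoka re-parsed with one page removed, Chiba Bank added.
second_run = (
    [{"relative_path": "aisan_takaoka.pdf", "chunk_index": i} for i in range(3)]
    + [{"relative_path": "chiba_bank.pdf", "chunk_index": i} for i in range(6)]
)
table = delete_insert(table, second_run)
counts = {}
for r in table:
    counts[r["relative_path"]] = counts.get(r["relative_path"], 0) + 1
# Aisan Takaoka shrinks, Bourbon is untouched, Chiba Bank appears.
print(counts)
```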

Chunk Count Changes

The _PARSED_AT column shows that only the records for the modified files (Aisan Takaoka and Chiba Bank) received updated timestamps.

Parsed At Timestamps

Refreshing the Cortex Search Service

With the table updated, the search service was refreshed:

ALTER CORTEX SEARCH SERVICE PROD_DB.AIML_SAMPLE.CSS_SNOWFLAKE_CASE_STUDIES REFRESH;

Querying through Snowflake Intelligence again confirmed that the updated data was being used for responses.

Updated Search Results


Conclusion

I explored building a dbt incremental model that parses PDFs, chunks them, and loads the chunks into a Cortex Search Service. By leveraging dbt’s incremental capabilities, only files that changed are re‑processed by the AI_PARSE_DOCUMENT function, which helps reduce compute costs.

I hope you find this useful!
