From Parquet to Snowflake: Query Smart, Load Fast
Source: Dev.to

The Problem
When working with large volumes of financial data, querying efficiently and loading the results into a data warehouse like Snowflake is crucial. The task is to generate daily metrics (total transaction volume, active customers, average balances) from 3 TB of Parquet data stored in AWS S3. The data is partitioned by transaction_date, but older partitions have inconsistent column names. The results must be loaded into Snowflake for further analysis.
The Approach
Efficiently Query the Data
Read only the last 30 days of data by using partition pruning, which saves time and cost.
Handle Schema Evolution
Use SQL functions like COALESCE to handle missing or differently named columns, ensuring a consistent schema across partitions.
Aggregate Metrics
Aggregate data by region to calculate total transaction volume, count active customers, and compute the average account balance.
Load Data into Snowflake
After processing, export the results to a CSV file and use Snowflake’s COPY INTO command for efficient, large‑scale ingestion into the warehouse.
Why This Works
- Partition pruning limits the query to relevant data, making it fast and cost‑efficient.
- Schema handling with COALESCE provides seamless integration across evolving data partitions.
- Snowflake’s optimized loading mechanisms enable fast and reliable data transfer.
This approach makes working with large, partitioned datasets in cloud storage manageable while ensuring efficient processing and loading into Snowflake.
The Solution in Python and SQL
Read the last 30 days of Parquet using partition pruning
import duckdb
import datetime

end = datetime.date.today()
start = end - datetime.timedelta(days=30)
con = duckdb.connect()
# The httpfs extension is required to read directly from S3
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# hive_partitioning exposes transaction_date as a column so the date filter
# prunes partitions; union_by_name aligns partitions whose column names differ
df = con.execute(f"""
    SELECT *
    FROM read_parquet('s3://bank-lake/transactions/*/*.parquet',
                      hive_partitioning = true, union_by_name = true)
    WHERE transaction_date BETWEEN '{start}' AND '{end}'
""").fetchdf()
Aggregate metrics, handling schema differences
SELECT
region,
SUM(transaction_amount) AS total_tx,
COUNT(DISTINCT customer_id) AS active_customers,
AVG(COALESCE(account_balance, acct_balance)) AS avg_balance
FROM df
GROUP BY region
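This query can be executed directly in DuckDB against the df DataFrame, since DuckDB resolves df by name from the surrounding Python scope. A minimal sketch that captures the output as the result DataFrame used in the loading step:
# Run the aggregation in DuckDB against the in-memory DataFrame;
# DuckDB resolves `df` by name from the Python scope.
result = con.execute("""
    SELECT
        region,
        SUM(transaction_amount) AS total_tx,
        COUNT(DISTINCT customer_id) AS active_customers,
        AVG(COALESCE(account_balance, acct_balance)) AS avg_balance
    FROM df
    GROUP BY region
""").fetchdf()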
Load the results into Snowflake
import snowflake.connector
# Export the aggregated DataFrame to CSV
result.to_csv("daily.csv", index=False)
# Connect to Snowflake (add warehouse, database, and schema parameters as needed)
conn = snowflake.connector.connect(
user='YOUR_USER',
password='YOUR_PASSWORD',
account='YOUR_ACCOUNT'
)
# Upload the CSV to the table stage, then copy it into the table.
# The connector runs one statement per execute() call, so PUT and COPY INTO
# are issued separately; SKIP_HEADER=1 skips the CSV header row.
cur = conn.cursor()
cur.execute("PUT file://daily.csv @%DAILY_REGION_METRICS")
cur.execute("""
COPY INTO DAILY_REGION_METRICS
FROM @%DAILY_REGION_METRICS
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")