From Parquet to Snowflake: Query Smart, Load Fast
Source: Dev.to

The Problem
When working with large volumes of financial data, querying efficiently and loading the results into a data warehouse like Snowflake is crucial. The task is to generate daily metrics (total transaction volume, active customers, average balances) from 3 TB of Parquet data stored in AWS S3. The data is partitioned by transaction_date, but older partitions have inconsistent column names. The results must be loaded into Snowflake for further analysis.
The Approach
Efficiently Query the Data
Read only the last 30 days of data by using partition pruning, which saves time and cost.
Handle Schema Evolution
Use SQL functions like COALESCE to handle missing or differently named columns, ensuring a consistent schema across partitions.
Aggregate Metrics
Aggregate data by region to calculate total transaction volume, count active customers, and compute the average account balance.
Load Data into Snowflake
After processing, export the results to a CSV file and use Snowflake’s COPY INTO command for efficient, large‑scale ingestion into the warehouse.
Why This Works
- Partition pruning limits the query to relevant data, making it fast and cost‑efficient.
- Schema handling with COALESCE provides seamless integration across evolving data partitions.
- Snowflake’s optimized loading mechanisms enable fast and reliable data transfer.
This approach makes working with large, partitioned datasets in cloud storage manageable while ensuring efficient processing and loading into Snowflake.
The Solution in Python and SQL
Read the last 30 days of Parquet using partition pruning
import duckdb
import datetime

end = datetime.date.today()
start = end - datetime.timedelta(days=30)
con = duckdb.connect()
# The httpfs extension is required to read directly from S3
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# hive_partitioning exposes transaction_date as a column so the date filter
# prunes partitions; union_by_name aligns partitions whose column names differ
df = con.execute(f"""
    SELECT *
    FROM read_parquet('s3://bank-lake/transactions/*/*.parquet',
                      hive_partitioning = true, union_by_name = true)
    WHERE transaction_date BETWEEN '{start}' AND '{end}'
""").fetchdf()
Aggregate metrics, handling schema differences
SELECT
region,
SUM(transaction_amount) AS total_tx,
COUNT(DISTINCT customer_id) AS active_customers,
AVG(COALESCE(account_balance, acct_balance)) AS avg_balance
FROM df
GROUP BY region
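This query can be executed directly in DuckDB against the df DataFrame, since DuckDB resolves df by name from the surrounding Python scope. A minimal sketch that captures the output as the result DataFrame used in the loading step:
# Run the aggregation in DuckDB against the in-memory DataFrame;
# DuckDB resolves `df` by name from the Python scope.
result = con.execute("""
    SELECT
        region,
        SUM(transaction_amount) AS total_tx,
        COUNT(DISTINCT customer_id) AS active_customers,
        AVG(COALESCE(account_balance, acct_balance)) AS avg_balance
    FROM df
    GROUP BY region
""").fetchdf()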
Load the results into Snowflake
import snowflake.connector
# Export the aggregated DataFrame to CSV
result.to_csv("daily.csv", index=False)
# Connect to Snowflake (add warehouse, database, and schema parameters as needed)
conn = snowflake.connector.connect(
user='YOUR_USER',
password='YOUR_PASSWORD',
account='YOUR_ACCOUNT'
)
# Upload the CSV to the table stage, then copy it into the table.
# The connector runs one statement per execute() call, so PUT and COPY INTO
# are issued separately; SKIP_HEADER=1 skips the CSV header row.
cur = conn.cursor()
cur.execute("PUT file://daily.csv @%DAILY_REGION_METRICS")
cur.execute("""
COPY INTO DAILY_REGION_METRICS
FROM @%DAILY_REGION_METRICS
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")