Ditch the Bloat: Building a High-Performance Health Data Lake with DuckDB & Parquet
Source: Dev.to
The Architecture: From XML Chaos to SQL Speed
The goal is to move from a slow, memory‑intensive XML structure to a columnar, compressed storage format optimized for analytical queries.
graph TD
A[Apple Health Export XML] -->|Python Streaming Parser| B(Raw Data Extraction)
B -->|Schema Mapping| C[Apache Parquet Files]
C -->|Zero‑Copy Load| D{DuckDB Engine}
D -->|Sub‑second SQL| E[Evidence.dev Dashboard]
D -->|Advanced Analytics| F[Pandas/Polars]
style D fill:#fff2cc,stroke:#d6b656,stroke-width:2px
style C fill:#dae8fc,stroke:#6c8ebf,stroke-width:2pxPrerequisites
- Python 3.9+
- DuckDB – “SQLite for Analytics”
- Pandas / PyArrow – for Parquet handling
- Your
export.xml(or a sample file)
Step 1: Parsing the Beast (XML → Parquet)
The standard xml.etree.ElementTree will crash your RAM if the file is > 2 GB. Instead, we use a streaming parser and extract only HeartRate records.
import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def parse_health_data(xml_path: str, output_parquet: str):
data = []
# iterparse handles large files without loading everything into memory
context = ET.iterparse(xml_path, events=('end',))
for event, elem in context:
if elem.tag == 'Record' and elem.get('type') == 'HKQuantityTypeIdentifierHeartRate':
data.append({
'timestamp': elem.get('startDate'),
'value': float(elem.get('value')),
'unit': elem.get('unit')
})
# Free memory
elem.clear()
# Write in batches to keep memory low
if len(data) > 100_000:
save_batch(data, output_parquet)
data = []
# Write any remaining rows
if data:
save_batch(data, output_parquet)
def save_batch(data: list[dict], path: str):
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path=path, partition_cols=None)
# Example usage:
# parse_health_data('export.xml', 'heart_rate_lake')Step 2: The Magic of DuckDB
Once the data lives in Parquet, it’s stored column‑wise. DuckDB can query it without loading the whole file into memory.
import duckdb
# Connect to a persistent DuckDB instance (or use ':memory:' for in‑memory)
con = duckdb.connect(database='health_analytics.db')
# Create a view directly on top of the Parquet files (zero‑copy!)
con.execute("""
CREATE VIEW heart_rates AS
SELECT *
FROM read_parquet('heart_rate_lake/*.parquet')
""")
# Example analytical query: hourly resting heart‑rate trend
res = con.execute("""
SELECT
date_trunc('hour', timestamp) AS hour,
avg(value) AS avg_bpm,
max(value) AS max_bpm
FROM heart_rates
WHERE value
""")Conclusion
By moving away from “bloatware” formats and embracing a modern data stack (Python + Parquet + DuckDB), you transform health data from a static backup into a dynamic research tool. Trends in recovery, sleep quality, and cardiovascular health become discoverable in milliseconds.
What’s next?
- Partition your Parquet files by
yearormonthfor even faster scans. - Integrate Evidence.dev for a beautiful, code‑based dashboard.
Are you analyzing your own health data? Let me know in the comments what insights you’ve uncovered!