Ditch the Bloat: Building a High-Performance Health Data Lake with DuckDB & Parquet

Published: 3 weeks ago (April 8, 2026 at 08:10 PM EDT)

3 min read

Source: Dev.to

Source: Dev.to

The Architecture: From XML Chaos to SQL Speed

The goal is to move from a slow, memory‑intensive XML structure to a columnar, compressed storage format optimized for analytical queries.

graph TD
    A[Apple Health Export XML] -->|Python Streaming Parser| B(Raw Data Extraction)
    B -->|Schema Mapping| C[Apache Parquet Files]
    C -->|Zero‑Copy Load| D{DuckDB Engine}
    D -->|Sub‑second SQL| E[Evidence.dev Dashboard]
    D -->|Advanced Analytics| F[Pandas/Polars]

    style D fill:#fff2cc,stroke:#d6b656,stroke-width:2px
    style C fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px

Prerequisites

Python 3.9+
DuckDB – “SQLite for Analytics”
Pandas / PyArrow – for Parquet handling
Your export.xml (or a sample file)

Step 1: Parsing the Beast (XML → Parquet)

The standard xml.etree.ElementTree will crash your RAM if the file is > 2 GB. Instead, we use a streaming parser and extract only HeartRate records.

import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path: str, output_parquet: str):
    data = []
    # iterparse handles large files without loading everything into memory
    context = ET.iterparse(xml_path, events=('end',))

    for event, elem in context:
        if elem.tag == 'Record' and elem.get('type') == 'HKQuantityTypeIdentifierHeartRate':
            data.append({
                'timestamp': elem.get('startDate'),
                'value': float(elem.get('value')),
                'unit': elem.get('unit')
            })

        # Free memory
        elem.clear()

        # Write in batches to keep memory low
        if len(data) > 100_000:
            save_batch(data, output_parquet)
            data = []

    # Write any remaining rows
    if data:
        save_batch(data, output_parquet)

def save_batch(data: list[dict], path: str):
    df = pd.DataFrame(data)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(table, root_path=path, partition_cols=None)

# Example usage:
# parse_health_data('export.xml', 'heart_rate_lake')

Step 2: The Magic of DuckDB

Once the data lives in Parquet, it’s stored column‑wise. DuckDB can query it without loading the whole file into memory.

import duckdb

# Connect to a persistent DuckDB instance (or use ':memory:' for in‑memory)
con = duckdb.connect(database='health_analytics.db')

# Create a view directly on top of the Parquet files (zero‑copy!)
con.execute("""
    CREATE VIEW heart_rates AS
    SELECT *
    FROM read_parquet('heart_rate_lake/*.parquet')
""")

# Example analytical query: hourly resting heart‑rate trend
res = con.execute("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        avg(value) AS avg_bpm,
        max(value) AS max_bpm
    FROM heart_rates
    WHERE value 
""")

Conclusion

By moving away from “bloatware” formats and embracing a modern data stack (Python + Parquet + DuckDB), you transform health data from a static backup into a dynamic research tool. Trends in recovery, sleep quality, and cardiovascular health become discoverable in milliseconds.

What’s next?

Partition your Parquet files by year or month for even faster scans.
Integrate Evidence.dev for a beautiful, code‑based dashboard.

Are you analyzing your own health data? Let me know in the comments what insights you’ve uncovered!

Ditch the Bloat: Building a High-Performance Health Data Lake with DuckDB & Parquet

The Architecture: From XML Chaos to SQL Speed

Prerequisites

Step 1: Parsing the Beast (XML → Parquet)

Step 2: The Magic of DuckDB

Conclusion

What’s next?

Related posts

How to Automate Your Life with Python Scripts - Updated April 12, 2026

ETL vs ELT: Which One Should You Use and Why?

Understanding Python Selenium Architecture

SteeplyOpinionated: A PR Review Tool for Tea Recipes (Returns 418, Obviously)