Ditch the Bloat: Building a High-Performance Health Data Lake with DuckDB & Parquet

Published: (April 8, 2026 at 08:10 PM EDT)
3 min read
Source: Dev.to

Source: Dev.to

The Architecture: From XML Chaos to SQL Speed

The goal is to move from a slow, memory‑intensive XML structure to a columnar, compressed storage format optimized for analytical queries.

graph TD
    A[Apple Health Export XML] -->|Python Streaming Parser| B(Raw Data Extraction)
    B -->|Schema Mapping| C[Apache Parquet Files]
    C -->|Zero‑Copy Load| D{DuckDB Engine}
    D -->|Sub‑second SQL| E[Evidence.dev Dashboard]
    D -->|Advanced Analytics| F[Pandas/Polars]

    style D fill:#fff2cc,stroke:#d6b656,stroke-width:2px
    style C fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px

Prerequisites

  • Python 3.9+
  • DuckDB – “SQLite for Analytics”
  • Pandas / PyArrow – for Parquet handling
  • Your export.xml (or a sample file)

Step 1: Parsing the Beast (XML → Parquet)

The standard xml.etree.ElementTree will crash your RAM if the file is > 2 GB. Instead, we use a streaming parser and extract only HeartRate records.

import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path: str, output_parquet: str):
    data = []
    # iterparse handles large files without loading everything into memory
    context = ET.iterparse(xml_path, events=('end',))

    for event, elem in context:
        if elem.tag == 'Record' and elem.get('type') == 'HKQuantityTypeIdentifierHeartRate':
            data.append({
                'timestamp': elem.get('startDate'),
                'value': float(elem.get('value')),
                'unit': elem.get('unit')
            })

        # Free memory
        elem.clear()

        # Write in batches to keep memory low
        if len(data) > 100_000:
            save_batch(data, output_parquet)
            data = []

    # Write any remaining rows
    if data:
        save_batch(data, output_parquet)

def save_batch(data: list[dict], path: str):
    df = pd.DataFrame(data)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(table, root_path=path, partition_cols=None)

# Example usage:
# parse_health_data('export.xml', 'heart_rate_lake')

Step 2: The Magic of DuckDB

Once the data lives in Parquet, it’s stored column‑wise. DuckDB can query it without loading the whole file into memory.

import duckdb

# Connect to a persistent DuckDB instance (or use ':memory:' for in‑memory)
con = duckdb.connect(database='health_analytics.db')

# Create a view directly on top of the Parquet files (zero‑copy!)
con.execute("""
    CREATE VIEW heart_rates AS
    SELECT *
    FROM read_parquet('heart_rate_lake/*.parquet')
""")

# Example analytical query: hourly resting heart‑rate trend
res = con.execute("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        avg(value) AS avg_bpm,
        max(value) AS max_bpm
    FROM heart_rates
    WHERE value 
""")

Conclusion

By moving away from “bloatware” formats and embracing a modern data stack (Python + Parquet + DuckDB), you transform health data from a static backup into a dynamic research tool. Trends in recovery, sleep quality, and cardiovascular health become discoverable in milliseconds.

What’s next?

  • Partition your Parquet files by year or month for even faster scans.
  • Integrate Evidence.dev for a beautiful, code‑based dashboard.

Are you analyzing your own health data? Let me know in the comments what insights you’ve uncovered!

0 views
Back to Blog

Related posts

Read more »