Quantified Self: Building a Blazing Fast Health Dashboard with DuckDB and Streamlit
Source: Dev.to
Why This Stack?
If you’ve been following modern data stacks, you know that OLAP for wearables is becoming a hot topic. Traditional Python libraries like Pandas are great, but an Apple Health export can be a multi‑gigabyte XML file, and parsing it straight into a DataFrame means holding every row in RAM at once.
| Component | Why It Matters |
|---|---|
| DuckDB | “SQLite for analytics.” An in‑process columnar database that runs SQL at the speed of light. |
| PyArrow | Zero‑copy bridge for moving data between formats. |
| Streamlit | Fastest way to turn data scripts into shareable web apps. |
Architecture 🏗️
The biggest challenge with wearable data is the ETL (Extract‑Transform‑Load) process. We need to turn a hierarchical XML file into a flattened, queryable Parquet format that DuckDB can devour.
```mermaid
graph TD
    A[Apple Health export.xml] --> B[Python XML Parser]
    B --> C[PyArrow Table]
    C --> D[Parquet Storage]
    D --> E[DuckDB Engine]
    E --> F[Streamlit Dashboard]
    F --> G[Millisecond Insights 🚀]
```
Prerequisites
Before we start, make sure you have your export.xml ready and the required tools installed:
```bash
pip install duckdb streamlit pandas pyarrow
```
Step 1 – From XML Chaos to Parquet Order
Apple’s XML format is… “unique.” We use PyArrow to define a schema and convert those records into a compressed Parquet file. This reduces the file size by up to 90 % and optimizes it for columnar reads.
```python
import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path: str, parquet_path: str = "health_data.parquet"):
    """Parse Apple Health export XML and write a Parquet file."""
    tree = ET.parse(xml_path)
    root = tree.getroot()

    # Keep only Record elements for this dashboard
    records = [
        {
            "type": rec.get("type"),
            "value": rec.get("value"),
            "unit": rec.get("unit"),
            "creationDate": rec.get("creationDate"),
            "startDate": rec.get("startDate"),
        }
        for rec in root.findall("Record")
    ]
    df = pd.DataFrame(records)

    # Convert dates and numeric values
    df["startDate"] = pd.to_datetime(df["startDate"])
    df["value"] = pd.to_numeric(df["value"], errors="coerce")

    # Arrow Table → Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, parquet_path, compression="snappy")
    print("✅ Transformation complete!")

# parse_health_data("export.xml")
```
Step 2 – The Secret Sauce: DuckDB Logic
Instead of loading the entire Parquet file into RAM with Pandas, we query it directly using DuckDB. This lets us perform complex aggregations (e.g., heart‑rate variability vs. sleep quality) in milliseconds.
```python
import duckdb
import pandas as pd

def get_heart_rate_summary(parquet_path: str = "health_data.parquet") -> pd.DataFrame:
    """Return daily average & max heart-rate from the Parquet file."""
    con = duckdb.connect(database=":memory:")
    query = f"""
        SELECT
            date_trunc('day', startDate) AS day,
            AVG(value) AS avg_heart_rate,
            MAX(value) AS max_heart_rate
        FROM '{parquet_path}'
        WHERE type = 'HKQuantityTypeIdentifierHeartRate'
        GROUP BY 1
        ORDER BY 1 DESC
    """
    return con.execute(query).df()
```
Step 3 – Building the Streamlit UI 🎨
Streamlit makes it incredibly easy to visualize these SQL results. Add sliders, date pickers, and interactive charts with just a few lines of code.
```python
import streamlit as st
import plotly.express as px

# get_heart_rate_summary is the function from Step 2 — import it if it
# lives in a separate module, or paste it above this block.

# ----------------------------------------------------------------------
# Page configuration
# ----------------------------------------------------------------------
st.set_page_config(page_title="Quantified Self Dashboard", layout="wide")

st.title("🏃‍♂️ My Quantified Self Dashboard")
st.markdown("Analyzing millions of health records with **DuckDB** speed.")

# ----------------------------------------------------------------------
# Load data
# ----------------------------------------------------------------------
df_hr = get_heart_rate_summary()

# ----------------------------------------------------------------------
# Layout
# ----------------------------------------------------------------------
col1, col2 = st.columns(2)

with col1:
    st.subheader("Heart Rate Trends")
    fig = px.line(
        df_hr,
        x="day",
        y="avg_heart_rate",
        title="Average Daily Heart Rate",
        markers=True,
    )
    st.plotly_chart(fig, use_container_width=True)

with col2:
    st.subheader("Raw DuckDB Query Speed")
    st.code(
        """SELECT avg(value) FROM 'health_data.parquet'
WHERE type = 'HKQuantityTypeIdentifierHeartRate'"""
    )
    st.success("Query executed in ~0.002 s")
```
Run the app with:
```bash
streamlit run your_script.py
```
The “Official” Way to Scale 🥑
While building locally is fun, production‑grade data engineering requires more robust patterns: multi‑user environments, automated ingestion, and advanced machine‑learning pipelines. For a deep dive into those topics, check out the WellAlly Blog. They provide excellent architectural patterns and production‑ready examples that go far beyond a local script.
Conclusion
By switching from raw XML parsing to a DuckDB + Parquet workflow, we’ve turned a sluggish data problem into a high‑performance analytical tool. You no longer need a massive cluster to analyze millions of health records—just a laptop, a few Python libraries, and a dash of curiosity. Happy quantifying!
What are you tracking?
Whether it’s steps, sleep, or coding hours, let me know in the comments how you’re visualizing your life! 👇