Quantified Self at Scale: Processing Millions of Wearable Metrics with ClickHouse
Source: Dev.to
Introduction
Are you a data nerd who tracks every heartbeat, step, and sleep stage? If you own an Oura Ring and an Apple Watch, you're sitting on a goldmine of high-frequency biometric data. As your Quantified Self journey grows, a simple CSV or a standard relational database starts to crawl. When you're dealing with millions of sensor readings, you need a speed demon.
In this guide we dive deep into data engineering for personal health, exploring why ClickHouse is the ultimate choice for time-series health data and how to build a pipeline that delivers sub-second insights. We'll cover high-performance columnar storage, efficient schema design for time-series analytics, and real-time visualization.
Data Ingestion Pipeline
graph TD
A[Oura Ring API] -->|JSON| B(Python Ingestion Engine)
C[Apple Watch Export] -->|XML/CSV| B
B -->|Pandas/Batch| D{ClickHouse DB}
D -->|SQL Aggregation| E[Apache Superset]
D -->|Analysis| F[Jupyter Notebooks]
style D fill:#f96,stroke:#333,stroke-width:4px
Why ClickHouse for Wearable Metrics
When you have millions of rows of heart-rate data (sampled every few minutes or even seconds), traditional databases like PostgreSQL struggle with analytical aggregations. ClickHouse shines because:
- Columnar Storage: reads only the columns you need (e.g., just heart_rate).
- Data Compression: 10 million rows can shrink to a fraction of their original size.
- Vectorized Execution: processes data in chunks, making queries like "average heart rate per month" nearly instantaneous.
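You can see the compression win on your own data: ClickHouse exposes per-column storage statistics in the system.columns table. This is a sketch assuming the health_metrics table defined below already holds data:

```sql
-- Inspect on-disk vs. raw size for each column of health_metrics
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'health_metrics'
ORDER BY data_compressed_bytes DESC;
```

On repetitive columns like metric_name, double-digit compression ratios are common.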
Schema Design
In a high-performance setup, schema design is everything. We'll use the MergeTree engine, the workhorse of ClickHouse.
CREATE TABLE IF NOT EXISTS health_metrics (
event_time DateTime64(3, 'UTC'),
metric_name LowCardinality(String),
value Float64,
source_device LowCardinality(String),
user_id UInt32,
-- Tags for extra metadata (e.g., 'sleep', 'workout')
tags Array(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, metric_name, event_time)
SETTINGS index_granularity = 8192;
Why this works
- LowCardinality(String): saves space for repetitive values such as metric_name ('HeartRate', 'SPO2').
- PARTITION BY: speeds up deletions and organizes data by month.
- ORDER BY: determines physical sorting on disk, allowing ClickHouse to locate specific metrics for specific users lightning-fast.
Ingestion with Python
We use the clickhouse-connect library for a high-performance interface. Instead of inserting row by row (a ClickHouse anti-pattern), we batch our data.
import clickhouse_connect
import pandas as pd
# Connect to our local ClickHouse instance via Docker
client = clickhouse_connect.get_client(host='localhost', port=8123, username='default')
def ingest_wearable_data(df: pd.DataFrame):
    """
    Ingests a Pandas DataFrame into ClickHouse.
    Expects columns: ['event_time', 'metric_name', 'value', 'source_device', 'user_id', 'tags']
    """
    print(f"Ingesting {len(df)} rows of data...")
    # ClickHouse loves batches!
    client.insert_df('health_metrics', df)
    print("Ingestion complete.")

# Example: Processing Oura Sleep Data
# df = pd.read_json('oura_sleep_data.json')
# ingest_wearable_data(df)
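The shape of Oura's JSON varies by API version, so the commented read_json call above won't line up with our table as-is. As a sketch, assuming each sleep record exposes a bedtime_start timestamp and an average_hrv field (both hypothetical names), the flattening step might look like:

```python
import pandas as pd

def flatten_oura_sleep(records: list, user_id: int = 1) -> pd.DataFrame:
    """Flatten hypothetical Oura sleep records into the health_metrics layout.

    Each record is assumed to carry 'bedtime_start' (ISO timestamp) and
    'average_hrv' (float); real Oura payloads differ by API version.
    """
    rows = [
        {
            "event_time": pd.Timestamp(r["bedtime_start"]),
            "metric_name": "HRV",
            "value": float(r["average_hrv"]),
            "source_device": "oura_ring",
            "user_id": user_id,
            "tags": ["sleep"],
        }
        for r in records
    ]
    return pd.DataFrame(rows)

sample = [{"bedtime_start": "2024-03-01T23:10:00", "average_hrv": 52.0}]
print(flatten_oura_sleep(sample))
```

The resulting frame matches the column list that ingest_wearable_data expects, so it can be passed straight through.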
Sample Query: HRV During Sleep
SELECT
toStartOfDay(event_time) AS day,
avg(value) AS avg_hrv,
quantile(0.9)(value) AS p90_hrv
FROM health_metrics
WHERE metric_name = 'HRV'
AND has(tags, 'sleep')
GROUP BY day
ORDER BY day DESC
LIMIT 30;
Even with 50 million rows, this query typically returns in under 50 ms.
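If you want to sanity-check that aggregation logic without a running server, the same computation can be prototyped in pandas on a small sample (note that pandas' quantile is exact, whereas ClickHouse's quantile() is an approximation):

```python
import pandas as pd

def daily_hrv_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror the ClickHouse query: avg and p90 HRV per day, sleep-tagged only."""
    sleep_hrv = df[(df["metric_name"] == "HRV") & df["tags"].apply(lambda t: "sleep" in t)]
    grouped = sleep_hrv.groupby(sleep_hrv["event_time"].dt.floor("D"))["value"]
    out = grouped.agg(avg_hrv="mean", p90_hrv=lambda s: s.quantile(0.9))
    # ORDER BY day DESC LIMIT 30
    return out.sort_index(ascending=False).head(30)

sample = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-03-01 02:00", "2024-03-01 03:00", "2024-03-02 02:30"]),
    "metric_name": ["HRV", "HRV", "HRV"],
    "value": [50.0, 60.0, 70.0],
    "tags": [["sleep"], ["sleep"], ["sleep"]],
})
print(daily_hrv_summary(sample))
```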
Scaling Considerations
While the setup works great locally, productionāgrade health data platforms must handle:
- Schema evolution
- Multiātenancy
- Complex join patterns between sleep and activity metrics
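On the schema-evolution point, MergeTree handles additive changes cheaply: ALTER TABLE ... ADD COLUMN is a metadata-level operation that does not rewrite existing parts up front. A sketch (the quality column here is hypothetical):

```sql
-- Add a new metric-quality flag without rewriting existing data
ALTER TABLE health_metrics
    ADD COLUMN IF NOT EXISTS quality UInt8 DEFAULT 0;
```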
For more production-ready examples and advanced data-engineering patterns, such as integrating AI-driven health insights or building event-driven architectures, check out the deep-dive articles on the WellAlly Tech Blog. It's a fantastic resource for engineers bridging the gap between "it works on my machine" and "it works for millions of users."
Visualization with Apache Superset
To make the data human-readable, connect Apache Superset to your ClickHouse instance:
- Add a new Database connection using the clickhouse:// driver.
- Create a Time-series Chart.
- Use event_time as your temporal column.
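For the first step, if Superset's environment has the clickhouse-connect package installed, the SQLAlchemy URI typically follows this shape (host, port, and credentials below are local-default placeholders):

```
clickhousedb://default:@localhost:8123/default
```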
Boom! You now have a professional-grade health dashboard that rivals the Oura and Apple Health apps themselves.
Conclusion
Quantifying yourself shouldn't be limited by your tools. By moving from legacy databases to ClickHouse, you unlock the ability to query years of biometric data in the blink of an eye.
What are you tracking? Glucose levels, HRV, or perhaps your coding productivity metrics? Drop a comment below and let's talk data!
If you enjoyed this tutorial, don't forget to ❤️ and 🦄. For more advanced tutorials on data pipelines and system architecture, visit the WellAlly Blog.