JSON is Not Enough: The Engineering Headache of Flattening FHIR for Analytics
The Problem: It’s a Graph, Not a Table
The issue isn’t that FHIR is JSON. The issue is that FHIR resources are deeply nested, recursive, and polymorphic.
Take a simple Patient resource. In a SQL table, you’d expect a column for Name. In FHIR? Name is an array of objects.
- You have a legal ("official") name.
- You have a maiden name.
- You have a nickname.
Each of those objects contains a given array (first name, middle name) and a family string.
{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "use": "official",
      "family": "Chalmers",
      "given": ["Peter", "James"]
    },
    {
      "use": "nickname",
      "given": ["Jim"]
    }
  ],
  "identifier": [
    {
      "system": "http://hospital.org/mrns",
      "value": "12345"
    }
  ]
}
If you just toss this into BigQuery and try to query name.family, you get an error (or nulls, depending on how the schema was inferred), because name is a repeated record: you can't reach a field inside an array without unnesting it first.
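The same trap is easy to reproduce locally, no BigQuery required. A minimal pandas sketch using the Patient document above:

import pandas as pd

# The Patient document from above
patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [
        {"use": "official", "family": "Chalmers", "given": ["Peter", "James"]},
        {"use": "nickname", "given": ["Jim"]},
    ],
    "identifier": [{"system": "http://hospital.org/mrns", "value": "12345"}],
}

df = pd.json_normalize(patient)
print(df.columns.tolist())  # ['resourceType', 'id', 'name', 'identifier']
print(df.loc[0, "name"])    # the whole list of name objects sits in one cell

There is no name.family column to query; the array is just an opaque object in a single cell.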
The “Explosion” Problem
To analyze this in SQL, you usually have to UNNEST or LATERAL VIEW EXPLODE these arrays.
If a Patient has 2 names, 3 identifiers, and 5 extensions, and you try to flatten this into a single wide row without being careful, you trigger a Cartesian Product.
2 * 3 * 5 = 30 rows for a single patient
Now imagine doing this for an ExplanationOfBenefit resource with hundreds of line items and diagnosis codes. Your 1 million‑record dataset suddenly becomes 100 million rows of duplicate garbage.
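You can watch this blow-up happen with plain pandas. A toy sketch (the array contents here are placeholders, not real FHIR structures):

import pandas as pd

# One patient with 2 names, 3 identifiers, and 5 extensions (placeholder values)
df = pd.DataFrame([{
    "id": "example",
    "name": ["official", "nickname"],
    "identifier": ["mrn", "ssn", "member-id"],
    "extension": ["race", "ethnicity", "birth-sex", "language", "tribe"],
}])

# Naively exploding all three arrays multiplies them together
flat = df.explode("name").explode("identifier").explode("extension")
print(len(flat))  # 30 rows for a single patient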
Strategy 1: The “Good Enough” Extraction
If your analysts only care about the Official name and the MRN, don’t try to genericize the flattening. Hard‑code the path.
In Python/Pandas, this usually means writing small, hard-coded accessor functions. It's brittle, but it works for MVPs.
import pandas as pd

def get_official_family_name(patient_row):
    """Return the family name whose use is 'official', if present."""
    names = patient_row.get('name')
    if not isinstance(names, list):  # missing or malformed name element
        return None
    for n in names:
        if n.get('use') == 'official':
            return n.get('family')
    return None

# Assume 'data' is your list of Patient dicts
df = pd.DataFrame(data)
df['official_last_name'] = df.apply(get_official_family_name, axis=1)
This is okay for small scripts, but it doesn’t scale well in an ETL pipeline processing terabytes of data.
Strategy 2: Array‑aware Columnar Formats (The Real Fix)
The modern approach—and what is usually recommended—is to stop trying to force FHIR into a CSV‑style flat table immediately. Instead, move it to a columnar format that supports nested structures, like Parquet.
Tools like Apache Spark or Databricks preserve the nested schema. Eventually, you do need to flatten it for the Tableau/Power BI crowd. The trick is to create Satellite Tables.
Instead of one giant Patient table, create:
- Patient_Core(id, birthdate, gender)
- Patient_Names(id, use, family, given)
- Patient_Identifiers(id, system, value)
This normalizes the data, essentially reverse-engineering the JSON back into relational third normal form (3NF).
# A runnable sketch of the Spark transformation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("fhir-flatten").getOrCreate()

# Read raw FHIR JSON (NDJSON: one resource per line)
df = spark.read.json("s3://bucket/fhir/Patient/")

# Create the Names satellite table
df_names = df.select(
    col("id").alias("patient_id"),
    explode(col("name")).alias("name_struct"),
).select(
    "patient_id",
    "name_struct.use",
    "name_struct.family",
    "name_struct.given",  # Note: given is still an array here!
)

# Write out as Parquet
df_names.write.mode("overwrite").parquet("s3://bucket/analytics/patient_names/")
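Downstream, the BI-facing extract is just a join. A sketch, assuming a Patient_Core table was written out the same way (the s3://bucket/analytics/patient_core/ path is illustrative):

from pyspark.sql.functions import col

core = spark.read.parquet("s3://bucket/analytics/patient_core/")
names = spark.read.parquet("s3://bucket/analytics/patient_names/")

# Back to one row per patient: keep only the official name
# (assumes at most one official name per patient, or the join fans out)
report = core.join(
    names.filter(col("use") == "official"),
    core["id"] == names["patient_id"],
    "left",
).select(core["id"], "birthdate", "gender", "family")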
Context is King
The biggest headache isn’t just the structure; it’s the loss of context.
When you flatten Observation.component, you might get values 80 and 120. Which is Systolic and which is Diastolic? In the JSON, they are sibling objects. If you flatten them blindly into two rows, you must ensure you carry the code field with the corresponding value field.
Always flatten the object containing the value and its label together. Never flatten them independently, or you’ll lose the association between numbers and their meanings.
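In Spark terms, that means exploding the component array as a whole struct, so each code travels with its own value. A minimal sketch, assuming the components carry a valueQuantity (as blood-pressure observations typically do):

from pyspark.sql.functions import col, explode

df_obs = spark.read.json("s3://bucket/fhir/Observation/")

# Explode component as one unit: code and value stay paired in each row
df_components = df_obs.select(
    col("id").alias("observation_id"),
    explode(col("component")).alias("comp"),
).select(
    "observation_id",
    col("comp.code.coding")[0]["code"].alias("component_code"),  # e.g. LOINC 8480-6 = systolic
    col("comp.valueQuantity.value").alias("value"),
    col("comp.valueQuantity.unit").alias("unit"),
)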
Conclusion
FHIR is here to stay, and the schema is brilliant for clinical accuracy. It just requires a shift in mindset for Data Engineering. Move away from a “One Big Table” mentality and treat arrays and structs as first‑class citizens in your pipelines.
If you are struggling with complex data architectures or want to see how others tackle these ETL challenges, check out my other articles for more tech guides.
Happy coding, and may your JSON always be valid!