JSON is Not Enough: The Engineering Headache of Flattening FHIR for Analytics
The Problem: It’s a Graph, Not a Table
The issue isn’t that FHIR is JSON. The issue is that FHIR resources are deeply nested, recursive, and polymorphic.
Take a simple Patient resource. In a SQL table, you’d expect a column for Name. In FHIR? Name is an array of objects.
- You have a legal ("official") name.
- You have a maiden name.
- You have a nickname.
Each of those objects contains a given array (first name, middle name) and a family string.
{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "use": "official",
      "family": "Chalmers",
      "given": ["Peter", "James"]
    },
    {
      "use": "nickname",
      "given": ["Jim"]
    }
  ],
  "identifier": [
    {
      "system": "http://hospital.org/mrns",
      "value": "12345"
    }
  ]
}
If you just toss this into BigQuery and try to query name.family, you get an error (or nulls, depending on how the schema was inferred), because name is a repeated record: you can't reach a field inside an array without unnesting it first.
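The same trap is easy to reproduce locally, no BigQuery required. A minimal pandas sketch using the Patient document above:

import pandas as pd

# The Patient document from above
patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [
        {"use": "official", "family": "Chalmers", "given": ["Peter", "James"]},
        {"use": "nickname", "given": ["Jim"]},
    ],
    "identifier": [{"system": "http://hospital.org/mrns", "value": "12345"}],
}

df = pd.json_normalize(patient)
print(df.columns.tolist())  # ['resourceType', 'id', 'name', 'identifier']
print(df.loc[0, "name"])    # the whole list of name objects sits in one cell

There is no name.family column to query; the array is just an opaque object in a single cell.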
The “Explosion” Problem
To analyze this in SQL, you usually have to UNNEST or LATERAL VIEW EXPLODE these arrays.
If a Patient has 2 names, 3 identifiers, and 5 extensions, and you try to flatten this into a single wide row without being careful, you trigger a Cartesian Product.
2 * 3 * 5 = 30 rows for a single patient
Now imagine doing this for an ExplanationOfBenefit resource with hundreds of line items and diagnosis codes. Your 1 million‑record dataset suddenly becomes 100 million rows of duplicate garbage.
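You can watch this blow-up happen with plain pandas. A toy sketch (the array contents here are placeholders, not real FHIR structures):

import pandas as pd

# One patient with 2 names, 3 identifiers, and 5 extensions (placeholder values)
df = pd.DataFrame([{
    "id": "example",
    "name": ["official", "nickname"],
    "identifier": ["mrn", "ssn", "member-id"],
    "extension": ["race", "ethnicity", "birth-sex", "language", "tribe"],
}])

# Naively exploding all three arrays multiplies them together
flat = df.explode("name").explode("identifier").explode("extension")
print(len(flat))  # 30 rows for a single patient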
Strategy 1: The “Good Enough” Extraction
If your analysts only care about the Official name and the MRN, don’t try to genericize the flattening. Hard‑code the path.
In Python/Pandas, this usually means writing small, hard-coded accessor functions. It's brittle, but it works for MVPs.
import pandas as pd

def get_official_family_name(patient_row):
    """Return the family name whose use is 'official', if present."""
    names = patient_row.get('name')
    if not isinstance(names, list):  # missing or malformed name element
        return None
    for n in names:
        if n.get('use') == 'official':
            return n.get('family')
    return None

# Assume 'data' is your list of Patient dicts
df = pd.DataFrame(data)
df['official_last_name'] = df.apply(get_official_family_name, axis=1)
This is okay for small scripts, but it doesn’t scale well in an ETL pipeline processing terabytes of data.
Strategy 2: Array‑aware Columnar Formats (The Real Fix)
The modern approach—and what is usually recommended—is to stop trying to force FHIR into a CSV‑style flat table immediately. Instead, move it to a columnar format that supports nested structures, like Parquet.
Tools like Apache Spark or Databricks preserve the nested schema. Eventually, you do need to flatten it for the Tableau/Power BI crowd. The trick is to create Satellite Tables.
Instead of one giant Patient table, create:
- Patient_Core(id, birthdate, gender)
- Patient_Names(id, use, family, given)
- Patient_Identifiers(id, system, value)
This normalizes the data, essentially reverse-engineering the JSON back into relational third normal form (3NF).
# A runnable sketch of the Spark transformation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("fhir-flatten").getOrCreate()

# Read raw FHIR JSON (NDJSON: one resource per line)
df = spark.read.json("s3://bucket/fhir/Patient/")

# Create the Names satellite table
df_names = df.select(
    col("id").alias("patient_id"),
    explode(col("name")).alias("name_struct"),
).select(
    "patient_id",
    "name_struct.use",
    "name_struct.family",
    "name_struct.given",  # Note: given is still an array here!
)

# Write out as Parquet
df_names.write.mode("overwrite").parquet("s3://bucket/analytics/patient_names/")
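Downstream, the BI-facing extract is just a join. A sketch, assuming a Patient_Core table was written out the same way (the s3://bucket/analytics/patient_core/ path is illustrative):

from pyspark.sql.functions import col

core = spark.read.parquet("s3://bucket/analytics/patient_core/")
names = spark.read.parquet("s3://bucket/analytics/patient_names/")

# Back to one row per patient: keep only the official name
# (assumes at most one official name per patient, or the join fans out)
report = core.join(
    names.filter(col("use") == "official"),
    core["id"] == names["patient_id"],
    "left",
).select(core["id"], "birthdate", "gender", "family")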
Context is King
The biggest headache isn’t just the structure; it’s the loss of context.
When you flatten Observation.component, you might get values 80 and 120. Which is Systolic and which is Diastolic? In the JSON, they are sibling objects. If you flatten them blindly into two rows, you must ensure you carry the code field with the corresponding value field.
Always flatten the object containing the value and its label together. Never flatten them independently, or you’ll lose the association between numbers and their meanings.
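In Spark terms, that means exploding the component array as a whole struct, so each code travels with its own value. A minimal sketch, assuming the components carry a valueQuantity (as blood-pressure observations typically do):

from pyspark.sql.functions import col, explode

df_obs = spark.read.json("s3://bucket/fhir/Observation/")

# Explode component as one unit: code and value stay paired in each row
df_components = df_obs.select(
    col("id").alias("observation_id"),
    explode(col("component")).alias("comp"),
).select(
    "observation_id",
    col("comp.code.coding")[0]["code"].alias("component_code"),  # e.g. LOINC 8480-6 = systolic
    col("comp.valueQuantity.value").alias("value"),
    col("comp.valueQuantity.unit").alias("unit"),
)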
Conclusion
FHIR is here to stay, and the schema is brilliant for clinical accuracy. It just requires a shift in mindset for Data Engineering. Move away from a “One Big Table” mentality and treat arrays and structs as first‑class citizens in your pipelines.
If you are struggling with complex data architectures or want to see how others tackle these ETL challenges, check out my other articles for more tech guides.
Happy coding, and may your JSON always be valid!