FHIR to Parquet: 39x Faster Queries on Healthcare Data
Healthcare data is stored in a format that makes analytics deliberately painful.
FHIR (Fast Healthcare Interoperability Resources) is the standard for how hospitals exchange patient records. It is deeply nested JSON designed for transmission between systems, not for running queries. A single patient observation has the value buried four levels deep: code.coding[0].display. Running a cohort query against raw FHIR means scanning the entire JSON blob for every record.
I gave a version of this talk at a Seattle data meetup. The audience included staff engineers from Databricks and Uber, and a few startup founders. The ten-minute talk format forced me to get very precise about what actually matters here.
The Dataset
I used Synthea, a synthetic patient data generator from MITRE. It produces realistic FHIR bundles for fictional patients, including all the resource types you would see in a real EHR: patients, observations, encounters, conditions, medications, procedures.
The raw output for a 10,000 patient simulation:
| Resource type | Raw JSON size |
|---|---|
| observations | 18.7 GB |
| encounters | 9.1 GB |
| conditions | 6.8 GB |
| medications | 7.4 GB |
| procedures | 4.8 GB |
| patients | 4.2 GB |
| total | 51 GB |
The observations table is the largest because a patient can have dozens of lab results per encounter.
Why Parquet
Parquet is a columnar format. When you run SELECT loinc_code, AVG(value) FROM observations WHERE recorded_at > '2024-01-01', a columnar store reads only the three columns you need. A row-based format (or a JSON blob) reads every column in every row to find the ones you want.
For FHIR specifically, this is a dramatic difference. A typical cohort analysis touches maybe five columns out of twenty. You are reading 75% less data before you even run the query.
The second benefit is compression. Parquet with ZSTD compression gets FHIR observations down to roughly 30% of the original JSON size. The cardinality of LOINC codes and patient IDs means columnar compression is very effective.
The Transformation
DuckDB is the right tool for this. Its JSON reading is fast, the auto-schema inference handles FHIR’s nested structure well, and you can write directly to Parquet in a single SQL statement.
CREATE TABLE observations AS
SELECT
id,
subject.reference AS patient_id,
code.coding[1].code AS loinc_code,
code.coding[1].display AS observation_name,
valueQuantity.value::DOUBLE AS value,
valueQuantity.unit AS unit,
effectiveDateTime::TIMESTAMP AS recorded_at,
encounter.reference AS encounter_id
FROM read_json_auto(
'observations/*.json',
format = 'array',
records = 'true'
);
COPY observations
TO 'observations.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
The code.coding[1] pattern repeats across almost every FHIR resource type. Observations, conditions, procedures, medications all have a code field with a coding array. Once you write the extraction for observations you can adapt it to any other resource in a few minutes.
The one thing DuckDB’s auto-inference struggles with is deeply nested arrays where element types vary. In practice this shows up in component observations (blood pressure, for example, which has a systolic and a diastolic component as separate entries in the same array). You need a small preprocessing step to flatten those before loading.
The Query Test
The benchmark query: find all patients with at least one HbA1c observation above 7.0 in 2024 and count how many also have a diabetes condition code.
-- HbA1c LOINC code: 4548-4
WITH high_a1c AS (
SELECT DISTINCT patient_id
FROM observations
WHERE loinc_code = '4548-4'
AND value > 7.0
AND recorded_at >= '2024-01-01'
AND recorded_at < '2025-01-01'
)
SELECT COUNT(DISTINCT c.patient_id) AS diabetic_high_a1c
FROM conditions c
JOIN high_a1c h ON c.patient_id = h.patient_id
WHERE c.code = 'E11.9'; -- Type 2 diabetes ICD-10
Results:
| Format | Query time |
|---|---|
Raw FHIR JSON (read_json_auto) | 4 min 12 s |
| Parquet (ZSTD) | 6.4 s |
That is the 39x number.
The JSON path is slow for two reasons: it has to parse the entire blob for every record, and it cannot use predicate pushdown (it cannot skip records based on column values without reading them first). Parquet pushes the loinc_code = '4548-4' filter down to the storage layer and reads a fraction of the file.
What the Meetup Audience Cared About
The question I got most was about partitioning strategy. If you partition the Parquet files by patient_id, single-patient lookups are very fast but cohort queries (which touch many patients) still scan everything. If you partition by date, date-range queries are fast but single-patient lookups scan multiple partitions.
For analytics workloads the right partition key is usually time. For patient-facing workloads (looking up a specific patient’s history) you partition by patient_id. For mixed workloads, two copies with different partition schemes is not unreasonable given how cheap S3 storage is.
The other question was about schema evolution. FHIR versions change, LOINC codes get updated, new resource types are added. Parquet handles schema evolution reasonably well if you use a schema registry or at least version your output files. Delta Lake or Apache Iceberg on top of Parquet adds proper schema versioning and time travel, which for healthcare data where you may need to reconstruct a patient’s history as it was on a specific date is genuinely useful.
The full transformation code is on GitHub.