← all posts

ClickHouse vs DuckDB on Real FHIR Data: What the Benchmark Actually Shows

Most ClickHouse vs DuckDB comparisons run TPC-H at some synthetic scale factor and call it done. This one is different. I ran four queries against real synthetic FHIR data: 28,709 patients, 14.8 million observations, generated by Synthea. No artificial amplification. This was part of a pitch deck prepared for Altinity, the managed ClickHouse company.

Before the numbers, the internals. The results do not make sense without understanding what each engine is actually doing.

How DuckDB Executes Queries

DuckDB is an in-process analytical engine. No server, no network layer, no connection pool. It runs inside your Python process as a library.

The execution model is vectorized query processing. Instead of processing one row at a time, DuckDB processes batches of values from a single column at a time. A typical batch is 1,024 to 2,048 values. These are called vectors.

Why this matters: modern CPUs have SIMD instructions (Single Instruction Multiple Data) that can operate on 128, 256, or 512 bits simultaneously. If you feed the CPU a vector of 16 integers and apply the same comparison to all of them at once, you get 16x throughput compared to a loop. DuckDB’s operators are designed to exploit this.

The filter operator gets a vector of values, produces a selection bitmask of which positions pass, and passes that downstream. No row-by-row overhead. Every operator in the pipeline works this way.

DuckDB also does aggressive late materialization. It defers expanding row indices into full column data until the last possible point in the plan. Most of the query runs on minimal data, and the full result gets assembled only at the output.

For data that fits in memory or in a well-partitioned Parquet file on local disk, DuckDB is very fast.

How ClickHouse Executes Queries

ClickHouse is a distributed columnar database. It runs as a server, typically on multiple nodes. It is also vectorized, but it goes further with per-type specialization.

For GROUP BY, ClickHouse has separate hand-tuned implementations for UInt64 keys, String keys, Nullable(Int32) keys, and many other combinations. It does not use a single general-purpose hash table. It picks the fastest implementation for the specific types in your query. This is the reason ClickHouse benchmark numbers on narrow aggregation queries look implausible.

The storage format is MergeTree. Data is sorted by a primary key and stored in granules of 8,192 rows by default. ClickHouse keeps a sparse index: one entry per granule rather than one entry per row. A query with WHERE date >= '2024-01-01' skips directly to the relevant granule range without scanning the file. For high-cardinality event data sorted by time, this is extremely effective.

ClickHouse also has native array functions: arrayJoin, splitByRegexp, replaceRegexpAll. These operate directly on nested array columns without needing UDFs. For FHIR data where observations have polymorphic value types and nested coding arrays, this is a real advantage.

For approximate aggregations ClickHouse has quantileTDigest and uniqHLL12, which give sub-percent-error answers over billions of rows in milliseconds by trading exactness for speed.

The Four Benchmark Queries

The queries model real clinical analytics workloads at realistic production scale. 28,709 patients. 14.8 million observations. Containerised and reproducible.

Q1: Cardiometabolic Risk Scoring

Scores each patient using weighted nonlinear aggregation over blood pressure, lipid panel, and glucose observations. Real use case: stratify high-risk patients for early intervention programs. This is a multi-column aggregation with non-trivial arithmetic per row.

Q2: Comorbidity Pair Mining

Self-joins the full conditions table to find the most frequent diagnosis co-occurrences across the patient population. Real use case: population health cohort identification and drug safety studies. This is a self-join on a table with many-to-many patient-condition relationships.

Q3: Polypharmacy Temporal Linkage

Joins medications to diagnoses per patient and computes lag quantiles between diagnosis date and prescription date. Real use case: measuring treatment adherence and delayed prescribing patterns. This involves window functions and quantile computation over joined data.

Q4: Population-Scale Domain Analytics

Single-pass scan of all 14.8 million observations. Classifies each observation into clinical domains (cardiovascular, metabolic, anthropometric, behavioral) via vectorized regex and computes distinct patient coverage, volume, and value distributions per domain. Real use case: population health dashboards and CMS quality reporting.

Results

QueryClickHouseDuckDBWinner
Q1 Cardiometabolic risk scoringfasterbaselineClickHouse 1.21x
Q2 Comorbidity pair miningbaselinefasterDuckDB 1.29x
Q3 Polypharmacy temporal linkagefasterbaselineClickHouse 1.42x
Q4 Population domain analyticsfasterbaselineClickHouse 1.20x

ClickHouse wins 3 of 4 queries. DuckDB wins Q2.

Why DuckDB Wins Q2

Q2 is a self-join: join the conditions table to itself on patient_id, then count co-occurring diagnosis pairs. This is exactly the pattern where DuckDB’s hash join optimizer has an edge.

DuckDB builds a hash table from the smaller side of the join, probes it with the larger side, and exploits its memory-resident processing model to avoid the I/O overhead that comes with larger-than-memory joins. For a self-join where both sides are the same table, DuckDB can keep the hash table in CPU cache for the probe phase.

ClickHouse’s hash join is fast but its MergeTree format is optimized for range scans on sorted data. A self-join does not benefit from the sparse index because you are joining on patient_id, not scanning by primary key order.

Why ClickHouse Wins the Other Three

Q1 (risk scoring) is a column aggregation with arithmetic. ClickHouse’s SIMD-optimized aggregation operators over MergeTree columnar storage handle this efficiently.

Q3 (polypharmacy temporal linkage) involves quantile computation. ClickHouse’s quantileTDigest is purpose-built for this. DuckDB’s approx_quantile is also fast but ClickHouse’s sketch library is more mature.

Q4 (domain analytics) is a full scan with regex classification. ClickHouse’s replaceRegexpAll and splitByRegexp are vectorized at the engine level. On 14.8 million rows of clinical text, this compounds into a meaningful advantage.

What This Tells You About When to Use Which

DuckDB’s advantage is the absence of infrastructure. pip install duckdb. It runs in your notebook, your Lambda function, your GitHub Action. Zero operational overhead.

ClickHouse’s advantage is sustained performance under production load, especially on wide aggregation queries against high-cardinality time-series data. The MergeTree format with a well-chosen sort key is genuinely fast in a way that is hard to match.

For the clinical analytics workloads in this benchmark, ClickHouse is faster by 1.2x to 1.4x on three of four queries. That margin is real but not enormous. Whether it justifies the operational overhead of running ClickHouse depends on your query frequency and scale.

At 14.8 million rows, both systems complete every query in seconds. The operational breakeven point where ClickHouse’s performance advantage justifies its operational cost starts somewhere around the billion-row scale, when DuckDB’s single-node model starts to strain.

Below that, DuckDB is very hard to justify replacing.