ClickHouse Benchmark: ClickBench, Performance Characteristics, and What the Numbers Mean

When someone says "ClickHouse is fast," they usually mean it in the context of ClickBench - the open benchmark maintained by ClickHouse, Inc. that has become the de facto leaderboard for analytical databases. Understanding what ClickBench actually measures, why ClickHouse performs the way it does on those queries, and how to apply any of this to your own workload requires going past the headline numbers.

What ClickBench Is and What It Measures

ClickBench uses a single flat table containing 99,997,497 rows of real web analytics data - click events, session attributes, URL strings, referrer data, and a mix of integer and high-cardinality string columns. The data is derived from actual production traffic from a major web analytics platform, anonymized but with realistic distributions preserved. It is available as CSV, Parquet, and JSON Lines, which makes loading it into systems with different ingestion capabilities straightforward.
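
For quick experimentation, the Parquet copy can be loaded directly with the url() table function and schema inference. A minimal sketch, assuming the hits_compatible Parquet file as published in the ClickBench repository (verify the URL there; the repository also ships an exact CREATE TABLE with hand-tuned column types, which is preferable for serious measurement):

-- Sketch: load the ClickBench dataset straight from the published Parquet file.
-- The sort key matches the one used in the official ClickBench setup.
-- Note: if schema inference yields Nullable types for the sort key columns,
-- this CREATE will fail; fall back to the repository's explicit schema.
CREATE TABLE hits
ENGINE = MergeTree
ORDER BY (CounterID, EventDate, UserID, EventTime, WatchID)
AS SELECT * FROM url(
  'https://datasets.clickhouse.com/hits_compatible/hits.parquet',
  Parquet
);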

The 43 queries range from trivially simple counts to multi-filter GROUP BY aggregations over high-cardinality string columns. Some scan the entire dataset with no filter. Others apply tight WHERE conditions over indexed columns. Query Q1 just counts rows; Q33 runs a multi-column GROUP BY with several WHERE clauses targeting URL substrings. The spread matters because it exposes whether a system is fast across a range of access patterns or just good at one specific operation type.
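
The two extremes look roughly like this (paraphrased from the published query list rather than quoted verbatim):

-- The trivial end of the suite: a full-table count.
SELECT count() FROM hits;

-- The heavy end: aggregation over a high-cardinality string column,
-- filtered by a URL substring, with ordering and a LIMIT.
SELECT SearchPhrase, count() AS c
FROM hits
WHERE URL LIKE '%google%' AND SearchPhrase <> ''
GROUP BY SearchPhrase
ORDER BY c DESC
LIMIT 10;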

The standard execution environment is an AWS c6a.4xlarge instance (16 vCPUs, 32 GB RAM) with a 500 GB gp2 volume. Each query runs three times. The first run is the cold run — at minimum, the OS page cache is cleared before execution. A stricter "true cold run" also restarts the database server and flushes internal caches; many submissions (especially managed services) only clear the OS page cache, which ClickBench calls a "lukewarm cold run." The second and third runs measure warm cache behavior. Results are reported as the cold time and the minimum of runs two and three. That cold/warm distinction is operationally relevant: a system that's blazing fast with warm caches but slow on initial page loads may look better on a hot-cache leaderboard while hiding a real production bottleneck.
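
Reproducing the protocol on your own hardware is straightforward. A sketch, assuming a Linux box with a systemd-managed server (the exact scripts live in the ClickBench repository):

# Lukewarm cold run: flush dirty pages, then drop the OS page cache.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# True cold run: also restart the server, which clears ClickHouse's
# in-memory caches. When a restart is not an option, the mark and
# uncompressed-block caches can be dropped explicitly instead:
clickhouse-client --query "SYSTEM DROP MARK CACHE"
clickhouse-client --query "SYSTEM DROP UNCOMPRESSED CACHE"
sudo systemctl restart clickhouse-server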

Why ClickHouse Is Fast on This Workload

ClickHouse's performance on analytical scans comes from a stack of decisions that compound across the query path. Columnar storage is the foundation - rather than reading entire rows from disk, ClickHouse reads only the columns a query references. On a 100-column table where a GROUP BY touches three columns, that's potentially 97% less I/O. Columns of the same type also compress far better than mixed-type row pages: the combination of ClickHouse's columnar layout and LZ4 (the default codec) achieves 5-10x compression ratios for typical web event data — LZ4 alone achieves roughly 2-3x, but the columnar layout makes each column's data more homogeneous, dramatically improving what any compression algorithm can achieve. ZSTD pushes the ratio further at the cost of CPU.
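
Codecs are declared per column, and the achieved ratios can be read back from system.columns once real data is loaded. A sketch against a hypothetical events table:

-- Hypothetical table with codecs matched to column behavior.
CREATE TABLE events
(
    event_time DateTime CODEC(Delta, ZSTD(3)),  -- near-monotonic: delta-encode, then compress
    user_id    UInt64,                          -- default codec (LZ4)
    url        String   CODEC(ZSTD(3))          -- long repetitive strings: ZSTD earns its CPU cost
)
ENGINE = MergeTree
ORDER BY (event_time, user_id);

-- After loading data, check what each codec actually achieved.
SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS raw,
    formatReadableSize(data_compressed_bytes) AS stored,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'events';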

Vectorized execution is the layer above storage. ClickHouse processes data in batches - typically 65,536 rows at a time - and runs operations over each batch in tight loops that compile down to SIMD instructions (SSE4.2 or AVX2, depending on CPU capabilities), processing multiple values per instruction rather than one. The c6a series uses 3rd Gen AMD EPYC (Milan) processors, which support AVX2 but not AVX-512 (on AMD, AVX-512 arrives with 4th Gen EPYC, Genoa). ClickHouse can therefore operate on 256 bits per instruction on c6a. For a SUM over a UInt32 column, that means 8 additions per SIMD instruction instead of one. Across 100 million rows, this difference is not marginal. On modern hardware, ClickHouse routinely scans 2-10 GB/s per CPU core for aggregation-heavy queries, depending on data compression and filter selectivity.
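
You can see the vectorized pipeline in isolation by aggregating over the numbers() table function, which generates rows in memory with no storage or decompression in the path; clickhouse-client prints a throughput line (rows/s., GB/s.) after each query:

-- The default batch size of the execution pipeline (65,536 rows).
SELECT value FROM system.settings WHERE name = 'max_block_size';

-- Pure in-memory vectorized aggregation over a billion generated rows;
-- the client's reported rows/s. approximates raw aggregation speed.
SELECT sum(number) FROM numbers(1000000000);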

The MergeTree storage engine's primary key (sort key) determines the physical order of data on disk. ClickHouse generates sparse index granules - by default one index entry per 8,192 rows. Queries that include the sort key columns in their WHERE clause can skip entire granules without reading them, which translates directly to reduced I/O. The table in ClickBench uses a sort key on (CounterID, EventDate, UserID, EventTime, WatchID), so queries filtering on the leading columns skip large portions of the dataset outright.
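
Granule skipping is directly observable with EXPLAIN. The plan's Indexes section reports how many granules the primary key condition selected out of the total (the counts in the comment below are illustrative):

-- A filter on the leading sort key column should select a small
-- fraction of granules, e.g. "Granules: 12/12208"; the same query
-- filtered on a non-key column selects all of them.
EXPLAIN indexes = 1
SELECT count()
FROM hits
WHERE CounterID = 62;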

Reading the Results: Comparisons and Caveats

On the c6a.4xlarge hardware, ClickHouse consistently places among the top two or three systems across the full 43-query suite. DuckDB is genuinely competitive, particularly on simpler queries that fit in memory, and has improved its ClickBench ranking measurably over the past two years. Apache DataFusion, running over Parquet, claimed the top spot for single-node Parquet querying in ClickBench as of late 2024 (v43) — outperforming DuckDB, chDB, and ClickHouse on the standard c6a.4xlarge. This comparison is specifically against ClickHouse reading from Parquet; ClickHouse using its native MergeTree format remains faster. Redshift and BigQuery - being cloud-managed services with separate compute and storage - perform competitively on queries where their distributed execution compensates for network overhead, but generally trail single-node ClickHouse on this particular dataset size because the 100M-row dataset fits in local memory and there's no distribution benefit to exploit. Spark on the same hardware is slower, primarily because JVM startup, task scheduling, and row-oriented execution all introduce overhead that matters at this scale.

The critical limitation of ClickBench is that it represents one dataset with one schema and a narrow query pattern. The schema is deliberately denormalized - a single flat table avoids JOIN costs entirely, which is favorable to ClickHouse's execution model but atypical for normalized data warehouses. If your workload involves multi-table JOINs, ClickBench provides almost no signal about relative performance. The benchmark also tests only sequential single-user query execution. It says nothing about throughput under concurrent load, which is a completely different performance characteristic. A system might return results in 10 seconds for a single query but take 60 seconds per query under 20 concurrent users; ClickBench will not show you this. Managed services like BigQuery can distribute a query across hundreds of machines - the c6a.4xlarge comparison puts them at a structural disadvantage that evaporates at larger dataset sizes or under concurrent load.

Hardware configuration differences between submissions also distort comparisons. ClickBench has expanded beyond the original c6a.4xlarge to include c6a.metal, c8g.4xlarge, and other sizes, but not all systems have results on all hardware. When you compare numbers across systems that ran on different machines, the difference in results can be as large as the difference between systems on the same machine.

Running Benchmarks Against Your Own Workload

ClickBench is a reasonable starting point for a go/no-go evaluation, but running your own workload against a real data sample is the only benchmark that tells you what you actually need to know.

ClickHouse ships with a built-in clickhouse-benchmark tool that handles the mechanics of repeated execution and statistics collection:

clickhouse-benchmark --iterations 10 --concurrency 4 --query "
  SELECT
    toStartOfHour(event_time) AS hour,
    count() AS events,
    uniq(user_id) AS unique_users
  FROM events
  WHERE event_date >= today() - 7
  GROUP BY hour
  ORDER BY hour
"

The --concurrency flag runs parallel clients, so you can probe how the system degrades under load rather than just measuring single-query latency. The tool outputs a spread of percentile latencies (including p50, p95, and p99), which matter more than averages for interactive dashboards where tail latency determines user experience.

For workload replay from production, system.query_log captures every query ClickHouse executes along with execution time, rows read, bytes read, and memory used. You can extract slow queries from a staging or shadow deployment and replay them via clickhouse-benchmark with a file of queries. One wrinkle: clickhouse-benchmark reads one query per line, so newlines inside logged queries need to be collapsed first:

clickhouse-client --query "
  -- collapse newlines so each logged query occupies a single line
  SELECT replaceAll(replaceAll(query, char(10), ' '), char(13), ' ')
  FROM system.query_log
  WHERE type = 'QueryFinish'
    AND query_duration_ms > 500
    AND event_date = today() - 1
  FORMAT LineAsString
" > slow_queries.sql

clickhouse-benchmark --iterations 3 < slow_queries.sql

When designing your own schema for benchmarking, the sort key choice has the largest single impact on query performance. Test your most common filter columns as sort key candidates and measure granule skip rates via system.parts and query logs. The right sort key for your access patterns can produce 10-50x differences in I/O that no amount of hardware upgrades will compensate for.
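
One workable approach, sketched here with hypothetical table names: load the same sample into two tables that differ only in sort key, run the candidate query against each, and compare the I/O recorded in system.query_log:

-- Same data, two sort key candidates.
CREATE TABLE events_by_date ENGINE = MergeTree
ORDER BY (event_date, user_id) AS SELECT * FROM events;

CREATE TABLE events_by_user ENGINE = MergeTree
ORDER BY (user_id, event_date) AS SELECT * FROM events;

-- Run the candidate query against both tables, then compare.
SELECT tables, read_rows, formatReadableSize(read_bytes) AS read, query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND (has(tables, currentDatabase() || '.events_by_date')
    OR has(tables, currentDatabase() || '.events_by_user'))
ORDER BY event_time DESC
LIMIT 10;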

What ClickBench Results Actually Tell You

ClickBench tells you whether a system can execute scan-heavy analytical queries on denormalized columnar data efficiently on a single machine. That is a genuine and common workload - clickstream analytics, event tracking, log aggregation, and user behavior analysis all look roughly like the ClickBench schema. If your data fits that profile, a system that performs well on ClickBench will likely perform well on your data, and ClickHouse's performance on that class of problem is legitimately strong.

The benchmark does not tell you how a system handles normalized schemas with multi-table JOINs, high-cardinality updates, complex nested data structures, or concurrent mixed workloads of reads and writes. DuckDB's competitive performance on ClickBench is real, but DuckDB is an embedded, single-process engine - it does not serve concurrent clients from separate connections without an application layer, and it does not scale horizontally. Redshift's weaker single-node numbers tell you little about its performance on a 100-node cluster against petabyte-scale data. ClickHouse's own cloud offering, ClickHouse Cloud, introduces shared object storage and separate compute scaling that changes the performance profile relative to single-node results.

Treat ClickBench as one data point in an evaluation. Run it to understand order-of-magnitude differences and eliminate clearly unsuitable systems. Then run your own schema, your own queries, and your own data volume before committing to infrastructure that is genuinely difficult to migrate away from later.
