ClickHouse JOIN Performance

ClickHouse was designed for analytical queries over wide, denormalized tables. JOINs work, but every join introduces a hash table build phase, memory pressure, and a departure from the columnar scan path that makes ClickHouse fast. Understanding how ClickHouse executes joins and which settings control that behavior is the difference between a query that completes in milliseconds and one that exhausts memory or times out entirely.

The core model for most join algorithms is two-phase: build a data structure from the right-hand side of the join, then probe it while streaming the left-hand side. This means the right table's size determines memory consumption and build time. Placing the wrong table on the right side can produce a severe slowdown. Starting with ClickHouse 24.12, the query planner automatically reorders two-table joins to place the smaller table on the right, but knowing the mechanics lets you reason about what the planner is doing — and override it when needed.

The Six Join Algorithms

ClickHouse exposes six join algorithms through the join_algorithm setting. The value can be a single algorithm name or a comma-separated priority list: ClickHouse tries each in order and uses the first one applicable to the join type and right-table engine. The default value is 'default', which resolves to the priority list 'direct,parallel_hash,hash' as of ClickHouse 24.12 (before 24.12, it resolved to 'direct,hash').

Direct join

Direct join is the fastest option when it applies. It bypasses hash table construction entirely by performing parallel key-value lookups against a pre-built structure — a Dictionary engine, Join engine, or EmbeddedRocksDB engine. The right-table data structure is already in memory; the join becomes a point lookup rather than a scan.

Restrictions are significant: direct join supports INNER and LEFT joins only (no RIGHT, FULL, or ASOF), and only single-column equality keys with no additional conditions. The join key in the query must exactly match the key attribute of the underlying storage. One-to-many relationships are not supported because dictionaries silently deduplicate duplicate keys, retaining only the last loaded value.

When applicable, the performance difference is dramatic. A flat-layout dictionary direct join benchmarks approximately 25x faster than an equivalent hash join on the same data.

-- Create a dictionary for the lookup table
CREATE DICTIONARY user_dict (
    user_id UInt64,
    user_name String,
    country String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'users'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED());

-- Use direct join against the dictionary
SELECT e.event_id, u.user_name
FROM events AS e
LEFT ANY JOIN user_dict AS u ON e.user_id = u.user_id
SETTINGS join_algorithm = 'direct';

Hash join and parallel_hash join

Hash join reads the entire right table into a single in-memory hash table in one thread, then streams the left table through it. It is effective when the right table fits comfortably in memory. The single-threaded build phase becomes a bottleneck for large right tables.

Parallel hash join splits the input into buckets — one per CPU thread, controlled by max_threads — and builds multiple hash tables concurrently. It is roughly 2x faster than standard hash join on large right tables but uses more than 2x the memory. It supports only INNER and LEFT JOINs; using it with RIGHT or FULL joins falls back to standard hash join.

SELECT e.event_id, u.user_name
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.user_id
SETTINGS join_algorithm = 'parallel_hash';

Grace hash join

Grace hash join is the recommended algorithm for joins where the right table may exceed available memory. It is a two-phase spill-to-disk algorithm: Phase 1 processes one bucket in memory while spilling the rest to disk. Phase 2 loads and joins each remaining bucket sequentially. It supports INNER, LEFT, RIGHT, and FULL JOINs (RIGHT/FULL support was added in the 23.x series).

The grace_hash_join_initial_buckets setting controls the memory-speed tradeoff. ClickHouse always rounds this value up to the nearest power of two (setting 3 results in 4 actual buckets). More buckets means smaller per-bucket memory usage but more disk I/O and slower execution. ClickHouse also dynamically increases bucket count if max_bytes_in_join is exceeded mid-join, up to the limit set by grace_hash_join_max_buckets (default: 1024).

SELECT e.event_id, u.user_name
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.user_id
SETTINGS join_algorithm = 'grace_hash',
         grace_hash_join_initial_buckets = 8,
         max_bytes_in_join = 10000000000;

Full sorting merge join

Full sorting merge is the classical sort-merge join: both tables are sorted by the join key, then joined via interleaved linear scans. It supports all join types (INNER, LEFT, RIGHT, FULL) with ALL and ANY strictness.

Its key advantage is that it can skip the sort phase entirely if the table's physical row order already matches the join keys — which happens when the join key is a prefix of the table's ORDER BY clause. In that case, full sorting merge is competitive with hash join in speed while using a fraction of the memory. When physical order does not match, the sort overhead makes it slower.

SELECT o.order_id, c.customer_name
FROM orders AS o
INNER JOIN customers AS c ON o.customer_id = c.customer_id
SETTINGS join_algorithm = 'full_sorting_merge';

A pre-filtering optimization builds an in-memory set of join key values to filter both tables before sorting. The max_rows_in_set_to_optimize_join setting controls this pre-filter; it defaults to 0 (disabled) as of ClickHouse 24.2, because the optimization can prevent read-in-order optimizations. Set it to a non-zero value to enable the pre-filter.

Partial merge join

Partial merge join uses the least memory of any algorithm. It always sorts the entire right table via external sorting first, then sorts the left table block-by-block in memory, using min-max indexes on the sorted disk blocks to minimize in-memory processing. The trade-off is speed: partial merge is the slowest algorithm and should be used only under hard memory constraints.

It supports INNER and LEFT JOINs with ANY strictness and INNER, LEFT, RIGHT, and FULL JOINs with ALL strictness.

Note: join_algorithm = 'auto' is a special mode that starts with hash join and switches to partial merge in flight if max_bytes_in_join is exceeded. For graceful memory-bounded joins, grace hash join with explicit bucket tuning is preferable to auto, because the in-flight algorithm switch in auto mode can cause a significant and unpredictable performance drop.

Algorithm Selection Guide

Scenario Recommended algorithm
Right table is a Dictionary / EmbeddedRocksDB, INNER or LEFT JOIN direct
Right table fits in memory, INNER or LEFT JOIN parallel_hash
Right table may exceed memory, any join type grace_hash
Both tables sorted by join key, any join type full_sorting_merge
Memory is critically constrained, any join type partial_merge
Safety net for mixed workloads (with max_bytes_in_join set) auto

Filter Before You Join

Reducing the volume of data entering the join is almost always more impactful than tuning the algorithm itself. A documented ClickHouse benchmark illustrates this: a naive multi-table join ran in 56 seconds. Reordering the join to put the smaller table on the right reduced it to 1.5 seconds. Adding date filters to both sides reduced it further to 0.6 seconds. Rewriting as an IN subquery achieved 0.38 seconds — a 147x improvement from the same data and the same result set.

Apply WHERE conditions to each side of the join independently. Do not rely solely on the query planner's predicate pushdown; writing explicit pre-join filters makes the optimization deterministic.

Use IN subqueries instead of INNER JOIN when you only need to filter rows, not retrieve columns from the second table. An IN subquery can exploit primary key indexes on the filter column independently, reducing rows read before the join is evaluated.

-- Slower: INNER JOIN builds a full hash table with row payloads
SELECT p.post_id, p.title
FROM posts AS p
ANY INNER JOIN users AS u ON p.user_id = u.user_id
WHERE u.country_code = 'US';

-- Faster: IN subquery uses primary key index, no payload in the set
SELECT post_id, title
FROM posts
WHERE user_id IN (
    SELECT user_id
    FROM users
    WHERE country_code = 'US'
);

A benchmark on equivalent queries showed 8.2 seconds for the JOIN form versus 2.3 seconds for the IN subquery form on the same dataset.

Use ANY JOIN instead of ALL JOIN when only the first matching row per key is needed. ALL JOIN (the default, consistent with SQL standard) produces a Cartesian product for duplicate matches, which can multiply result size and memory unexpectedly.

-- ALL JOIN (default): may produce many rows if right table has duplicates
SELECT user_id, order_id FROM users ALL LEFT JOIN orders USING (user_id);

-- ANY JOIN: returns at most one match per left row, faster and smaller
SELECT user_id, order_id FROM users LEFT ANY JOIN orders USING (user_id);

Join Order and the Query Planner

The right side of a hash join is the build side. The larger that table, the larger the hash table and the more memory required. Always place the smaller table on the right.

As of ClickHouse 24.12, the query planner automatically reorders two-table joins via the query_plan_join_swap_table setting (default: 'auto'). For three or more tables, global join reordering using a greedy algorithm was introduced in ClickHouse 25.9, controlled by query_plan_optimize_join_order_limit and allow_statistics_optimize. Benchmarks showed up to 1450x speedup and 25x less memory with correct join ordering on multi-table queries.

If you are on a version prior to 24.12, or joining three or more tables on any version before 25.9, table order is your responsibility. Use EXPLAIN PLAN to verify which order the planner chose:

EXPLAIN PLAN
SELECT e.event_id, u.user_name
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.user_id
SETTINGS join_algorithm = 'hash';

Memory Settings

Two settings bound hash table size: max_rows_in_join and max_bytes_in_join. Both default to 0 (unlimited). When either is exceeded, join_overflow_mode determines the response: 'throw' (default) raises an exception; 'break' returns partial results.

Setting max_bytes_in_join to a meaningful value serves two purposes: it protects against OOM when using auto mode (triggering the fallback to partial_merge), and it triggers dynamic bucket expansion in grace hash join.

-- Set a memory limit with exception on overflow
SELECT e.event_id, u.user_name
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.user_id
SETTINGS max_bytes_in_join = 10737418240,   -- 10 GiB
         join_overflow_mode = 'throw';

join_use_nulls defaults to 0. With this default, unmatched outer join cells are filled with the column's type default value: 0 for numeric types, empty string for strings. Setting it to 1 produces ANSI SQL-standard NULL values for unmatched rows, converting affected columns to Nullable — which carries memory overhead, disables certain optimizations, and does not support inequality join conditions. Use the default unless NULL semantics are required for correctness.

Denormalization and Dictionaries

For hot query paths that always join the same tables, consider eliminating the join entirely at query time.

Dictionaries hold dimension data in memory as a key-value structure that is refreshed on a schedule. Use dictGet() for point lookups without any join syntax at all:

SELECT
    event_id,
    dictGet('user_dict', 'user_name', user_id) AS user_name,
    dictGet('user_dict', 'country', user_id) AS country
FROM events;

One critical constraint: dictionaries silently deduplicate keys on load, retaining only the last value per key. Never use a dictionary for one-to-many or many-to-many relationships — the data loss is silent.

Materialized views shift join computation from query time to insert time. This adds complexity to the write path but eliminates join overhead from every subsequent read:

CREATE MATERIALIZED VIEW events_enriched
ENGINE = MergeTree()
ORDER BY (event_id)
AS
SELECT e.event_id, e.created_at, u.name AS user_name, u.country
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.id;

Note that each time a query contains a JOIN subquery, ClickHouse re-executes it — subquery results are not cached between query runs. For static lookup tables, the Join table engine or dictionaries avoid this re-execution overhead.

Distributed JOINs

In a distributed ClickHouse cluster, a regular JOIN sends the right-side subquery to every shard independently. Each shard builds its own copy of the right-side hash table. For a large cluster and a non-trivial right table, this produces significant redundant computation and memory pressure.

GLOBAL JOIN computes the right side once on the initiator node and broadcasts it as a temporary table to all shards, eliminating redundant computation:

-- Use GLOBAL JOIN when the right table is small relative to cluster size
SELECT l.order_id, r.product_name
FROM distributed_orders AS l
GLOBAL INNER JOIN small_products AS r ON l.product_id = r.product_id;

GLOBAL JOIN is only beneficial when the right side is small enough to fit on the initiator node and all shards as a broadcast. For large right tables, the network transfer cost of the broadcast becomes the bottleneck. ClickHouse does not perform shuffle joins (redistributing both sides by join key across nodes); this is a known limitation as of 2025.

Diagnosing Join Performance

The system.query_log table is the primary diagnostic surface for join performance issues. The ProfileEvents column captures join-specific metrics including disk spill:

SELECT
    query,
    formatReadableTimeDelta(query_duration_ms / 1000) AS duration,
    formatReadableSize(memory_usage) AS peak_memory,
    formatReadableSize(ProfileEvents['ExternalProcessingUncompressedBytesTotal']) AS spilled_uncompressed,
    ProfileEvents['ExternalJoinWritePart'] AS join_spill_parts
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%JOIN%'
  AND event_date >= today() - 1
ORDER BY memory_usage DESC
LIMIT 20;

Key ProfileEvents for join diagnostics:

  • ExternalJoinWritePart — number of data parts written to disk during a join spill (non-zero means grace_hash or partial_merge spilled)
  • ExternalJoinUncompressedBytes / ExternalJoinCompressedBytes — join-specific spill volume
  • ExternalProcessingUncompressedBytesTotal — total data spilled to disk across all external processing

For monitoring currently running join queries in real time:

SELECT
    query_id,
    elapsed,
    formatReadableSize(memory_usage) AS memory,
    formatReadableSize(peak_memory_usage) AS peak_memory,
    query
FROM system.processes
WHERE query ILIKE '%JOIN%'
ORDER BY memory_usage DESC;

When using direct join against dictionaries, note that system.query_log memory_usage does not include dictionary memory. Check system.dictionaries to account for it:

SELECT
    database, name, type, status,
    formatReadableSize(bytes_allocated) AS dict_memory,
    element_count
FROM system.dictionaries
WHERE status = 'LOADED'
ORDER BY bytes_allocated DESC;

How Pulse Helps

Pulse continuously monitors your ClickHouse clusters and surfaces join performance issues before they cause outages. Pulse identifies queries using memory-intensive join algorithms, detects disk spill patterns in system.query_log, and recommends specific algorithm changes — including whether a dictionary or materialized view would eliminate the join entirely. Instead of diagnosing join regressions manually after they impact users, Pulse provides proactive recommendations with the context needed to act.

Frequently Asked Questions

Q: What is the default join algorithm in ClickHouse 24.12 and later?
A: The default join_algorithm value is 'default', which resolves to the priority list 'direct,parallel_hash,hash' in ClickHouse 24.12 and later. Before 24.12, 'default' resolved to 'direct,hash'. ClickHouse tries each algorithm in the list and uses the first one applicable to the join type and right-table engine.

Q: When should I use grace_hash instead of parallel_hash?
A: Use parallel_hash when the right table fits in memory and you want the fastest possible hash join for INNER or LEFT joins. Use grace_hash when the right table may exceed available memory — it spills to disk gracefully rather than failing with an OOM error, and it supports INNER, LEFT, RIGHT, and FULL joins. full_sorting_merge is the alternative when both inputs are already sorted by the join key.

Q: Does join_algorithm='auto' fall back to grace_hash when memory is exceeded?
A: No. The auto mode falls back to partial_merge, not grace_hash. The fallback only triggers if max_bytes_in_join is set to a non-zero value — the default is 0 (unlimited), meaning the fallback never fires without explicit configuration. For a graceful memory-bounded join, use grace_hash directly with grace_hash_join_initial_buckets tuned to your memory budget.

Q: I set grace_hash_join_initial_buckets = 3 but it behaves like 4 buckets. Why?
A: ClickHouse always rounds grace_hash_join_initial_buckets up to the nearest power of two. Setting 3 produces 4 actual buckets; setting 5 produces 8. Plan bucket counts as powers of two: 1, 2, 4, 8, 16, and so on.

Q: Can I use direct join for an INNER JOIN against a dictionary?
A: Yes. Direct join supports INNER and LEFT joins (it does not support RIGHT, FULL, or ASOF, and requires a single-column equality key). It must run against a key-value right side — a Dictionary, Join engine, or EmbeddedRocksDB table. For lookups outside those constraints, use dictGet() with a WHERE filter or a hash/parallel_hash join against the dictionary table directly.

Q: Why does rewriting an INNER JOIN as an IN subquery sometimes perform better?
A: An IN subquery builds a compact set of keys (no column payloads) and ClickHouse can exploit primary key indexes on the filter column of both the outer table and the subquery independently. An INNER JOIN builds a full hash table that includes all projected columns from the right side. For filter-only joins where no right-side columns appear in the SELECT list, IN is almost always faster.

Q: How do I verify which join algorithm ClickHouse actually used?
A: Use EXPLAIN PLAN before running the query to inspect the logical plan. After execution, check system.query_log ProfileEvents for non-zero values in ExternalJoinWritePart (disk spill occurred, indicating grace_hash or partial_merge ran) and compare memory_usage values across runs to confirm algorithm behavior.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.