ClickHouse JOINs: Memory-Bound Performance and OOM Errors

Q: What is the difference between `hash`, `parallel_hash`, `grace_hash`, and `partial_merge`?

hash loads the entire RHS into a single in-memory hash table and streams the LHS through it. parallel_hash (default since 24.11) builds multiple hash tables in parallel using max_threads buckets — roughly 2x faster but uses more than 2x the memory. grace_hash hashes rows into buckets and keeps only one bucket in memory at a time, spilling the rest to disk — it is non-memory-bound. partial_merge sorts the RHS entirely to disk with a min-max index and sorts LHS blocks in memory — it is the slowest option but uses the least memory.

Q: Does `join_algorithm='auto'` fall back to `grace_hash` when memory is exceeded?

No. The auto mode falls back to partial_merge , not grace_hash . To get disk-spill behavior with auto , you must set max_bytes_in_join to a non-zero value (the default is 0 , which means unlimited, so the fallback never triggers without this setting).

Q: I set `grace_hash_join_initial_buckets = 3` but the behavior seems like 4 buckets. Is that a bug?

No, this is by design. ClickHouse always rounds grace_hash_join_initial_buckets up to the nearest power of two. Setting it to 3 results in 4 buckets; setting it to 5 results in 8 buckets. Plan your bucket counts accordingly.

Q: My join query is slow but does not OOM. Could the algorithm still be the cause?

Yes. partial_merge (used as the auto fallback) sorts the RHS to disk and is much slower than hash join. If auto mode triggered a fallback to partial_merge due to a max_bytes_in_join limit, the query will complete but slowly. Similarly, a grace_hash join with many buckets involves significant disk I/O. Check system.query_log ProfileEvents for ExternalSortWritePart or HashJoinBuckets to confirm which algorithm ran and whether disk spill occurred.

ClickHouse JOIN queries can hit Memory limit exceeded errors or degrade to unacceptable runtimes with no obvious cause. A query that runs fine on a filtered dataset may OOM when filters are loosened. A join that works in development may kill production when table sizes grow. The root cause is almost always that ClickHouse's default join algorithm materializes the entire right-hand side of the join into an in-memory hash table — and that table has to fit within the query's memory budget.

What This Error Means

ClickHouse's default join algorithm is hash join (or parallel_hash as of version 24.11). The execution model works as follows:

The right-hand side (RHS) of the join is read in full and loaded into an in-memory hash table, keyed on the join column(s).
The left-hand side (LHS) is then streamed row by row, and each row is probed against the hash table.

The hash table must fit in memory entirely. If it exceeds the query's max_memory_usage limit (default: 10 GB per query in standard production configurations), ClickHouse raises:

DB::Exception: Memory limit (for query) exceeded: would use N.NN GiB
(attempt to allocate chunk of M bytes), maximum: 10.00 GiB

As of ClickHouse 24.11, the default join_algorithm is parallel_hash, which builds multiple hash tables concurrently using max_threads buckets. This is roughly 2x faster than plain hash join but uses more than 2x the memory — making OOM failures even more likely on large joins than they were with the older default.

Common Causes

Large unfiltered RHS table. If the RHS has millions or billions of rows and is not pre-filtered, the entire result set is loaded into the hash table before any LHS rows are processed. A WHERE clause on the LHS does not reduce the RHS.
parallel_hash memory amplification (ClickHouse 24.11+). The default algorithm now builds multiple hash tables in parallel. Each concurrent hash table consumes memory independently, meaning the total memory for the build phase can exceed 2x what a single hash join would use.
Wrong table order. Before ClickHouse 24.12, the query planner did not reorder join sides. Placing a large table on the right and a small table on the left meant the large table was always used as the build side. In versions prior to 24.12, developers had to manually ensure the smaller table was on the right.
Subquery on the RHS is fully materialized. When the RHS is a subquery, ClickHouse fully evaluates and materializes it before the join begins. Even if only a few rows would ultimately match, the entire subquery result is loaded into the hash table.
max_bytes_in_join and max_rows_in_join are unlimited by default. Both settings default to 0 (unlimited), so no automatic spill or error is triggered by hash table size alone. The query runs until it hits max_memory_usage instead, which may be a coarse limit with no spill behavior.
Using join_algorithm='auto' without setting max_bytes_in_join. The auto mode starts with hash join and only falls back to partial_merge when max_bytes_in_join is exceeded. If max_bytes_in_join is 0 (the default), the fallback never triggers and the query proceeds as a pure hash join until it OOMs.

How to Fix

1. Switch to `grace_hash` for large joins

grace_hash is the best general-purpose fix for memory-bound joins. It proactively spills non-active buckets to temporary disk storage — only one bucket is processed in memory at a time. It was introduced in ClickHouse 22.12.

SELECT a.id, b.name
FROM large_events AS a
INNER JOIN large_dimension AS b ON a.dim_id = b.id
SETTINGS join_algorithm = 'grace_hash';

To tune memory usage, increase the initial bucket count (ClickHouse rounds it up to the nearest power of two):

SELECT a.id, b.name
FROM large_events AS a
INNER JOIN large_dimension AS b ON a.dim_id = b.id
SETTINGS join_algorithm = 'grace_hash', grace_hash_join_initial_buckets = 8;

More buckets means smaller per-bucket memory footprint but more disk I/O. ClickHouse also automatically doubles the bucket count if an individual bucket still exceeds memory limits during processing.

2. Use `auto` mode with an explicit memory cap

auto mode starts with hash join and falls back to partial_merge if the hash table exceeds max_bytes_in_join. This is useful when most queries fit in memory but you want a safety net:

SELECT a.id, b.value
FROM table_a AS a
INNER JOIN table_b AS b ON a.key = b.key
SETTINGS join_algorithm = 'auto', max_bytes_in_join = 5000000000;

Note: auto falls back to partial_merge, not grace_hash. partial_merge sorts the RHS to disk and is slower but uses minimal memory.

3. Use `partial_merge` for extreme memory constraints

partial_merge sorts the entire right table externally to disk using a min-max index per sorted block. It uses very little memory but is significantly slower than hash-based approaches. Use it when memory is critically constrained and query latency is acceptable.

SELECT a.id, b.value
FROM table_a AS a
INNER JOIN table_b AS b ON a.key = b.key
SETTINGS join_algorithm = 'partial_merge';

Be aware that partial_merge does not support all join types. It only supports ALL strictness for RIGHT and FULL joins, and has limited support for ANY and SEMI on INNER and LEFT joins. It does not support RIGHT ANY, FULL ANY, SEMI, ANTI, or ASOF strictness.

4. Replace INNER JOIN with an IN subquery for filter-style joins

When an INNER JOIN is logically just a filter (i.e., you are selecting columns from one table and using the other only to restrict rows), an IN subquery is often faster. ClickHouse can exploit primary key indexes on both sides more efficiently:

-- Instead of:
SELECT n.station_id, n.date, n.tempMax
FROM noaa AS n
INNER JOIN stations AS s ON n.station_id = s.station_id
WHERE s.country_code = 'US';

-- Use:
SELECT station_id, date, tempMax
FROM noaa
WHERE station_id IN (
    SELECT station_id FROM stations WHERE country_code = 'US'
);

5. Use dictionaries for dimension table lookups

For small, frequently-joined dimension tables (country codes, product categories, user metadata), load the table as a ClickHouse Dictionary and use dictGet() or a JOIN against the Dictionary engine. This eliminates hash table construction entirely and is typically 6x or more faster:

SELECT
    event_id,
    dictGet('users_dict', 'name', user_id) AS user_name,
    dictGet('users_dict', 'country', user_id) AS country
FROM events;

6. Pre-join in a materialized view

For access patterns that always join the same tables, materialize the result:

CREATE MATERIALIZED VIEW events_enriched
ENGINE = MergeTree()
ORDER BY (event_id)
AS
SELECT e.event_id, e.created_at, u.name AS user_name, u.country
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.id;

7. Reorder tables manually (ClickHouse versions before 24.12)

In ClickHouse 24.12, the query planner began automatically placing the smaller table on the right (build side) for two-table joins. In earlier versions, table order is significant — always put the smaller table on the right:

-- Correct: smaller table (users) on the right
SELECT e.id, u.name
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.id;

-- Wrong: large table on the right causes large hash table
SELECT u.name, e.id
FROM users AS u
INNER JOIN events AS e ON u.id = e.user_id;

Root-Cause Analysis

Find recently failed OOM queries

SELECT
    event_time,
    user,
    query,
    formatReadableSize(memory_usage) AS mem,
    exception
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
    AND exception LIKE '%Memory limit%'
    AND event_time >= (now() - toIntervalDay(7))
ORDER BY event_time DESC
LIMIT 20;

Find the most memory-intensive JOIN queries (completed)

SELECT
    count() AS nb_query,
    user,
    query,
    sum(memory_usage) AS total_memory,
    normalized_query_hash
FROM system.query_log
WHERE event_time >= (now() - toIntervalDay(1))
    AND query_kind = 'Select'
    AND type = 'QueryFinish'
GROUP BY normalized_query_hash, query, user
ORDER BY total_memory DESC
LIMIT 20;

Inspect currently running queries

SELECT
    initial_query_id,
    query,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    formatReadableSize(peak_memory_usage) AS peak_mem
FROM system.processes
ORDER BY peak_memory_usage DESC
LIMIT 10;

Check for large in-memory table engine objects

Persistent Join, Set, or Memory engine tables contribute to server-wide memory pressure independently of query memory limits:

SELECT database, name, engine, formatReadableSize(total_bytes) AS size
FROM system.tables
WHERE engine IN ('Memory', 'Set', 'Join')
ORDER BY total_bytes DESC;

Check server-wide memory pressure

SELECT metric, description, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric LIKE '%Mem%' OR metric LIKE '%Cach%'
ORDER BY value DESC;

Preventive Measures

Filter the RHS of hash joins as early as possible; WHERE clauses on the LHS do not reduce RHS memory usage.
Use grace_hash by default for any join where the RHS exceeds a few hundred million rows.
Set max_bytes_in_join to a meaningful value (e.g., 5 GB) when using auto mode so the fallback actually triggers.
Load small, stable dimension tables as Dictionaries and use dictGet() instead of building hash tables at query time.
On versions before 24.12, always place the smaller table on the right side of the JOIN.
Avoid parallel_hash (the 24.11+ default) for large RHS tables unless you have confirmed sufficient memory headroom, as it uses more than 2x the memory of plain hash.
Monitor system.query_log for queries where memory_usage is climbing toward max_memory_usage limits.

Resolve JOIN Memory Issues Automatically with Pulse

Pulse continuously monitors your ClickHouse clusters and detects JOIN-related memory pressure before it causes outages. Pulse identifies queries using memory-intensive join algorithms on large tables, surfaces OOM patterns in system.query_log, and recommends specific algorithm changes and settings adjustments. Instead of hunting through logs and query plans manually, Pulse gives you actionable diagnostics with context about which queries to fix and how.

Frequently Asked Questions

Q: Why does ClickHouse load the entire right-hand side into memory for a JOIN?
A: Hash joins are fast when the build side fits in memory — probing a hash table is O(1) per row. ClickHouse defaults to optimizing for speed. The trade-off is that large RHS tables require proportionally large memory. Algorithms like grace_hash and partial_merge trade speed for a bounded memory footprint.

Q: What is the difference between hash, parallel_hash, grace_hash, and partial_merge?
A: hash loads the entire RHS into a single in-memory hash table and streams the LHS through it. parallel_hash (default since 24.11) builds multiple hash tables in parallel using max_threads buckets — roughly 2x faster but uses more than 2x the memory. grace_hash hashes rows into buckets and keeps only one bucket in memory at a time, spilling the rest to disk — it is non-memory-bound. partial_merge sorts the RHS entirely to disk with a min-max index and sorts LHS blocks in memory — it is the slowest option but uses the least memory.

Q: Does join_algorithm='auto' fall back to grace_hash when memory is exceeded?
A: No. The auto mode falls back to partial_merge, not grace_hash. To get disk-spill behavior with auto, you must set max_bytes_in_join to a non-zero value (the default is 0, which means unlimited, so the fallback never triggers without this setting).

Q: I set grace_hash_join_initial_buckets = 3 but the behavior seems like 4 buckets. Is that a bug?
A: No, this is by design. ClickHouse always rounds grace_hash_join_initial_buckets up to the nearest power of two. Setting it to 3 results in 4 buckets; setting it to 5 results in 8 buckets. Plan your bucket counts accordingly.

Q: Since ClickHouse 24.12 reorders join sides automatically, do I still need to worry about table order?
A: For two-table joins on 24.12+, the planner handles reordering automatically. For three or more tables, automatic reordering requires 25.9+. If you are on an earlier version, you must manually place the smaller table on the right side. Even on newer versions, pre-filtering the RHS before it enters the join is still important for reducing hash table size.

Q: My join query is slow but does not OOM. Could the algorithm still be the cause?
A: Yes. partial_merge (used as the auto fallback) sorts the RHS to disk and is much slower than hash join. If auto mode triggered a fallback to partial_merge due to a max_bytes_in_join limit, the query will complete but slowly. Similarly, a grace_hash join with many buckets involves significant disk I/O. Check system.query_log ProfileEvents for ExternalSortWritePart or HashJoinBuckets to confirm which algorithm ran and whether disk spill occurred.