ClickHouse JOIN Types

ClickHouse supports the full set of standard SQL join types and adds several purpose-built extensions. Understanding which type to use — and how the strictness modifier changes behavior — is essential for writing correct, efficient queries. The default behavior differs from most SQL databases in two important ways: unmatched rows are filled with type defaults rather than NULL, and INNER JOIN produces a cartesian product by default when multiple rows match on the right side.

Syntax

The general form of a ClickHouse join is:

SELECT ...
FROM left_table
[GLOBAL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI|ANY|ALL|ASOF] JOIN right_table
    ON <expr_list> | USING <column_list>

The strictness keyword (ALL, ANY, SEMI, ANTI, ASOF) sits between the direction and the JOIN keyword. OUTER is optional for LEFT, RIGHT, and FULL joins. A bare JOIN with no type qualifier is treated as INNER JOIN. Join conditions use either ON (with explicit expressions) or USING (with a list of identically-named columns from both tables).

Two settings control default behavior:

join_use_nulls (default: 0): When 0, unmatched cells are filled with the type's default value (0 for integers, empty string for strings). Set to 1 for ANSI SQL NULL behavior. Enabling this adds Nullable column overhead.
join_default_strictness (default: '', treated as ALL): Controls the strictness when none is specified. The empty-string default means ALL, not ANY.

How It Works

ClickHouse executes joins by building an in-memory structure from one side (typically the right side) and probing it with rows from the other side. The join algorithm (join_algorithm setting, whose default direct,parallel_hash,hash priority list selects parallel_hash for most equi-joins since ClickHouse 24.11) determines the structure used. The strictness modifier determines what happens when a probe finds multiple matches.

ALL strictness (the default): Every matching combination is emitted. If a left row matches three right rows, three result rows are produced. This is a cartesian product over matches — not over the full tables, but over matched rows.

ANY strictness: Only the first matching right row is used per left-side row. The cartesian product over matches is disabled.

SEMI strictness: Only existence is checked. Columns from the non-returning side do not appear in output, and matched rows are not duplicated.

ANTI strictness: The inverse of SEMI — returns rows with no match on the opposite side.

ASOF strictness: Non-exact matching on an ordered column alongside an exact equality condition.

NULL values in join keys never match each other. Standard NULL != NULL semantics apply. To make NULL keys match, use isNotDistinctFrom() in the ON clause.

Standard SQL Joins

INNER JOIN

Returns only rows where the join key matches in both tables. With ALL strictness (the default), multiple matching right rows produce a cartesian product over those matches.

-- INNER JOIN with ALL (default): one result row per order-customer pair
SELECT o.order_id, c.name
FROM orders AS o
INNER JOIN customers AS c ON o.customer_id = c.id;

If a customer appears in the right table twice (duplicate key), each order row for that customer will appear twice in the result. To prevent this, use INNER ANY JOIN.

-- INNER ANY JOIN: at most one result row per left-table row
SELECT o.order_id, c.name
FROM orders AS o
INNER ANY JOIN customers AS c ON o.customer_id = c.id;

LEFT OUTER JOIN

Returns all rows from the left table. For left rows with no match in the right table, right-side columns are filled with default values (0, empty string, etc.) unless join_use_nulls = 1, in which case they are filled with NULL.

-- LEFT JOIN: all orders, with customer name where available
SELECT o.order_id, c.name
FROM orders AS o
LEFT JOIN customers AS c ON o.customer_id = c.id;

LEFT ANY JOIN combines LEFT OUTER semantics with first-match-only behavior: every left row appears in output, matched with the first right-side row found, or with default column values if no match exists.

RIGHT OUTER JOIN

Mirror of LEFT OUTER JOIN. All rows from the right table are returned; unmatched left columns are filled with defaults or NULL.

FULL OUTER JOIN

Returns all rows from both tables. Unmatched rows from either side are padded with defaults (or NULL when join_use_nulls = 1) for the opposite side's columns.

SELECT m.name, g.genre
FROM movies AS m
FULL OUTER JOIN genres AS g ON m.id = g.movie_id;

CROSS JOIN

Produces the full cartesian product of two tables. No ON or USING clause is used. The alternative comma-separated syntax in the FROM clause is treated identically.

SELECT a.name, b.name
FROM table_a AS a
CROSS JOIN table_b AS b;

-- Equivalent comma syntax
SELECT a.name, b.name
FROM table_a AS a, table_b AS b;

When cross_to_inner_join_rewrite = 1 (default), ClickHouse automatically rewrites CROSS JOIN to INNER JOIN if the WHERE clause contains join equality predicates.

SEMI and ANTI Joins

SEMI JOIN and ANTI JOIN filter rows based on the existence or absence of a match without returning columns from the opposite side and without duplicating rows.

LEFT SEMI JOIN: Returns left rows that have at least one match in the right table. Right-side columns do not appear in output. Equivalent to WHERE col IN (SELECT ...).

LEFT ANTI JOIN: Returns left rows that have no match in the right table. Inverse of LEFT SEMI JOIN. Equivalent to WHERE col NOT IN (SELECT ...).

RIGHT SEMI JOIN: Returns right rows that have at least one match in the left table. Left-side columns do not appear.

RIGHT ANTI JOIN: Returns right rows with no match in the left table.

-- Users who placed at least one order
SELECT u.id, u.name
FROM users AS u
LEFT SEMI JOIN orders AS o ON u.id = o.user_id;

-- Users who never placed an order
SELECT u.id, u.name
FROM users AS u
LEFT ANTI JOIN orders AS o ON u.id = o.user_id;

SEMI JOIN is more explicit and efficient for existence checks than INNER JOIN with a subsequent DISTINCT or GROUP BY. It does not require post-processing to remove duplicates because duplicates are never generated.

Note that algorithm support for SEMI and ANTI strictness varies. The partial_merge algorithm supports LEFT SEMI and INNER SEMI but does not support ANTI in any direction or RIGHT SEMI; full_sorting_merge supports neither SEMI nor ANTI. This matters under join_algorithm = 'auto', which falls back to partial_merge under memory pressure: a LEFT SEMI JOIN still runs after the fallback, but ANTI JOIN and RIGHT SEMI JOIN queries fail. Use grace_hash for memory-safe spill-to-disk behavior with SEMI and ANTI joins in any direction.

ANY JOIN and Strictness Modifiers

The strictness modifier is the mechanism that controls cartesian-product behavior for multi-match scenarios. LEFT ANY JOIN is a useful hybrid: it returns all left rows (including non-matching ones) but uses only the first matching right row for each left row, disabling the cartesian product.

-- LEFT ANY JOIN: every left row, first matching right row (or defaults)
SELECT m.name, g.genre
FROM movies AS m
LEFT ANY JOIN genres AS g ON m.id = g.movie_id;

The any_join_distinct_right_table_keys setting (default 0) affects ANY INNER JOIN behavior. When 0, ANY INNER JOIN returns one row per key from both tables and t1 ANY LEFT JOIN t2 is equivalent to t2 ANY RIGHT JOIN t1. When 1 (legacy behavior), ANY INNER JOIN acts like LEFT SEMI JOIN.

ASOF JOIN

ASOF JOIN performs non-exact matching on ordered sequences, most commonly time series. It requires an ON clause combining an equality condition (exact key match) and an inequality condition (closest match) connected by AND. The inequality operators >, >=, <, and <= are all supported.

The asof column must be one of: Int, UInt, Float, Date, DateTime, or Decimal.

-- For each trade, find the most recent quote at or before the trade time
SELECT
    t.symbol,
    t.volume,
    t.time AS trade_time,
    q.time AS quote_time,
    q.price
FROM trades AS t
ASOF LEFT JOIN quotes AS q
    ON t.symbol = q.symbol
    AND t.time >= q.time;

ASOF INNER JOIN returns only left rows that have a qualifying match; ASOF LEFT JOIN returns all left rows with default values when no match exists.

When using USING syntax, the asof column must be the last column listed: USING (symbol, time).

ASOF JOIN is only supported by the hash and full_sorting_merge algorithms. parallel_hash (which the default direct,parallel_hash,hash priority list selects for most equi-joins since 24.11) and grace_hash do not support ASOF. Specify the algorithm explicitly for these queries:

SET join_algorithm = 'hash';
SELECT t.symbol, q.price
FROM trades AS t
ASOF INNER JOIN quotes AS q
    ON t.symbol = q.symbol AND t.time >= q.time;

PASTE JOIN

PASTE JOIN (added in ClickHouse 23.12) concatenates two result sets horizontally by row position, with no join key. Row N from the left is paired with row N from the right. If the two sides return different numbers of rows, the result is truncated to the shorter count.

SELECT *
FROM (SELECT number AS a FROM numbers(5)) AS t1
PASTE JOIN (SELECT number AS b FROM numbers(5) ORDER BY b DESC) AS t2;
-- Row 0: a=0, b=4
-- Row 1: a=1, b=3
-- Row 2: a=2, b=2
-- Row 3: a=3, b=1
-- Row 4: a=4, b=0

Results can be nondeterministic when parallel reads are involved, since row order depends on execution scheduling. Use explicit ORDER BY on both subqueries to stabilize results.

ARRAY JOIN

ARRAY JOIN is a separate clause that unpacks array-type columns into multiple rows — one row per array element — replicating the values of all non-array columns. It is not a join between two tables; it is an array expansion operator.

-- Expand the tags array into individual rows
SELECT event_name, tag
FROM events
ARRAY JOIN tags AS tag;

If tags is ['error', 'timeout'], this produces two rows with the same event_name, one with tag = 'error' and one with tag = 'timeout'.

ARRAY JOIN (without LEFT): Rows where the array is empty or NULL are excluded from the result.

LEFT ARRAY JOIN: Rows with empty or NULL arrays are included; the expanded column is filled with the type's default value.

Multiple arrays in a single ARRAY JOIN clause are expanded positionally (element-wise), not as a cartesian product. By default, all arrays must have the same length. Set enable_unaligned_array_join = 1 to allow arrays of different sizes, with shorter arrays padded with defaults.

-- Positional expansion of two equal-length arrays
SELECT event_id, tag, score
FROM events
ARRAY JOIN tags, scores;

NATURAL JOIN

NATURAL JOIN (added in ClickHouse 26.4) automatically joins on all columns with identical names in both tables. It is equivalent to USING with auto-detected shared columns. Shared columns appear only once in the output.

SELECT *
FROM employees
NATURAL JOIN departments;
-- Implicitly: USING (department_id) if that is the only shared column

GLOBAL JOIN for Distributed Queries

On a distributed cluster, the default behavior evaluates the right-side table independently on each shard. If the right-side table is also distributed, this can cause N-squared network traffic. GLOBAL JOIN solves this by evaluating the right-side subquery once on the initiator node and broadcasting the result as a temporary table to all shards.

SELECT events.user_id, events.action, users.plan
FROM distributed_events AS events
GLOBAL LEFT JOIN users_table AS users ON events.user_id = users.id;

Use GLOBAL JOIN when the right side is a small lookup or dimension table. For large right-side tables, the broadcast cost dominates and GLOBAL JOIN is not appropriate.

Performance Considerations

Algorithm selection: Since ClickHouse 24.11, the default join_algorithm (the direct,parallel_hash,hash priority list) selects parallel_hash for most equi-joins, which builds multiple hash tables concurrently. It supports INNER and LEFT joins with all strictness settings except ASOF. For ASOF JOIN, use hash or full_sorting_merge. For memory-safe spill-to-disk behavior, use grace_hash.
Right-side table size: Hash-based algorithms hold the right-side table in memory. Keep the right side small, or filter it aggressively before the join.
join_use_nulls overhead: Setting join_use_nulls = 1 adds a UInt8 column per Nullable field for tracking null state, increasing memory usage and processing time. ClickHouse documentation recommends using type default values as null markers instead of enabling this setting.
Memory limits: Use max_bytes_in_join or max_rows_in_join to cap the hash table size. Set join_overflow_mode = 'break' to return partial results on overflow rather than throwing an error.
Diagnosing join queries: Use EXPLAIN PIPELINE to inspect the algorithm chosen, and query system.query_log for memory usage and duration of past join queries.

-- Inspect join-related settings in the current session
SELECT name, value, default, changed
FROM system.settings
WHERE name IN (
    'join_use_nulls',
    'join_default_strictness',
    'join_algorithm',
    'any_join_distinct_right_table_keys',
    'cross_to_inner_join_rewrite',
    'max_rows_in_join',
    'max_bytes_in_join',
    'join_overflow_mode'
);

-- Find recent memory-intensive JOIN queries
SELECT
    query_id,
    user,
    memory_usage,
    query_duration_ms,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
    AND memory_usage > 1073741824
    AND positionCaseInsensitive(query, 'JOIN') > 0
ORDER BY memory_usage DESC
LIMIT 10;

How Pulse Helps

Pulse monitors ClickHouse query performance and automatically surfaces slow or memory-intensive JOIN queries — including those using inefficient algorithms, producing unexpectedly large hash tables, or falling back to unsupported algorithm combinations. When a join-related issue appears in your cluster, Pulse identifies it with context from system.query_log and system.processes so you can diagnose and resolve it without manual log analysis.

Frequently Asked Questions

Q: Why does my INNER JOIN return more rows than I expect?
A: The default strictness is ALL, which produces a cartesian product over matched rows. If a left row matches three right rows, you get three result rows. This is correct ClickHouse behavior, not a bug. Use INNER ANY JOIN to return at most one result row per left-table row, or deduplicate the right-side table before the join.

Q: Why are unmatched columns 0 or empty string instead of NULL in my LEFT JOIN?
A: ClickHouse defaults to join_use_nulls = 0, which fills unmatched cells with the data type's default value. Set join_use_nulls = 1 to get NULL for unmatched cells (ANSI SQL behavior), but be aware this converts affected columns to Nullable types and incurs a performance cost.

Q: Why does my ASOF JOIN fail with an algorithm error?
A: ASOF JOIN is only supported by the hash and full_sorting_merge algorithms. Since ClickHouse 24.11 the default join_algorithm (the direct,parallel_hash,hash priority list) selects parallel_hash for most equi-joins, and parallel_hash does not support ASOF. Set join_algorithm = 'hash' or join_algorithm = 'full_sorting_merge' explicitly before running an ASOF JOIN.

Q: What is the difference between LEFT SEMI JOIN and INNER JOIN with DISTINCT?
A: Both return left-table rows that have a match in the right table. LEFT SEMI JOIN is more efficient because it stops searching after the first match and never materializes right-side column values. INNER JOIN with DISTINCT first generates all matching row combinations, then deduplicates, doing more work for the same result. Use LEFT SEMI JOIN whenever you only need to check existence.

Q: Can I use NULL values as join keys?
A: By default, NULL does not match NULL in join key columns. Rows with NULL join keys are effectively unmatched. To match NULL with NULL, use isNotDistinctFrom(a.key, b.key) in the ON clause instead of a.key = b.key. Note that inequality conditions in the ON clause are not supported when join_use_nulls = 1.

Q: When should I use GLOBAL JOIN on a distributed cluster?
A: Use GLOBAL JOIN when the right-side table is small (a lookup table, a filtered subquery, a small dimension table) and you need to avoid each shard independently querying the right-side distributed table. Without GLOBAL, a distributed-to-distributed join causes each shard to fan out to all other shards, resulting in N-squared network traffic. GLOBAL JOIN is not appropriate for large right-side tables because broadcasting a large dataset to all shards is expensive.