ClickHouse SEMI JOIN

ClickHouse SEMI JOIN is a join strictness that filters one table based on key existence in another without producing a cartesian product and without duplicating rows. Where INNER JOIN combines columns from both sides and multiplies rows when multiple matches exist, SEMI JOIN returns each qualifying row from the source table exactly once. It is formally described in ClickHouse documentation as "an allowlist on join keys, without producing a cartesian product."

The word SEMI itself is the strictness modifier. You do not combine it with ALL or ANY — the valid forms are LEFT SEMI JOIN and RIGHT SEMI JOIN, nothing more.

Syntax

SELECT <columns_from_left_table>
FROM left_table
LEFT SEMI JOIN right_table ON left_table.key = right_table.key;

SELECT <columns_from_right_table>
FROM left_table
RIGHT SEMI JOIN right_table ON left_table.key = right_table.key;

LEFT SEMI JOIN returns only columns from the left table, one row per left-side row that has at least one matching key in the right table. RIGHT SEMI JOIN does the inverse: it returns only columns from the right table, one row per right-side row that has at least one matching key in the left table. In both cases, no columns from the non-returning side appear in the result, and no row from the returning side is duplicated regardless of how many matches exist on the other side.

How It Works

The hash join algorithm — the default implementation — builds an in-memory hash set from the right-side table's join keys. For SEMI JOIN, ClickHouse only needs to check membership (does this key exist?), not retrieve and combine column values. This makes SEMI JOIN naturally efficient: the hash set can be smaller than a full hash table because it stores keys only, and once a match is found for a given left-side row, no further matching is done for that row.

Since ClickHouse 24.11, the default join_algorithm ('default', which resolves to 'direct,parallel_hash,hash') uses parallel_hash where applicable — a multithreaded variant of hash join that supports SEMI JOIN fully. The older hash algorithm also supports SEMI JOIN, as does grace_hash (which spills to disk when memory is exceeded).

partial_merge supports LEFT SEMI and INNER SEMI but not RIGHT SEMI; full_sorting_merge does not support SEMI JOIN at all. This matters under join_algorithm = 'auto', which starts with hash and falls back to partial_merge when memory is exceeded: a LEFT SEMI JOIN still runs after the fallback, but a RIGHT SEMI JOIN fails. To keep memory bounded with full SEMI JOIN support regardless of direction, set join_algorithm = 'grace_hash' (which supports SEMI JOIN and can spill to disk).

Practical Examples

Existence filter without row duplication

The canonical use case: find all actors who appeared in at least one role in 2023, returning each actor exactly once even if they had multiple roles.

SELECT
    a.first_name,
    a.last_name
FROM actors AS a
LEFT SEMI JOIN roles AS r ON a.id = r.actor_id
WHERE toYear(r.created_at) = 2023
ORDER BY a.id ASC
LIMIT 10;

An INNER JOIN on the same tables would return one row per role, not one row per actor. LEFT SEMI JOIN eliminates the duplication without requiring a subsequent DISTINCT or GROUP BY.

RIGHT SEMI JOIN for filtering the right table

Find all products that have been ordered at least once:

SELECT
    p.product_id,
    p.name,
    p.category
FROM orders AS o
RIGHT SEMI JOIN products AS p ON o.product_id = p.product_id;

This returns only rows from products and only those with at least one matching row in orders. You could rewrite this as a LEFT SEMI JOIN by swapping table positions, which is often clearer. RIGHT SEMI JOIN is useful when the query structure or CTE chain makes swapping tables inconvenient.

Inequality conditions in the ON clause

Since ClickHouse 24.8, SEMI JOIN supports inequality conditions involving columns from both sides of the join. Since 24.12, this is enabled by default (the allow_experimental_join_condition setting that previously gated this was removed).

-- Find employees whose salary exceeds the 90th percentile benchmark for their department
SELECT e.employee_id, e.salary
FROM employees AS e
LEFT SEMI JOIN salary_benchmarks AS b
    ON e.department = b.department AND e.salary > b.p90_salary;

On ClickHouse versions earlier than 24.8, this query would fail. On 24.5–24.11, it required SET allow_experimental_join_condition = 1.

INNER JOIN vs SEMI JOIN: The Key Difference

The distinction matters whenever the join key is not unique on the right side. Consider users and orders, where a single user can have many orders.

-- INNER JOIN: returns one row per order — user 42 with 3 orders appears 3 times
SELECT u.user_id, u.name
FROM users AS u
INNER JOIN orders AS o ON u.user_id = o.user_id;

-- LEFT SEMI JOIN: returns one row per user — user 42 appears exactly once
SELECT u.user_id, u.name
FROM users AS u
LEFT SEMI JOIN orders AS o ON u.user_id = o.user_id;

If your goal is "which users have placed at least one order," LEFT SEMI JOIN expresses that intent directly. Using INNER JOIN requires a follow-up DISTINCT or GROUP BY to achieve the same result, adding cost and obscuring intent.

Comparison to IN Subqueries and EXISTS

LEFT SEMI JOIN is semantically equivalent to a WHERE col IN (SELECT ...) subquery and to a correlated EXISTS subquery.

-- All three forms are semantically equivalent:

-- Form 1: LEFT SEMI JOIN
SELECT u.user_id, u.name
FROM users AS u
LEFT SEMI JOIN orders AS o ON u.user_id = o.user_id;

-- Form 2: IN subquery
SELECT user_id, name
FROM users
WHERE user_id IN (SELECT user_id FROM orders);

-- Form 3: EXISTS subquery (requires allow_experimental_correlated_subqueries=1 in current versions)
SELECT user_id, name
FROM users AS u
WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.user_id = u.user_id);

The IN subquery builds a hash set of keys and checks membership, which is the same operation SEMI JOIN performs. For single-column existence checks, IN is idiomatic ClickHouse and often the most readable form. SEMI JOIN becomes preferable when the join condition spans multiple columns — expressing a multi-column existence check with IN requires tuple syntax (WHERE (a, b) IN (SELECT a, b FROM ...)), which some find less readable than an explicit JOIN ON clause. SEMI JOIN also makes the join semantics explicit in the query plan, which can aid debugging with EXPLAIN PLAN.

Note: correlated EXISTS subqueries require allow_experimental_correlated_subqueries = 1 and have known limitations in recent versions — the SEMI JOIN or IN forms are more reliable in production.

Performance Considerations

Use EXPLAIN PLAN to inspect which join algorithm ClickHouse selects for a SEMI JOIN query:

EXPLAIN PLAN
SELECT u.user_id, u.name
FROM users AS u
LEFT SEMI JOIN orders AS o ON u.user_id = o.user_id;

To diagnose SEMI JOIN query performance after execution, query system.query_log:

SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage,
    ProfileEvents['HashJoinRows'] AS hash_join_rows,
    ProfileEvents['HashJoinBytesUsed'] AS hash_join_bytes
FROM system.query_log
WHERE
    type = 'QueryFinish'
    AND query ILIKE '%SEMI JOIN%'
    AND event_date >= today() - 1
ORDER BY query_duration_ms DESC
LIMIT 10;

HashJoinRows and HashJoinBytesUsed show how large the hash set built from the right-side table was. If HashJoinBytesUsed is large, consider filtering the right-side table before the join to reduce hash set size, or switch to grace_hash to allow spilling to disk.

For distributed queries, GLOBAL LEFT SEMI JOIN sends the right-side subquery to the initiator node, builds the hash table there, and distributes it to all shards. Keep the filtering (right-side) table as small as possible when using GLOBAL to avoid distributing a large hash table across all replicas.

Known Issues and Version Notes

join_use_nulls interaction: When join_use_nulls = 1, unmatched outer join rows are filled with NULL rather than default type values. SEMI JOIN does not return columns from the non-matching side, so this setting has no practical effect on SEMI JOIN results. Note that older ClickHouse versions (before mid-2023) had stability issues combining SEMI LEFT JOIN with join_use_nulls = 1; current versions handle this combination without problems.

SELECT * with SEMI JOIN: A known pitfall is that SELECT * with SEMI JOIN or ANTI JOIN can incorrectly include columns from the non-returning side in the result set. Semantically those columns are meaningless (they are not part of SEMI JOIN output). Explicitly name the columns you need rather than relying on SELECT *.

partial_merge fallback: With join_algorithm = 'auto', ClickHouse falls back from hash to partial_merge if the hash table exceeds available memory. partial_merge supports LEFT/INNER SEMI but not RIGHT SEMI (and full_sorting_merge supports no SEMI at all), so a RIGHT SEMI JOIN can fail under memory pressure. Use grace_hash if you need spill-to-disk behavior with SEMI JOIN in any direction.

Inequality conditions: Supported experimentally since 24.5 (allow_experimental_join_condition = 1), extended to SEMI JOIN in 24.8, and made non-experimental in 24.12. Not available on 23.x.

NULL join keys: Standard SQL semantics apply — NULL != NULL, so rows with NULL join keys never match in a SEMI JOIN.

How Pulse Helps

Pulse monitors ClickHouse query performance and surfaces slow or memory-intensive JOIN operations automatically, including those caused by incorrect algorithm selection or unexpectedly large hash tables in SEMI JOIN queries. When a join exceeds memory thresholds or falls back to an unsupported algorithm, Pulse flags the issue with context from system.query_log so you can diagnose and fix it without manual log trawling.

Frequently Asked Questions

Q: Can I use ALL LEFT SEMI JOIN or ANY LEFT SEMI JOIN?
A: No. SEMI is itself the strictness modifier. It cannot be combined with ALL or ANY. The valid syntax is LEFT SEMI JOIN or RIGHT SEMI JOIN. Attempting ALL LEFT SEMI JOIN will produce a parse error.

Q: Why does my SEMI JOIN query fail with an error about the algorithm not being supported?
A: A common cause is join_algorithm = 'full_sorting_merge' (which does not support SEMI JOIN at all), or a RIGHT SEMI JOIN running on partial_merge — including when join_algorithm = 'auto' falls back to partial_merge under memory pressure (partial_merge supports LEFT/INNER SEMI but not RIGHT SEMI). Set join_algorithm = 'hash', 'parallel_hash', or 'grace_hash' explicitly to avoid this.

Q: Is LEFT SEMI JOIN faster than WHERE col IN (SELECT ...)?
A: They perform the same logical operation (hash-based key existence check) and are usually comparable in performance. IN subqueries have historically been measured as slightly faster than INNER JOIN rewrites in some benchmarks, but SEMI JOIN is specifically optimized for existence filtering and avoids materializing right-side columns. Performance differences are typically small; choose the form that is most readable for your query.

Q: Does SEMI JOIN work in distributed queries with GLOBAL?
A: Yes. GLOBAL LEFT SEMI JOIN is supported. The right-side subquery executes on the initiator node, the result is broadcast to all shards as a hash table, and each shard performs the semi-join locally. For large right-side datasets this broadcast can be expensive — keep the filtering table small.

Q: What happens if the right-side table has NULL values in the join key column?
A: NULL values never match in SEMI JOIN. Standard SQL semantics apply: NULL != NULL. A left-side row whose join key is NULL will not appear in the result, and right-side rows with NULL join keys are effectively invisible as filters.

Q: Can I reference right-table columns in the SELECT list of a LEFT SEMI JOIN?
A: No — semantically, LEFT SEMI JOIN returns only left-table columns. Attempting to reference right-table columns should produce an error or (due to a known pitfall in some versions) return meaningless data. If you need columns from both tables, use INNER JOIN or ANY JOIN instead.