The "DB::Exception: Distributed IN/JOIN subquery denied" error in ClickHouse is triggered when a query uses an IN or JOIN clause with a subquery that references a distributed table, but does not use the GLOBAL keyword. The error code is DISTRIBUTED_IN_JOIN_SUBQUERY_DENIED. ClickHouse raises this to prevent incorrect results that can arise from executing a distributed subquery locally on each shard.
Impact
This error prevents the query from running at all, which means:
- Queries combining distributed tables through IN or JOIN subqueries will fail until corrected
- Application logic that dynamically constructs queries with subqueries may break when distributed tables are involved
- Developers unfamiliar with ClickHouse's distributed query semantics may be blocked until they understand the GLOBAL keyword requirement
Common Causes
- Using IN with a distributed subquery without GLOBAL -- A query like
SELECT ... FROM dist_table WHERE id IN (SELECT id FROM another_dist_table)executes the subquery on each shard individually, potentially producing incorrect results. - Using JOIN with a distributed subquery without GLOBAL -- Similarly,
SELECT ... FROM dist_table JOIN another_dist_table ON ...can produce wrong results because each shard only sees its local data in the subquery. - The setting
distributed_product_modeis set to 'deny' -- This is the default and safest setting, which blocks these queries outright rather than silently producing wrong results. - ORM or query builder generating standard SQL -- Automated query generators may not be aware of ClickHouse's GLOBAL keyword requirement.
Troubleshooting and Resolution Steps
Add the GLOBAL keyword to your subquery. This is the recommended fix. GLOBAL causes the subquery to execute once on the initiator node, and the result is then broadcast to all shards:
-- Instead of this (will fail): SELECT * FROM distributed_orders WHERE user_id IN (SELECT user_id FROM distributed_users WHERE country = 'US'); -- Use this: SELECT * FROM distributed_orders WHERE user_id GLOBAL IN (SELECT user_id FROM distributed_users WHERE country = 'US');For JOIN queries, apply the same pattern:
-- Instead of this: SELECT o.*, u.name FROM distributed_orders o JOIN distributed_users u ON o.user_id = u.user_id; -- Use this: SELECT o.*, u.name FROM distributed_orders o GLOBAL JOIN distributed_users u ON o.user_id = u.user_id;Alternatively, change the
distributed_product_modesetting if you want ClickHouse to handle this automatically:-- Automatically rewrite IN/JOIN to GLOBAL IN/GLOBAL JOIN: SET distributed_product_mode = 'global'; -- Or allow local execution (use with caution -- may produce wrong results): SET distributed_product_mode = 'local';The
globaloption is generally safe; thelocaloption should only be used when you are certain the subquery data exists identically on every shard.Consider materializing the subquery result into a temporary table if the GLOBAL subquery result set is very large:
CREATE TEMPORARY TABLE tmp_users AS SELECT user_id FROM distributed_users WHERE country = 'US'; SELECT * FROM distributed_orders WHERE user_id IN (SELECT user_id FROM tmp_users);Review your application's query generation layer to ensure it adds GLOBAL where appropriate when targeting distributed tables.
Best Practices
- Always use
GLOBAL INandGLOBAL JOINwhen both the outer query and subquery reference distributed tables. - Be mindful that GLOBAL subqueries materialize their result on the initiator node and broadcast it to shards -- keep the subquery result set reasonably small to avoid excessive memory usage and network transfer.
- Set
distributed_product_mode = 'global'at the user or profile level if your application frequently constructs these queries and you want automatic handling. - For very large subquery results, consider using dictionaries or pre-joined tables instead of GLOBAL JOIN.
- Test distributed queries against multi-shard clusters during development -- errors like this only manifest when data is actually distributed.
Frequently Asked Questions
Q: Why does ClickHouse deny IN/JOIN subqueries on distributed tables by default?
A: Without the GLOBAL keyword, ClickHouse sends the full query (including the subquery) to each shard. Each shard then executes the subquery against its own local data only. If the subquery references a distributed table, each shard sees a different subset of the data, leading to incorrect or incomplete results. The deny mode prevents this silently wrong behavior.
Q: What is the performance impact of GLOBAL IN/JOIN?
A: The subquery executes once on the initiator node, and the full result set is sent to every shard. If the result set is small (thousands to low millions of rows), this is efficient. If it is very large, it can consume significant memory on the initiator and network bandwidth to each shard.
Q: Can I use GLOBAL with non-distributed tables in the subquery?
A: You can, but it is unnecessary. If the subquery references a local (non-distributed) table, there is no ambiguity about which data it accesses, so GLOBAL has no effect. The issue only arises when the subquery targets a distributed table.
Q: What does distributed_product_mode = 'local' actually do?
A: It allows the subquery to execute locally on each shard without the GLOBAL keyword. This is only correct if the subquery table's data is fully replicated on every shard (e.g., a small lookup table). If the data is sharded, the results will be wrong.
Q: Is there a way to detect these issues before they hit production?
A: Yes. Run your query suite against a test cluster with at least two shards and verify the results. Queries that work correctly on a single node may produce wrong results or errors on a multi-shard setup.