The "DB::Exception: Set size limit exceeded" error in ClickHouse occurs when an IN clause or a Set data structure grows beyond the maximum allowed size. The SET_SIZE_LIMIT_EXCEEDED error code is triggered when the number of elements or total bytes in a set exceeds max_rows_in_set or max_bytes_in_set, respectively.
Impact
The query is aborted when the set being built exceeds the configured limit. This commonly affects queries with large IN clauses, subqueries used with IN that return too many rows, and global IN operations in distributed queries. Applications that rely on dynamic IN-list filtering may fail unexpectedly when the filter set grows beyond the threshold.
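The failure mode described above can be reproduced deliberately by lowering the limit for a session. A minimal sketch using ClickHouse's built-in `numbers()` table function (the specific limit and row counts are arbitrary, chosen only for illustration):

```sql
-- Deliberately low limit to demonstrate the failure mode
SET max_rows_in_set = 100;

-- The subquery yields 1000 rows, so building the IN set
-- exceeds max_rows_in_set and raises SET_SIZE_LIMIT_EXCEEDED
SELECT count()
FROM numbers(10000)
WHERE number IN (SELECT number FROM numbers(1000));
```

Running this with the default (unlimited) settings succeeds, which is why the error typically surfaces only after limits are configured in a profile.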
Common Causes
- A subquery used with `IN` returns more rows than `max_rows_in_set` allows
- `GLOBAL IN` operations that transfer large sets across distributed nodes
- Large explicit `IN` lists embedded in the query text
- Queries that use `IN (SELECT ...)` against high-cardinality columns without proper filtering
- User profile settings that impose restrictive set size limits
Troubleshooting and Resolution Steps
Check the current set size limits:
```sql
SELECT name, value
FROM system.settings
WHERE name IN ('max_rows_in_set', 'max_bytes_in_set', 'set_overflow_mode');
```

Increase the limits if the set size is justified:

```sql
SET max_rows_in_set = 10000000;
SET max_bytes_in_set = 1000000000;

SELECT * FROM events WHERE user_id IN (SELECT user_id FROM active_users);
```

Filter the subquery to reduce the number of elements in the set:

```sql
-- Instead of:
SELECT * FROM events WHERE user_id IN (SELECT user_id FROM users);

-- Add filters to the subquery:
SELECT * FROM events
WHERE user_id IN (
    SELECT user_id FROM users WHERE created_at >= '2025-01-01'
);
```

Use a JOIN instead of IN for large sets, which may be more memory-efficient:

```sql
SELECT e.*
FROM events e
INNER JOIN active_users a ON e.user_id = a.user_id;
```

Change the overflow mode to silently truncate the set instead of throwing an error:

```sql
SET set_overflow_mode = 'break';
SELECT * FROM events WHERE user_id IN (SELECT user_id FROM users);
```

With `break`, the set is capped at the limit and the query proceeds with an incomplete set. Use this with caution, as it may produce incorrect results.

For distributed queries using GLOBAL IN, consider whether the set can be precomputed on a smaller subset:

```sql
-- Create a temporary table with the filter set
CREATE TEMPORARY TABLE filter_ids AS
SELECT DISTINCT user_id FROM users WHERE region = 'US';

-- Use it in the distributed query
SELECT * FROM events_distributed
WHERE user_id GLOBAL IN (SELECT user_id FROM filter_ids);
```
Best Practices
- Filter subqueries used with IN as aggressively as possible to keep set sizes small.
- Prefer JOINs over IN for large filter sets, as JOINs can use more memory-efficient algorithms.
- Set `max_rows_in_set` and `max_bytes_in_set` at the user profile level to prevent accidentally large sets.
- Avoid using `GLOBAL IN` with unfiltered subqueries in distributed setups, as the entire set must be transferred to every shard.
- When using `set_overflow_mode = 'break'`, be aware that results may be incomplete.
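The profile-level recommendation above can be implemented with ClickHouse's SQL-driven access control, assuming it is enabled on the server. A minimal sketch, where the profile name `restricted_sets` and the role `analytics_users` are placeholders:

```sql
-- Hypothetical profile; names are placeholders
CREATE SETTINGS PROFILE restricted_sets
SETTINGS
    max_rows_in_set = 1000000,
    max_bytes_in_set = 100000000,
    set_overflow_mode = 'throw'
TO analytics_users;
```

Attaching limits to a profile rather than individual sessions ensures runaway IN sets fail fast for every query issued by the covered users.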
Frequently Asked Questions
Q: What are the default values for set size limits?
A: By default, max_rows_in_set and max_bytes_in_set are set to 0 (unlimited). The limits must be explicitly configured to trigger this error.
Q: What is the difference between IN and GLOBAL IN regarding set limits?
A: With IN, the subquery is executed on each shard independently. With GLOBAL IN, the subquery is executed once on the initiator node and the resulting set is distributed to all shards. GLOBAL IN is more susceptible to set size limits because the entire set must fit in memory on the initiator and be transferred over the network.
Q: Does set_overflow_mode = 'break' guarantee correct results?
A: No. When the set is truncated, some values that should match will be excluded from the filter. This means the query may return fewer rows than expected. Only use this mode when approximate results are acceptable.
Q: Can I use a dictionary instead of IN for large filter sets?
A: Yes. For stable filter sets, loading them into a ClickHouse dictionary and using dictHas() can be more efficient than building a set at query time, especially for frequently reused filter sets.
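As a hedged sketch of that approach, assuming a source table `allowed_users` with a `user_id` column (all object names here are placeholders):

```sql
-- Hypothetical dictionary over a stable filter table
CREATE DICTIONARY allowed_users_dict
(
    user_id UInt64
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'allowed_users'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- dictHas performs a hash lookup per row instead of
-- materializing an IN set at query time
SELECT count()
FROM events
WHERE dictHas('allowed_users_dict', user_id);
```

The dictionary is built once and refreshed on its LIFETIME schedule, so repeated queries avoid rebuilding the set and are not subject to max_rows_in_set.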