ClickHouse stores arrays as first-class columns and offers a rich set of functions to process them in place. Most of these functions are cheap, but a few patterns — particularly arrayMap with index-based access and arrayJoin expansion — can consume tens of gigabytes of RAM on data that occupies only a few hundred megabytes on disk. The cause is almost always a hidden row-multiplication step that duplicates columns across every array element.
This guide explains why array functions blow up memory, how block size amplifies the effect, and how to rewrite queries so they run in a fraction of the RAM. For the semantics of the ARRAY JOIN clause itself, see the dedicated ClickHouse ARRAY JOIN guide.
Why Array Functions Use So Much Memory
The key insight is that several "in-place" array functions are not actually in-place under the hood. To evaluate a higher-order function like arrayMap, ClickHouse may temporarily perform an arrayJoin-like operation: it unnests the array into rows, evaluates the lambda, then folds the result back into an array. While the array is unnested, every column participating in the expression is duplicated once per array element.
Memory consumption therefore scales roughly with:
column_count × array_length × block_size
This is why a query over a modestly sized table can require gigabytes of memory: with the default block of tens of thousands of rows and arrays of a hundred elements each, a single block expands to millions of intermediate rows, and each participating column is materialized for every one of them.
The same mechanism underlies the arrayJoin() function and the ARRAY JOIN clause: both physically expand one row into many. The difference is that arrayJoin/ARRAY JOIN are meant to produce element-level rows, whereas an accidental expansion inside arrayMap produces those rows only to immediately collapse them again — paying the full memory cost for nothing.
The Block Size Multiplier
ClickHouse processes queries in blocks. The max_block_size setting (default 65536) controls how many rows are read into a block before processing. Because array expansion happens per block, peak memory is proportional to max_block_size:
- A block of
65536rows, each holding an array of 100 elements, expands to roughly 6.5 million intermediate rows. - Halving
max_block_sizeroughly halves the peak memory of the expansion step.
This makes max_block_size the single most effective lever when an array query is close to a memory limit but you cannot rewrite it:
SELECT count()
FROM large_table
ARRAY JOIN big_array AS elem
WHERE elem > 0
SETTINGS max_block_size = 8192;
Reducing max_block_size from the default 65536 to 8192 (roughly an 8x reduction) cuts the expansion memory by approximately the same factor, at the cost of some throughput because there are more, smaller blocks to schedule. Note that the arrayJoin() function historically did not always split its output by max_block_size — a single input chunk produced a single (potentially huge) output chunk — so the most reliable control over expansion size is reducing the input block size, not relying on output splitting.
The Most Common Memory Trap: Index-Based arrayMap
The single most expensive — and most avoidable — array pattern is using arrayMap with positional index access. Suppose you want to multiply two parallel arrays element-wise. The naive approach indexes into both arrays using arrayEnumerate to generate positions:
-- Anti-pattern: index-based access forces a large internal expansion
SELECT arrayMap(x -> array_1[x] * array_2[x], arrayEnumerate(array_1))
FROM big_arrays;
Here the lambda receives an index x and then dereferences array_1[x] and array_2[x]. To evaluate this, ClickHouse must keep both full arrays available for every index, which triggers the large internal expansion described above. On real data this pattern has been measured at multiple gigabytes of memory and tens of seconds of runtime.
The fix is to pass the arrays directly to a multi-argument lambda so ClickHouse iterates element-wise without index dereferencing:
-- Correct: pass arrays directly; the lambda receives elements, not indexes
SELECT arrayMap((x, y) -> x * y, array_1, array_2)
FROM big_arrays;
The rewritten query is functionally identical but runs with a small fraction of the memory and has been measured at roughly 100x faster on the same data. The rule is simple: never index into an array from inside a lambda when you can pass that array as a lambda argument instead. Every higher-order array function (arrayMap, arrayFilter, arrayCount, arrayExists, arraySum) accepts multiple parallel array arguments for exactly this reason.
Index-based vs. direct-argument arrayMap
| Aspect | arrayMap(x -> a[x] * b[x], arrayEnumerate(a)) |
arrayMap((x, y) -> x * y, a, b) |
|---|---|---|
| Lambda receives | Position index | Array elements directly |
| Internal expansion | Large — both arrays kept per index | Minimal — element-wise iteration |
| Typical memory | Gigabytes on large arrays | Hundreds of MB |
| Relative speed | Baseline | ~100x faster |
| When forced | Only when you genuinely need cross-position logic | Default for element-wise work |
When You Actually Need Expansion: Prefer Array Functions Over ARRAY JOIN
A large class of queries use ARRAY JOIN or arrayJoin() simply to test, count, or filter array contents — then aggregate the result back down. Every one of those expansions can usually be replaced with an in-place array function that never multiplies rows:
| Goal | Avoid (row expansion) | Use instead (in-place) |
|---|---|---|
| Does any element match? | ARRAY JOIN + WHERE |
arrayExists(x -> cond, arr) |
| How many elements match? | ARRAY JOIN + count() |
arrayCount(x -> cond, arr) |
| Sum / aggregate elements | ARRAY JOIN + sum() |
arraySum(arr) / arrayReduce('sum', arr) |
| Keep matching elements | ARRAY JOIN + re-groupArray |
arrayFilter(x -> cond, arr) |
| Is a value present? | ARRAY JOIN + WHERE x = v |
has(arr, v) |
For example, counting rows where any tag exceeds a threshold does not need expansion:
-- Expansion: multiplies rows, then collapses with count
SELECT count()
FROM events
ARRAY JOIN scores AS s
WHERE s > 90;
-- In-place: no row multiplication, far less memory
SELECT count()
FROM events
WHERE arrayExists(x -> x > 90, scores);
Reserve ARRAY JOIN and arrayJoin() for cases where you genuinely need element-level rows in the output — joining elements against other tables, grouping by element value, or returning one row per element to the client. See ClickHouse ARRAY JOIN for the full semantics and the array functions vs. window functions guide for choosing between approaches.
Reduce Rows Before You Expand
When expansion is unavoidable, shrink the input first so fewer rows enter the expansion step:
- Filter on non-array columns with
PREWHEREorWHEREbeforeARRAY JOIN. The optimizer can push predicates that reference only non-array columns above the expansion, so they cut the row count before it multiplies. See PREWHERE optimization. - Aggregate immediately after expansion. A
GROUP BYdirectly followingARRAY JOINlets ClickHouse aggregate incrementally instead of materializing the full expanded set. Building large arrays back up with groupArray only to re-expand them is a common round-trip to avoid. - Avoid stacked
arrayJoin()calls. TwoarrayJoin()functions in the sameSELECTproduce a Cartesian product of rows — a frequent cause of accidental result-set (and memory) explosion. Use a singleARRAY JOINclause with multiple comma-separated arrays for element-wise (zip) expansion instead.
Diagnosing Array-Function Memory
When a query trips a memory limit, identify whether array expansion is the cause. Unnest ProfileEvents from system.query_log to see how many rows were read versus produced:
SELECT
memory_usage,
PE.keys AS metric,
PE.values AS value
FROM system.query_log
ARRAY JOIN ProfileEvents AS PE
WHERE query_id = 'your-query-id'
AND type = 2 -- QueryFinish
AND event_date = today()
AND PE.keys IN ('SelectedRows')
ORDER BY value DESC;
A memory_usage far larger than the on-disk size of the data scanned, combined with a large gap between rows read and result rows, is the signature of array expansion. For a server-wide view of where memory is going, see ClickHouse memory usage diagnosis; for the specific MEMORY_LIMIT_EXCEEDED error, see memory limit exceeded.
Best Practices
- Pass arrays as lambda arguments, never index into them.
arrayMap((x, y) -> x * y, a, b)instead ofarrayMap(x -> a[x] * b[x], arrayEnumerate(a)). This is the highest-impact single change. - Use in-place array functions when you only need an aggregate.
arrayExists,arrayCount,arraySum, andarrayFilternever multiply rows. - Reserve
ARRAY JOINfor true element-level output. Joining elements against tables, grouping by element value, or returning per-element rows. - Lower
max_block_sizewhen a query is near a memory limit and cannot be rewritten. Peak expansion memory scales linearly with block size. - Filter early. Push non-array predicates above the expansion with
PREWHERE/WHERE. - Never stack
arrayJoin()calls in oneSELECTunless you intend a Cartesian product.
Common Issues
MEMORY_LIMIT_EXCEEDEDon a small table. Almost always index-basedarrayMapor an unfilteredARRAY JOINon long arrays. Rewrite to direct lambda arguments or reducemax_block_size.- Query is correct but extremely slow. Index dereferencing inside lambdas is both memory- and CPU-heavy; the direct-argument form is dramatically faster.
- OOM only on certain partitions. Array length is data-dependent; partitions with long arrays expand more. Filter to the offending partitions and inspect array lengths with
length(arr). Sizes of arrays don't matchwhen using multiple arrays. Parallel-array lambdas and multi-arrayARRAY JOINrequire equal lengths per row. See sizes of arrays don't match.
How Pulse Helps
Pulse monitors ClickHouse query performance and memory consumption, and flags queries where array processing is the dominant cost. When an arrayMap or arrayJoin query spikes memory, Pulse surfaces the query, its expansion ratio (rows read versus rows produced), and its peak memory_usage — making it straightforward to spot index-based lambda anti-patterns, unfiltered ARRAY JOINs, and queries that would benefit from a lower max_block_size or an in-place array-function rewrite, before they cause an out-of-memory failure in production.
Frequently Asked Questions
Q: Why does arrayMap use so much more memory than the array's size on disk?
A: To evaluate the lambda, ClickHouse can internally unnest the array into rows and duplicate every participating column per element. Peak memory scales with column_count × array_length × block_size, so a few hundred MB on disk can require gigabytes during processing — especially with index-based access like arr[x] inside the lambda.
Q: How do I make arrayMap use less memory?
A: Pass arrays directly as lambda arguments instead of indexing into them. Rewrite arrayMap(x -> a[x] * b[x], arrayEnumerate(a)) as arrayMap((x, y) -> x * y, a, b). This avoids the large internal expansion and has been measured at roughly 100x faster with a fraction of the memory.
Q: Does arrayJoin respect max_block_size?
A: Not for output splitting in older versions — a single input chunk could produce one very large output chunk regardless of max_block_size. The reliable lever is reducing the input max_block_size so fewer rows enter the expansion, which lowers peak memory proportionally.
Q: When should I use ARRAY JOIN versus an array function?
A: Use array functions (arrayExists, arrayCount, arrayFilter, arraySum) when you only need an aggregate or transformation of the array contents — they never multiply rows. Use ARRAY JOIN or arrayJoin() only when you genuinely need element-level rows in the output, such as grouping by element value or joining elements against other tables.
Q: Why do two arrayJoin() calls in one query explode my row count?
A: Multiple arrayJoin() function calls in the same SELECT produce a Cartesian product of their elements, not an element-wise (zip) expansion. To expand parallel arrays together, use a single ARRAY JOIN clause with comma-separated arrays instead.
Q: How do I confirm array expansion is the memory cause?
A: Query system.query_log and compare the memory_usage column and SelectedRows ProfileEvent against the result row count. A memory figure far above the scanned data size, plus a large gap between rows read and rows produced, points to array expansion.