ClickHouse GROUP BY: Syntax, Memory Optimization, and Alternatives

Q: How do I let a GROUP BY spill to disk?

Set max_bytes_before_external_group_by to a byte budget (e.g. '8G' ); it defaults to 0/disabled. ClickHouse then dumps buckets to disk and merges them one at a time. It is correct but slower, so use it only after exhausting query-level optimizations.

Q: Does optimize_aggregation_in_order always reduce memory?

Only when the GROUP BY key is a prefix of the table's ORDER BY key, or an injective function of such a prefix. Then ClickHouse finalizes groups as it streams sorted data instead of holding them all in memory. It can be slower than the default hash path, so benchmark both.

Q: When should I use DISTINCT or LIMIT 1 BY instead of GROUP BY?

Use DISTINCT for plain deduplication (it can stream and stop early under LIMIT ), and LIMIT 1 BY key for one row per key when the data is already sorted appropriately. Reach for GROUP BY when you actually need aggregate functions over each group.

GROUP BY is the workhorse of analytical queries in ClickHouse, and it is also the single most common cause of out-of-memory errors. ClickHouse aggregates in RAM using hash tables, so the memory footprint of a GROUP BY is driven almost entirely by the number of distinct grouping keys and the cost of the aggregate-function states it has to keep for each one.

This guide covers the GROUP BY syntax and modifiers, explains how aggregation works internally, and then focuses on the settings and query rewrites that keep large aggregations within their memory budget. It closes with the cheaper alternatives — DISTINCT, LIMIT BY, and window functions — that often replace GROUP BY entirely.

Syntax and Modifiers

The grouping key is a list of expressions. Every column in SELECT, HAVING, and ORDER BY must appear either in the grouping key or inside an aggregate function.

SELECT
    country,
    toStartOfDay(event_time) AS day,
    count() AS events,
    uniq(user_id) AS users
FROM events
GROUP BY country, day
ORDER BY day, country;

A few ClickHouse-specific behaviors are worth knowing:

Empty GROUP BY (no clause at all, only aggregates in SELECT) returns exactly one row.
NULL handling differs from standard SQL. ClickHouse treats NULL as a value and considers NULL == NULL when grouping, so NULLs collapse into a single group rather than being dropped.

Subtotal Modifiers

ClickHouse supports the standard rollup/cube modifiers for hierarchical reporting:

Modifier	What it produces
`WITH ROLLUP`	Subtotals for each prefix of the key list (e.g. `(a,b)`, `(a)`, `()`), right-to-left, plus a grand total
`WITH CUBE`	Subtotals for every combination (power set) of the grouping keys
`WITH TOTALS`	One extra row with aggregates computed across all rows, returned as a separate totals block

SELECT country, device, count()
FROM events
GROUP BY country, device WITH ROLLUP
ORDER BY country, device;

In rollup/cube subtotal rows, the omitted key columns are filled with the column's default value (0, empty string, etc.). Use the `GROUPING()` function to distinguish a genuine subtotal row from a real value that happens to equal the default. You can also specify key sets explicitly with GROUPING SETS.

How GROUP BY Uses Memory

ClickHouse aggregates with non-blocking hash tables: each processing thread builds its own hash table, then the per-thread tables are merged at the end. This avoids lock contention but means the same key may exist in several thread-local tables before the final merge, which is why memory scales with max_threads as well as cardinality.

Two factors dominate the footprint:

The number and size of distinct keys. Each unique combination of grouping-key values occupies a hash-table slot. Wide string keys cost far more than fixed-width integer keys.
The aggregate-function states. Most functions (count, sum, avg) keep a tiny fixed-size state. But groupArray, uniqExact, quantileExact, sequenceMatch, and the *Map variants keep state that grows with the data, and they are the usual culprits behind a memory blow-up.

Two-Level Aggregation

When the number of keys gets large, ClickHouse automatically switches to two-level aggregation: it splits the hash table into 256 buckets (using one byte of the key hash). This parallelizes the final merge across threads and is the prerequisite for spilling to disk. It kicks in when either threshold is crossed:

Setting	Default	Meaning
`group_by_two_level_threshold`	`100000`	Switch to two-level once distinct keys exceed this count
`group_by_two_level_threshold_bytes`	`50000000`	Switch once the aggregation state exceeds ~50 MB

Setting either to 0 disables that trigger. To force two-level immediately (useful when you already know cardinality is high), set group_by_two_level_threshold = 1.

Memory Optimization Techniques

The following techniques are ordered roughly from "rewrite the query" (cheap, large wins) to "spill to disk" (last resort).

1. Filter Before You Aggregate

The cheapest aggregation is one over fewer rows. Push selective WHERE conditions — especially those that hit the primary key or a data-skipping index — before the GROUP BY. Pre-filtering with a subquery on a selective dimension routinely cuts aggregation memory by several-fold.

2. Shrink the Grouping Key

Key width directly drives hash-table size. Group by fixed-width types instead of strings where you can:

-- Instead of grouping by a wide String key, group by a compact integer
SELECT cityHash64(user_id) AS uid_hash, count()
FROM events
GROUP BY uid_hash;

Casting a string key to UInt32/UInt64 or FixedString(N) measurably reduces memory. One caveat: grouping by two or more LowCardinality columns at once forces ClickHouse onto the slow serialized aggregation method — in that case converting them to FixedString is often faster and lighter.

3. Exploit Sort Order with optimize_aggregation_in_order

If the GROUP BY key is a prefix of the table's ORDER BY (sort) key, or is an injective function of such a prefix, ClickHouse can aggregate in order, finalizing each group as it goes instead of holding every group in memory at once:

SELECT toStartOfHour(timestamp) AS hour, avg(value)
FROM metrics
GROUP BY hour
SETTINGS optimize_aggregation_in_order = 1;

This can reduce peak memory dramatically (the Altinity KB shows a drop from ~969 MiB to ~169 MiB on a matching key). It is sometimes slower than the hash-based path because it gives up some parallelism, so benchmark both.

4. Lower max_threads

Because each thread keeps its own hash table, halving the thread count roughly halves the peak aggregation memory — at the cost of proportionally longer runtime. A reasonable knob when a query is close to the limit:

SELECT key, sum(value) FROM t GROUP BY key
SETTINGS max_threads = 2;

5. Split High-Cardinality Work with UNION ALL or Hashing

Splitting one huge aggregation into independent slices keeps each slice's state small:

-- Process the key space in shards, then combine
SELECT user_id, count() FROM events WHERE cityHash64(user_id) % 4 = 0 GROUP BY user_id
UNION ALL
SELECT user_id, count() FROM events WHERE cityHash64(user_id) % 4 = 1 GROUP BY user_id
-- ... shards 2 and 3

On a cluster, distributing data by the grouping key (so each node owns a disjoint key range) achieves the same effect at insert time and enables distributed_aggregation_memory_efficient = 1 to reduce memory on the initiator.

6. Approximate Instead of Exact

If you are aggregating to count uniques or compute quantiles, an approximate function uses a fraction of the memory of its exact counterpart. Replace uniqExact with uniq/uniqCombined and quantileExact with quantile. See ClickHouse approximate aggregation methods and count(DISTINCT) for the full menu and accuracy trade-offs.

7. Spill to Disk (Last Resort)

When a query genuinely needs more state than fits in RAM, allow external aggregation:

SELECT key, groupArray(value) FROM t GROUP BY key
SETTINGS max_bytes_before_external_group_by = '8G';

ClickHouse dumps two-level buckets to disk and merges them one bucket (1/256 of the state) at a time, so you only need RAM for a single bucket plus overhead. As an alternative on shared clusters, max_bytes_ratio_before_external_group_by sets the threshold as a fraction of available memory. External aggregation is correct but slow — treat it as a safety net, not a default. The default for max_bytes_before_external_group_by is 0 (disabled).

GROUP BY Alternatives

GROUP BY is not always the right tool. For deduplication and "latest row per key" patterns, lighter clauses exist.

Pattern	Use this	Why
Distinct rows / values, often with `LIMIT`	`DISTINCT`	Streams results and can stop early under `LIMIT`; no aggregate state
First / N rows per key	`LIMIT n BY key`	Streams per-key; cheapest when sort order already deduplicates
Latest row per key (by version/timestamp)	`argMax(col, version)` in a `GROUP BY`, or `LIMIT 1 BY`	`argMax` is exact but memory-heavy; `LIMIT 1 BY` is cheap if rows are pre-sorted
Per-row rankings / running totals without collapsing rows	Window functions	Keeps every input row; `GROUP BY` would collapse them

A common rewrite — "one row per user, keeping the newest" — illustrates the choice:

-- GROUP BY with argMax: exact, but holds state for every user
SELECT user_id, argMax(status, updated_at) AS status
FROM events
GROUP BY user_id;

-- LIMIT 1 BY: streams, very cheap when the table/order already sorts newest-first
SELECT user_id, status
FROM events
ORDER BY user_id, updated_at DESC
LIMIT 1 BY user_id;

If you only need to collapse exact-duplicate rows, prefer DISTINCT (or LIMIT 1 BY for per-key), both of which can stream and terminate early — whereas GROUP BY must finish merging all states before returning.

For repeated dashboard aggregations that always run the same GROUP BY, the best optimization is to not recompute at all: pre-aggregate with a projection or materialized view. See ClickHouse projections.

Common Issues

Memory limit (for query) exceeded on a high-cardinality key. Almost always too many distinct keys or a heavy aggregate state. Shrink the key, filter first, lower max_threads, or enable external aggregation. See memory limit exceeded during aggregation.
groupArray/uniqExact/quantileExact dominating memory. Switch to approximate variants or cap the collected array size.
Two LowCardinality columns in the key causing a slowdown. Cast to FixedString to avoid the serialized aggregation method.
optimize_aggregation_in_order not helping. The GROUP BY key must be a prefix of the table ORDER BY key, or an injective function of such a prefix; otherwise ClickHouse falls back to the hash path.
NULLs collapsing unexpectedly. ClickHouse groups all NULLs together; if you need SQL-standard behavior, handle NULLs explicitly.

Best Practices

Filter before grouping. A selective WHERE on the primary key is the highest-leverage optimization available.
Keep grouping keys narrow and fixed-width. Hash integers, not raw strings.
Align hot GROUP BY keys with the table sort order so optimize_aggregation_in_order can apply.
Use approximate functions when exact uniqueness/quantiles are not required.
Pre-aggregate repeated queries with projections or materialized views instead of grouping the raw table every time.
Reserve external aggregation for genuine spill scenarios — fix the query shape first.
Check whether you need GROUP BY at all — DISTINCT, LIMIT 1 BY, or window functions are often cheaper.

How Pulse Helps

Diagnosing a GROUP BY that runs out of memory means correlating the failing query with its actual peak memory, the cardinality of its keys, and the aggregate functions involved — data that is scattered across system.query_log, system.processes, and the server's memory tracker. Pulse surfaces the queries that consume the most aggregation memory on your cluster, flags high-cardinality GROUP BY patterns and heavy aggregate states, and points to the relevant settings (group_by_two_level_threshold, optimize_aggregation_in_order, max_bytes_before_external_group_by) so you can tune the specific query rather than raising global limits. For broader memory triage, see diagnosing ClickHouse memory usage.

Frequently Asked Questions

Q: Why does my GROUP BY use so much memory?

ClickHouse aggregates in RAM, and memory scales with the number of distinct grouping keys, the width of those keys, and the size of each aggregate-function state. High cardinality plus a heavy function like uniqExact or groupArray is the usual cause. Filter first, shrink the key, or switch to approximate functions.

Q: What is two-level aggregation and when does it trigger?

It splits the aggregation hash table into 256 buckets so the final merge can run in parallel and spill to disk. It activates automatically once distinct keys exceed group_by_two_level_threshold (default 100000) or the state exceeds group_by_two_level_threshold_bytes (default ~50 MB).

Q: How do I let a GROUP BY spill to disk?

Set max_bytes_before_external_group_by to a byte budget (e.g. '8G'); it defaults to 0/disabled. ClickHouse then dumps buckets to disk and merges them one at a time. It is correct but slower, so use it only after exhausting query-level optimizations.

Q: Does optimize_aggregation_in_order always reduce memory?

Only when the GROUP BY key is a prefix of the table's ORDER BY key, or an injective function of such a prefix. Then ClickHouse finalizes groups as it streams sorted data instead of holding them all in memory. It can be slower than the default hash path, so benchmark both.

Q: When should I use DISTINCT or LIMIT 1 BY instead of GROUP BY?

Use DISTINCT for plain deduplication (it can stream and stop early under LIMIT), and LIMIT 1 BY key for one row per key when the data is already sorted appropriately. Reach for GROUP BY when you actually need aggregate functions over each group.

Q: How can I make a repeated dashboard GROUP BY cheaper?

Pre-aggregate it. A projection or materialized view computes the grouped result at insert time so the dashboard reads pre-aggregated rows instead of re-scanning and re-grouping the raw table. See ClickHouse projections.