ClickHouse Memory Limit Exceeded on GROUP BY and ORDER BY

DB::Exception: Memory limit (for query) exceeded: would use 12.3 GiB
(attempt to allocate chunk of 1073741824 bytes), maximum: 10.00 GiB.
(MEMORY_LIMIT_EXCEEDED, Code: 241)

This error fires when a GROUP BY or ORDER BY query builds up more intermediate state than max_memory_usage allows (default: 10 GB per query). Unlike streaming operators such as simple filters or projections, aggregation and sorting are blocking — ClickHouse must accumulate all relevant rows in memory before producing output. This article focuses on why aggregation and sort operations are the dominant source of Code 241 failures, and how to resolve them without simply raising the memory limit.

What This Error Means

ClickHouse tracks every allocation through a hierarchy of MemoryTracker objects. One tracker exists per query, one per user, and one for the whole server. Whichever fires first produces the (for query), (for user), or (total) qualifier in the error message.

For GROUP BY, ClickHouse builds an in-memory hash table keyed on the grouping columns. The table grows with the number of distinct key combinations and the size of the aggregate state for each key. Aggregate functions differ dramatically in state size:

  • sum, count, min, max, avg — a few bytes per key
  • groupArray with bounded size — proportional to collected elements
  • uniqExact — stores every distinct raw value or hash, growing linearly with cardinality

A GROUP BY user_id over a billion-event table with uniqExact(session_id) as an aggregate can require tens of gigabytes. The default 10 GB limit is frequently insufficient.

For ORDER BY, ClickHouse materializes the full result set in a sort buffer before outputting any row. An ORDER BY event_time DESC with no LIMIT on a wide table returns all rows sorted — the entire result set must fit in memory simultaneously.

By default, neither spill-to-disk mode is active. When the hash table or sort buffer exceeds max_memory_usage, ClickHouse raises MEMORY_LIMIT_EXCEEDED rather than risk an OS-level OOM that would kill the entire server process.

Common Causes

  1. High-cardinality GROUP BY with memory-intensive aggregate functions. Grouping by a column with millions of distinct values (user IDs, session IDs, UUIDs) creates millions of hash table entries. Aggregate functions such as uniqExact, groupArray, or topK store state proportional to the number of distinct values per group, amplifying memory consumption.

  2. ORDER BY without LIMIT. Any query that sorts and returns all rows must buffer the complete result set. On tables with billions of rows or wide schemas, the sort buffer alone can exceed tens of gigabytes.

  3. Distributed query aggregation on the coordinator. In a distributed GROUP BY, each shard performs local partial aggregation and sends its partial states to the coordinator (initiator node). The coordinator must hold and merge all partial states simultaneously. If shards have low overlap in group keys, the coordinator's hash table grows to the union of all partial states — far larger than any single shard's local result.

  4. Complex queries with multiple aggregation stages. Subqueries, HAVING clauses with secondary aggregations, or queries with both GROUP BY and ORDER BY compound memory requirements: each blocking operator has its own buffer.

  5. Aggregate functions applied to large strings or arrays. Aggregating over columns that store large strings (URLs, log lines) or arrays multiplies per-entry memory overhead in the hash table.

  6. Concurrent queries compressing the available per-query budget. max_memory_usage is a per-query ceiling, but max_memory_usage_for_user and max_server_memory_usage are shared ceilings. If multiple heavy aggregations run simultaneously, the server tracker fires even when each individual query is within its per-query limit.

How to Fix

1. Enable external aggregation (spill GROUP BY to disk)

External aggregation writes partial hash table state to disk when a threshold is reached, then merges from disk at the end. This prevents the error at the cost of query speed.

Important sizing rule: the merge phase (stage 2) requires nearly as much memory as the read phase (stage 1). Always set max_memory_usage to at least twice max_bytes_before_external_group_by.

-- Enable external aggregation for a specific query
SELECT user_id, count(), sum(revenue)
FROM events
GROUP BY user_id
SETTINGS
    max_bytes_before_external_group_by = 10000000000,  -- 10 GB spill threshold
    max_memory_usage = 20000000000;                     -- must be ~2x the threshold above

For ClickHouse 24.12 and later, use the ratio-based variant which is simpler to maintain:

-- Ratio-based external aggregation (ClickHouse 24.12+)
SELECT user_id, count(), sum(revenue)
FROM events
GROUP BY user_id
SETTINGS
    max_bytes_ratio_before_external_group_by = 0.5;  -- spill at 50% of max_memory_usage

As of ClickHouse 26.5, max_bytes_ratio_before_external_group_by defaults to 0.5, meaning external aggregation is active by default in new installations. In versions 24.12 through 26.4 the default was 0 (disabled), and before 24.12 the ratio setting did not exist.

Temporary spill files are written to /var/lib/clickhouse/tmp/ (controlled by tmp_path in server config). Ensure sufficient disk space is available — spill volume can be several times the in-memory hash table size due to serialization overhead.

When external aggregation is active, expect query runtime to increase by approximately 3x compared to a pure in-memory execution. Prefer SSD-backed tmp_path to minimize the penalty.

2. Enable external sort (spill ORDER BY to disk)

The same spill-to-disk mechanism applies to ORDER BY:

-- External sort for large ORDER BY
SELECT *
FROM large_table
ORDER BY event_time DESC
SETTINGS
    max_bytes_before_external_sort = 10000000000,
    max_memory_usage = 20000000000;

Or using the ratio variant (ClickHouse 24.12+):

SELECT *
FROM large_table
ORDER BY event_time DESC
SETTINGS
    max_bytes_ratio_before_external_sort = 0.5;

ClickHouse writes sorted blocks to disk, then merges them in a final pass to produce the sorted output. Always combine ORDER BY with a LIMIT where possible — using a heap instead of a full sort is dramatically cheaper:

-- Much cheaper: sort buffer is bounded by LIMIT size
SELECT * FROM large_table ORDER BY event_time DESC LIMIT 1000;

3. Reduce GROUP BY cardinality with approximate functions

Many analytical use cases do not require exact answers. Replacing exact aggregate functions with approximations dramatically lowers per-key state size:

-- Exact: stores every distinct value hash, high memory
SELECT user_id, uniqExact(session_id) AS sessions FROM events GROUP BY user_id;

-- Approximate: uses fixed-size HyperLogLog sketch regardless of cardinality
SELECT user_id, uniqCombined(session_id) AS approx_sessions FROM events GROUP BY user_id;

uniqCombined uses a HyperLogLog sketch with ~3% error. uniqHLL12 offers a smaller fixed-size approximation. For percentile aggregations, use quantile (sampling-based) instead of quantileExact.

4. Pre-aggregate with AggregatingMergeTree

For recurring high-cardinality aggregations, moving the work from query time to insert/merge time via AggregatingMergeTree eliminates the per-query memory spike entirely.

Define the pre-aggregated table:

CREATE TABLE daily_user_stats
(
    date             Date,
    user_id          UInt64,
    total_revenue    AggregateFunction(sum, Float64),
    unique_sessions  AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id);

Populate via a materialized view:

CREATE MATERIALIZED VIEW daily_user_stats_mv
TO daily_user_stats
AS
SELECT
    toDate(event_time) AS date,
    user_id,
    sumState(revenue)      AS total_revenue,
    uniqState(session_id)  AS unique_sessions
FROM raw_events
GROUP BY date, user_id;

Query using -Merge suffix functions:

SELECT
    date,
    user_id,
    sumMerge(total_revenue)    AS revenue,
    uniqMerge(unique_sessions) AS sessions
FROM daily_user_stats
GROUP BY date, user_id
ORDER BY date, user_id;

At query time, the GROUP BY operates on pre-reduced aggregate states rather than raw rows, dramatically reducing the number of hash table entries and per-entry state size. Note that the query still requires GROUP BY with -Merge functions; ClickHouse does not eliminate the operator, but operating on pre-aggregated states makes it orders of magnitude cheaper.

5. Enable streaming aggregation for primary-key queries

When the GROUP BY columns form a prefix of the table's ORDER BY (primary key), optimize_aggregation_in_order allows ClickHouse to emit completed group results incrementally as data is read in order, rather than accumulating a full hash table:

-- Only effective when GROUP BY (date, user_id) matches ORDER BY prefix
SELECT date, user_id, count()
FROM events  -- assume: ORDER BY (date, user_id, ...)
GROUP BY date, user_id
SETTINGS optimize_aggregation_in_order = 1;

This setting is disabled by default because it reduces parallelism. It is only effective when the GROUP BY columns exactly match the table's sorting key prefix — enabling it on mismatched queries has no effect.

6. Optimize distributed aggregation on the coordinator

For distributed queries, distributed_aggregation_memory_efficient = 1 changes how the coordinator merges partial states from shards. Instead of loading all partial states into memory at once, the coordinator processes them in a streaming fashion:

SELECT user_id, sum(revenue)
FROM distributed_events
GROUP BY user_id
SETTINGS distributed_aggregation_memory_efficient = 1;

This reduces coordinator node memory at the cost of slightly slower merge throughput.

Root-Cause Analysis

Identify which queries are driving aggregation memory pressure and whether spilling is already occurring:

-- Top memory consumers in the past 24 hours
SELECT
    type,
    event_time,
    initial_query_id,
    formatReadableSize(memory_usage)      AS memory,
    normalizedQueryHash(query)            AS normalized_query_hash,
    query
FROM system.query_log
WHERE type != 'QueryStart'
  AND event_time >= now() - INTERVAL 1 DAY
ORDER BY memory_usage DESC
LIMIT 10;
-- Queries that spilled GROUP BY data to disk
SELECT
    query_id,
    event_time,
    formatReadableSize(memory_usage)                               AS memory,
    ProfileEvents['ExternalAggregationWritePart']                  AS spill_writes,
    formatReadableSize(ProfileEvents['ExternalAggregationCompressedBytes'])   AS spill_compressed,
    formatReadableSize(ProfileEvents['ExternalAggregationUncompressedBytes']) AS spill_uncompressed,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND ProfileEvents['ExternalAggregationWritePart'] > 0
  AND event_date >= today() - 1
ORDER BY ProfileEvents['ExternalAggregationCompressedBytes'] DESC
LIMIT 20;
-- Queries that spilled ORDER BY data to disk
SELECT
    query_id,
    event_time,
    formatReadableSize(memory_usage)                            AS memory,
    ProfileEvents['ExternalSortWritePart']                      AS sort_spill_writes,
    formatReadableSize(ProfileEvents['ExternalSortCompressedBytes']) AS sort_spill_compressed,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND ProfileEvents['ExternalSortWritePart'] > 0
  AND event_date >= today() - 1
ORDER BY ProfileEvents['ExternalSortCompressedBytes'] DESC
LIMIT 20;
-- Currently running queries sorted by peak memory (live diagnosis)
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage)      AS memory_usage,
    formatReadableSize(peak_memory_usage) AS peak_memory,
    query
FROM system.processes
ORDER BY peak_memory_usage DESC
LIMIT 10;
-- Memory usage patterns by query shape over 7 days
SELECT
    count()                            AS nb_queries,
    user,
    formatReadableSize(max(memory_usage))         AS max_memory,
    formatReadableSize(avg(memory_usage))         AS avg_memory,
    formatReadableSize(quantile(0.97)(memory_usage)) AS p97_memory,
    normalizedQueryHash(query)         AS normalized_query_hash,
    any(query)                         AS sample_query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 7 DAY
  AND query_kind = 'Select'
GROUP BY normalized_query_hash, user
ORDER BY max(memory_usage) DESC
LIMIT 20;

The ProfileEvents keys to watch are ExternalAggregationWritePart, ExternalAggregationMerge, ExternalAggregationCompressedBytes, ExternalSortWritePart, ExternalSortMerge, and ExternalSortCompressedBytes. A non-zero ExternalAggregationWritePart confirms that spilling is happening (or would happen if you enable it). A query failing with Code 241 on a GROUP BY with zero spill writes confirms that max_bytes_before_external_group_by is not set.

Preventive Measures

  • Enable max_bytes_before_external_group_by and max_bytes_before_external_sort cluster-wide in a default settings profile, set to half of max_memory_usage. This is a safety net — most queries will never spill, but the rare heavy one will complete instead of failing.
  • Always pair ORDER BY with a LIMIT in application queries. There are very few cases where a client needs millions of unsorted rows returned all at once.
  • Use uniqCombined or uniqHLL12 instead of uniqExact for high-cardinality distinct counts in dashboards and monitoring queries where a 1-3% approximation is acceptable.
  • Route recurring high-cardinality aggregations through an AggregatingMergeTree + materialized view pattern rather than scanning raw tables at query time.
  • Ensure /var/lib/clickhouse/tmp/ is on a fast disk with adequate capacity. A full tmp partition will cause spilling to fail mid-query, producing a different error. Monitor available space as a separate alert.
  • Track ProfileEvents['ExternalAggregationWritePart'] in system.query_log over time. Frequent spilling on the same query shapes signals that those queries should be rewritten or pre-aggregated.
  • Set max_memory_usage_for_user for each user profile rather than relying solely on the per-query limit. This prevents a single user with many concurrent queries from exhausting the server tracker.

Resolve GROUP BY and ORDER BY Memory Errors Automatically with Pulse

Pulse is an AI DBA for ClickHouse that monitors system.query_log in real time. When MEMORY_LIMIT_EXCEEDED fires on a GROUP BY or ORDER BY query, Pulse identifies whether max_bytes_before_external_group_by is absent or misconfigured, checks whether max_memory_usage is at least twice the spill threshold, pinpoints high-cardinality aggregations that would benefit from AggregatingMergeTree, and recommends the precise settings change or schema pattern to eliminate the error permanently. Start a free trial.

Frequently Asked Questions

Q: Why do I need to set max_memory_usage to twice max_bytes_before_external_group_by?
A: External aggregation has two phases. Phase 1 reads data and builds partial aggregate states, spilling to disk when the threshold is reached. Phase 2 merges those spilled files back into a final result. The merge phase requires nearly as much memory as the read phase. If max_memory_usage equals max_bytes_before_external_group_by, phase 2 will exceed the limit and the query will fail. The safe pattern is: max_bytes_before_external_group_by = X, max_memory_usage = 2X or more.

Q: Does enabling external aggregation always make queries slower?
A: Yes. When data is flushed to disk, query runtime typically increases by roughly 3x compared to a pure in-memory execution. The penalty depends on disk speed — NVMe SSDs minimize it significantly. External aggregation is a correctness backstop, not a performance optimization. If a query spills on every run, that is a signal to pre-aggregate the data with AggregatingMergeTree.

Q: max_bytes_ratio_before_external_group_by — should I use this instead of max_bytes_before_external_group_by?
A: The ratio variant (available since ClickHouse 24.12) is simpler to maintain because it scales automatically when max_memory_usage changes. Setting max_bytes_ratio_before_external_group_by = 0.5 is equivalent to setting max_bytes_before_external_group_by to half of max_memory_usage. In ClickHouse 26.5 and later this ratio defaults to 0.5, so external aggregation is on by default. For versions 24.12 through 26.4, you must set it explicitly; for older versions, use the absolute byte setting.

Q: My error says (total) instead of (for query). Does enabling external aggregation help?
A: (total) means the server-level tracker (max_server_memory_usage, default ~90% of RAM) was exhausted by the combined memory of all running queries, not just one. Enabling external aggregation for individual queries helps by reducing each query's peak allocation, which in turn reduces the aggregate server-side total. However, the root fix may be to limit query concurrency or raise server RAM. External aggregation reduces per-query peaks, not query count.

Q: Can I apply these settings cluster-wide so I do not have to add SETTINGS to every query?
A: Yes. Add the settings to the default user profile in users.xml or via a CREATE SETTINGS PROFILE statement:

CREATE SETTINGS PROFILE heavy_aggregation
    SETTINGS
        max_bytes_before_external_group_by = 10000000000,
        max_memory_usage = 20000000000,
        max_bytes_before_external_sort = 10000000000;

Apply the profile to users or roles to enforce it without modifying individual queries.

Q: Does optimize_aggregation_in_order always reduce memory for GROUP BY?
A: Only when the GROUP BY columns form an exact prefix of the table's ORDER BY / primary key. If the condition is not met, the setting has no effect — ClickHouse falls back to the standard hash table approach. It is disabled by default because it reduces parallelism, which typically outweighs the memory savings for well-fitted queries.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.