ClickHouse Memory Usage Diagnosis: Where Did It Go?

Q: My server's RSS keeps growing but no query is using much memory. Is this a leak?

Usually not. Run SYSTEM JEMALLOC PURGE , then check primary key memory in system.parts and dictionary memory in system.dictionaries . Both grow over time as you add data.

Q: How do I cap memory per query?

Set max_memory_usage on the user profile. The default is 10 GB per query.

A ClickHouse process can use far more RAM than the sum of running queries, and the breakdown isn't obvious from top. Memory ends up in caches, dictionaries loaded into RAM, primary keys, active merges, temporary in-memory tables, async insert buffers, and the jemalloc allocator's slack. This article gives you a category-by-category SQL walkthrough for finding the consumer.

Step zero: purge jemalloc

Before believing the number in system.asynchronous_metrics, run:

SYSTEM JEMALLOC PURGE;

jemalloc holds onto freed memory in case the application asks for it again. The purge returns dirty pages to the OS and gives you an accurate MemoryResident reading. If RSS dropped substantially after the purge, you didn't have a leak, just allocator slack.

The headline number

SELECT
    formatReadableSize(value) AS mem
FROM system.asynchronous_metrics
WHERE metric = 'MemoryResident';

This is the same number as RES in top. Everything below tries to explain it.

Caches

The mark cache and uncompressed cache are large and explicit allocations:

SELECT
    metric,
    formatReadableSize(value) AS bytes
FROM system.asynchronous_metrics
WHERE metric IN (
    'MarkCacheBytes',
    'UncompressedCacheBytes',
    'CompiledExpressionCacheBytes'
)
ORDER BY value DESC;

MarkCacheBytes defaults to a cap of 5 GiB and tends to stay near the cap on busy servers. UncompressedCacheBytes is zero unless the cache is enabled (it usually isn't).

The MMappedFiles cache holds memory-mapped data files. Visible separately:

SELECT formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric = 'MMapCacheCells';

Primary keys in memory

Every active part keeps its primary key index resident. For a sharded cluster with many parts, this adds up:

SELECT
    database,
    table,
    formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) AS pk_mem
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(primary_key_bytes_in_memory_allocated) DESC
LIMIT 20;

A single table with hundreds of partitions and millions of granules can sit on multiple GB of primary key memory.

Dictionaries

Loaded dictionaries are pinned in RAM. List them by size:

SELECT
    database,
    name,
    type,
    formatReadableSize(bytes_allocated) AS bytes
FROM system.dictionaries
ORDER BY bytes_allocated DESC;

hashed and complex_key_hashed dictionaries are the usual offenders. If a dictionary is large enough to matter and is only used occasionally, switch it to cache or direct layout so it doesn't sit in memory full-time.

Active merges

A wide-part merge on a table with hundreds of columns can allocate gigabytes. See the companion article on system.metric_log merge memory for the worst-case pattern.

SELECT
    database,
    table,
    formatReadableSize(memory_usage) AS mem,
    elapsed,
    progress
FROM system.merges
ORDER BY memory_usage DESC;

Running queries

system.processes includes per-query memory:

SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    formatReadableSize(peak_memory_usage) AS peak,
    substring(query, 1, 120) AS query
FROM system.processes
ORDER BY memory_usage DESC;

peak_memory_usage is the max during the query so far; on a long-running aggregation it can be much higher than memory_usage at any single instant.

Temporary in-memory tables

The Memory, Set, Join, and Buffer engines all hold data in RAM. Some are user tables, some are created behind the scenes by JOIN and IN clauses.

SELECT
    database,
    name,
    engine,
    formatReadableSize(total_bytes) AS bytes
FROM system.tables
WHERE engine IN ('Memory', 'Set', 'Join', 'Buffer')
ORDER BY total_bytes DESC;

Buffer tables in particular can be surprising: a buffer in front of a high-write MergeTree will hold up to max_bytes of data at all times.

In-memory parts

InMemory part_type was an experimental feature for fast inserts. Confirm none are sticking around:

SELECT
    database,
    table,
    count() AS in_memory_parts,
    formatReadableSize(sum(bytes_on_disk)) AS bytes
FROM system.parts
WHERE active AND part_type = 'InMemory'
GROUP BY database, table;

Asynchronous inserts

If you enable async_insert, inserted data sits in server-side buffers until flushed:

SELECT
    database,
    table,
    formatReadableSize(sum(total_bytes)) AS buffered
FROM system.asynchronous_inserts
GROUP BY database, table
ORDER BY sum(total_bytes) DESC;

Historical memory peaks

When you're investigating "why did RSS spike at 03:14 last night", the running queries are long gone. system.query_log keeps memory_usage and peak_memory_usage per query:

SELECT
    event_time,
    query_id,
    user,
    query_kind,
    formatReadableSize(memory_usage) AS mem,
    formatReadableSize(peak_memory_usage) AS peak,
    substring(query, 1, 120) AS query
FROM system.query_log
WHERE event_time BETWEEN now() - INTERVAL 24 HOUR AND now()
  AND type = 'QueryFinish'
ORDER BY peak_memory_usage DESC
LIMIT 20;

system.part_log is the equivalent for merges and mutations:

SELECT
    event_time,
    database,
    table,
    event_type,
    formatReadableSize(peak_memory_usage) AS peak,
    duration_ms
FROM system.part_log
WHERE event_time > now() - INTERVAL 24 HOUR
ORDER BY peak_memory_usage DESC
LIMIT 20;

Memory trace sampling

For repeating spikes that none of the above explain, enable memory-profile sampling and inspect system.trace_log:

SET memory_profiler_step = 4194304;
SET memory_profiler_sample_probability = 0.01;
-- run workload
SELECT
    arrayStringConcat(arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n') AS stack,
    count() AS samples,
    sum(size) AS bytes
FROM system.trace_log
WHERE trace_type = 'MemorySample'
  AND event_time > now() - INTERVAL 10 MINUTE
GROUP BY trace
ORDER BY bytes DESC
LIMIT 10;

Each row is a stack trace that allocated memory, sampled at the configured probability.

Server log fallback

When SQL access is impaired, the text log carries MemoryTracker lines that show high-water marks per server, per user, and per query:

grep MemoryTracker /var/log/clickhouse-server/clickhouse-server.log | tail -50
zgrep MemoryTracker /var/log/clickhouse-server/clickhouse-server.log.*.gz | tail

Putting it together

A typical breakdown query for an at-a-glance check:

WITH (SELECT value FROM system.asynchronous_metrics WHERE metric = 'MemoryResident') AS rss
SELECT 'RSS' AS category, formatReadableSize(rss) AS bytes
UNION ALL SELECT 'Mark cache', formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric = 'MarkCacheBytes'
UNION ALL SELECT 'Uncompressed cache', formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric = 'UncompressedCacheBytes'
UNION ALL SELECT 'Primary keys', formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) FROM system.parts WHERE active
UNION ALL SELECT 'Dictionaries', formatReadableSize(sum(bytes_allocated)) FROM system.dictionaries
UNION ALL SELECT 'Active merges', formatReadableSize(sum(memory_usage)) FROM system.merges
UNION ALL SELECT 'Running queries', formatReadableSize(sum(memory_usage)) FROM system.processes;

The categories won't add up to RSS exactly (jemalloc slack, internal pools, network buffers), but if they're not within shouting distance you have an unidentified consumer to chase.

Common Pitfalls

Assuming RSS == leaked memory. Run SYSTEM JEMALLOC PURGE first.
Forgetting that Buffer engine tables always hold their configured maximum bytes in RAM.
Looking at memory_usage for running queries and missing peak_memory_usage. The peak is what crashed the server, not the current value.
Treating dictionary memory as a constant. Refreshing a large hashed dictionary temporarily doubles its memory.

Frequently Asked Questions

Q: My server's RSS keeps growing but no query is using much memory. Is this a leak? A: Usually not. Run SYSTEM JEMALLOC PURGE, then check primary key memory in system.parts and dictionary memory in system.dictionaries. Both grow over time as you add data.

Q: How do I find the largest memory query from yesterday? A: SELECT * FROM system.query_log WHERE event_date = yesterday() AND type = 'QueryFinish' ORDER BY peak_memory_usage DESC LIMIT 10;

Q: How do I cap memory per query? A: Set max_memory_usage on the user profile. The default is 10 GB per query.

Q: What is MemoryTracking in system.metrics? A: It's the sum of all tracked allocations on the server. It is the number ClickHouse compares against max_server_memory_usage.

Q: Does SYSTEM JEMALLOC PURGE block queries? A: It briefly stalls allocations while pages are returned to the OS. The effect is in milliseconds for most servers but can be longer on systems with very fragmented heaps.