A common surprise on production ClickHouse servers is a memory spike of several gigabytes coming from the merge of system.metric_log. The table is built from short rows containing many counter columns, and the default merge-tree thresholds put it in the worst possible spot: wide parts, many columns, horizontal merge. This article explains what is happening and which settings fix it.
The bug, in one sentence
system.metric_log has thousands of columns and very narrow rows, so it crosses the byte threshold for "wide" part format long before it crosses the row threshold for the cheaper "vertical" merge algorithm, leaving it stuck with horizontal merges that open every column at once.
Wide vs. compact parts
MergeTree stores each part either as Compact (all columns packed into one file) or Wide (one file per column). The switch is controlled by two settings:
min_bytes_for_wide_part(default 10 MB) - parts larger than this are written as wide.min_rows_for_wide_part(default 0, meaning disabled) - alternative row-based trigger.
metric_log rows average about 2.8 KB once the asynchronous metrics columns are added. That means a part crosses 10 MB after roughly 3,744 rows, which happens within minutes of a fresh server starting.
Horizontal vs. vertical merges
When merging wide parts, ClickHouse can use either algorithm:
- Horizontal merge: opens all column files for all source parts simultaneously, holds a buffer per (column, part) pair, then writes the merged columns to the destination.
- Vertical merge: processes columns in batches. Only a few columns are kept in memory at a time.
Vertical merges only activate when the part contains at least vertical_merge_algorithm_min_rows_to_activate rows (default 131,072) and vertical_merge_algorithm_min_columns_to_activate columns (default 11). With the default min_bytes_for_wide_part = 10 MB, metric_log switches to wide format at about 3,744 rows but does not qualify for vertical merge until 131,072 rows. The merge runs horizontally on a table with hundreds of columns, allocates a buffer for each (column, source part) combination, and memory usage explodes.
In one reproducer the same merge dropped from 6 GB to 14 MB of RAM when forced down the vertical path.
Local fix: override the table definition
If you only want to address metric_log itself, override the system table config so parts stay compact until they are large enough for vertical merges:
<metric_log replace="1">
<database>system</database>
<table>metric_log</table>
<engine>
ENGINE = MergeTree
PARTITION BY (event_date)
ORDER BY (event_time)
TTL event_date + INTERVAL 14 DAY DELETE
SETTINGS min_bytes_for_wide_part = 536870912
</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</metric_log>
536870912 is 512 MB. With this threshold the table stays in compact format until it has well over 131,072 rows, at which point a vertical merge becomes possible.
The same pattern applies to asynchronous_metric_log, query_log, query_thread_log, and any other wide system table that gives you merge memory grief.
Global fix: switch the trigger to row-based
A cleaner fix is to make the wide-vs-compact decision row-based across the whole server, so it aligns with the vertical merge threshold:
<clickhouse>
<merge_tree>
<min_bytes_for_wide_part>0</min_bytes_for_wide_part>
<min_rows_for_wide_part>131072</min_rows_for_wide_part>
</merge_tree>
</clickhouse>
min_bytes_for_wide_part = 0 disables the byte trigger. min_rows_for_wide_part = 131072 aligns the wide threshold with vertical_merge_algorithm_min_rows_to_activate, so by the time a part is wide it is also large enough to qualify for a vertical merge.
This setting becomes the default for every new table on the server, including user tables. If you have user tables where you specifically want the byte trigger, set it back at the table level.
Verifying the fix
After applying either fix, check that future parts are using the expected format:
SELECT
part_type,
count() AS parts,
sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'system' AND table = 'metric_log' AND active
GROUP BY part_type;
You want to see Compact parts on rows below the row threshold and Wide parts only above it. Monitor merge memory with:
SELECT
table,
formatReadableSize(memory_usage) AS mem,
elapsed,
progress
FROM system.merges
WHERE database = 'system'
ORDER BY memory_usage DESC;
Trade-offs
Pushing the wide threshold higher keeps data in compact format longer. Compact parts have lower insert overhead and fewer files (cheaper on object storage backends because each column file is a separate S3 object). The downside is that compact parts are less efficient for column-selective reads. For metric_log queries you almost always look at one or two columns at a time, so the read penalty is real, but the alternative is gigabyte-sized merges.
If your metric_log is purely for short-term observability, an even simpler fix is to drop the table's retention to a few days with TTL and accept that you never have parts large enough to need vertical merges.
Common Pitfalls
- Tuning
max_bytes_to_merge_at_max_space_in_poolthinking it caps memory. It caps merged data size, not memory usage. - Disabling
metric_logentirely with<metric_log remove="1"/>. That works, but you lose the historical telemetry that makes monitoring possible. - Setting
min_bytes_for_wide_part = 0without also settingmin_rows_for_wide_part. You then never get wide parts and forfeit column-selective read efficiency. - Applying the fix globally without considering that user tables with rare DELETE/UPDATE mutations may prefer wide parts up front.
Frequently Asked Questions
Q: Can I just turn off metric_log?
A: Yes, with <metric_log remove="1"/> in config.d/. You then lose the ability to graph historical ClickHouse-internal metrics, which most operators want to keep.
Q: Does this also affect query_log and query_thread_log?
A: Yes. They have similar shapes (many columns, modest row sizes) and are good candidates for the same fix.
Q: What ClickHouse version introduced vertical merges?
A: They have been present for years, but the defaults that produced the memory spike (min_bytes_for_wide_part = 10 MB combined with vertical_merge_algorithm_min_rows_to_activate = 131072) have been in place since around 20.x.
Q: Will setting min_bytes_for_wide_part = 0 globally break my user tables?
A: It won't break them, but tables that benefit from column-level skip indexes and selective reads will perform worse on aggregations. Apply globally only if your workload tolerates it, otherwise scope the change to the system database.
Q: How do I confirm a vertical merge was used?
A: Look at system.part_log for the rows produced by the merge: the merge_algorithm column shows Vertical or Horizontal.