ClickHouse Aggressive Merges Tuning Guide

Q: What is the difference between `background_pool_size` and `background_merges_mutations_concurrency_ratio`?

background_pool_size sets the number of OS threads dedicated to merges and mutations. The concurrency ratio caps how many task slots each thread can hold open. Effective concurrent task ceiling is background_pool_size * background_merges_mutations_concurrency_ratio .

Q: Should I set `background_merges_mutations_scheduling_policy` to `shortest_task_first`?

Use shortest_task_first when you want small merges to drain quickly and reduce part count under pressure. Use round_robin (the default) when you want fairness across tables and predictable merge latency. The former helps with Too many parts ; the latter avoids starving large merges.

Q: How do I make merges run on only one replica?

Set execute_merges_on_single_replica_time_threshold to a value in seconds (for example, 600). Replicas wait that long for one replica to complete the merge before falling back to merging locally. This cuts cluster-wide merge CPU significantly on replicated tables.

Aggressive merge tuning is the right move when ClickHouse keeps accumulating parts faster than the default background scheduler can merge them. Symptoms include rising part counts in system.parts, Too many parts rejections, slow SELECT FINAL queries, and large system.replication_queue backlogs. The fix is to grant the merge subsystem more threads, more concurrent tasks, and larger per-merge ceilings, then verify the pool is actually busy. This guide lists the knobs that matter, sensible starting values, and the trade-offs you accept when you push them up.

When to Tune for Aggressive Merges

Default ClickHouse settings target balanced read/write workloads on modest hardware. They are conservative because aggressive merging consumes CPU, RAM, and disk bandwidth that would otherwise serve queries. Push the merge subsystem harder when one or more of the following hold:

Ingest is bursty or sustained at high rate, producing many small parts per partition.
The node has spare CPU cores and disk bandwidth during ingest windows.
Latency-sensitive reads do not run on the same node, or run on a separate replica.
Queries depend on collapsing engines (ReplacingMergeTree, CollapsingMergeTree, AggregatingMergeTree) and FINAL performance suffers from too many unmerged parts.

If the node must sustain low-latency reads and writes on the same hardware, raise these values cautiously, since every thread you give to merges is a thread you take away from queries.

Server-Level Settings

These go in config.xml (or a drop-in under config.d/). They require a server restart, except where noted.

Setting	Default	Purpose
`background_pool_size`	16	Threads available for merges and mutations.
`background_merges_mutations_concurrency_ratio`	2	Maximum tasks per thread in the pool. Effective task ceiling is `background_pool_size * ratio`.
`background_merges_mutations_scheduling_policy`	`round_robin`	Choose `shortest_task_first` to drain small merges quickly, or `round_robin` for fairness.
`background_schedule_pool_size`	512	Scheduler threads for replicated table coordination, distributed sends, materialized views.
`background_common_pool_size`	8	Threads for non-merge background tasks (TTL moves, async loads).

A common aggressive baseline on a 32-core ingest node:

<clickhouse>
  <background_pool_size>36</background_pool_size>
  <background_common_pool_size>8</background_common_pool_size>
  <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
</clickhouse>

Setting concurrency_ratio to 1 keeps one task per thread; this reduces context switching when merges are large and CPU-bound. Raise it back to 2 if your merges are I/O-bound and threads sit idle.

MergeTree-Level Settings

These go under <merge_tree> in config.xml, or as SETTINGS on individual tables. They can also be altered per table:

ALTER TABLE events MODIFY SETTING max_bytes_to_merge_at_max_space_in_pool = 161061273600;

Setting	Default	Purpose
`max_bytes_to_merge_at_max_space_in_pool`	161061273600 (150 GiB)	Largest target part size when the merge pool has free capacity. Raise on big-disk nodes to allow fewer, larger parts.
`number_of_free_entries_in_pool_to_lower_max_size_of_merge`	8	When the pool drops below this many free slots, the maximum mergeable part size shrinks. Increase to keep merges small under pressure.
`max_replicated_merges_in_queue`	1000	Maximum merge entries the replication queue accepts. Raise alongside `background_pool_size` on busy replicated tables.
`min_merge_bytes_to_use_direct_io`	10737418240 (10 GiB)	Merges larger than this skip the page cache and use direct I/O. Lowers cache pollution from huge merges.
`execute_merges_on_single_replica_time_threshold`	0	When set (in seconds), only one replica executes the merge; the others fetch the result. Cuts cluster-wide CPU.
`min_bytes_for_wide_part` / `min_rows_for_wide_part`	10485760 / 0	Threshold for switching to wide part format. Wide parts merge faster on large data.

Full aggressive example combining server and merge_tree sections:

<clickhouse>
  <background_pool_size>36</background_pool_size>
  <background_common_pool_size>8</background_common_pool_size>
  <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
  <merge_tree>
    <number_of_free_entries_in_pool_to_lower_max_size_of_merge>32</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
    <max_replicated_merges_in_queue>36</max_replicated_merges_in_queue>
    <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
    <min_merge_bytes_to_use_direct_io>10737418240</min_merge_bytes_to_use_direct_io>
  </merge_tree>
</clickhouse>

Verifying the Pool Is Actually Busy

Tuning is pointless if the pool sits idle. Check thread utilization:

SELECT * FROM system.metrics WHERE metric LIKE '%PoolTask';

Key rows:

BackgroundMergesAndMutationsPoolTask: currently running merges and mutations.
BackgroundSchedulePoolTask: replicated coordination work in progress.
BackgroundFetchesPoolTask: replica fetches.

Compare against BackgroundMergesAndMutationsPoolSize. If running tasks plateau well below pool size while parts pile up, the bottleneck is elsewhere (disk, network, ZooKeeper/Keeper, or scheduling policy). If tasks sit at the ceiling and the queue keeps growing, raise the pool size.

Inspect active merges directly:

SELECT database, table, elapsed, progress, num_parts, total_size_bytes_compressed
FROM system.merges
ORDER BY elapsed DESC;

And the replicated queue:

SELECT type, count() FROM system.replication_queue GROUP BY type;

Common Pitfalls

Raising background_pool_size without raising max_replicated_merges_in_queue. The replication queue stays the bottleneck and new threads sit idle on Replicated tables.
Pushing max_bytes_to_merge_at_max_space_in_pool above the free disk available for a single merge. A merge needs temporary space roughly equal to the sum of source part sizes; running out of disk fails the merge and rolls back.
Aggressive merges on the same node serving queries. Expect spiky query latency. Pin merges to a dedicated replica using execute_merges_on_single_replica_time_threshold.
Forgetting background_merges_mutations_concurrency_ratio. The effective task count is pool_size times this ratio. A ratio of 2 with pool_size 36 gives 72 concurrent tasks, which can swamp small disks.
Tuning without measuring. Always capture system.parts count, system.merges throughput, and disk I/O before and after a change.

Frequently Asked Questions

Q: What is the difference between background_pool_size and background_merges_mutations_concurrency_ratio? A: background_pool_size sets the number of OS threads dedicated to merges and mutations. The concurrency ratio caps how many task slots each thread can hold open. Effective concurrent task ceiling is background_pool_size * background_merges_mutations_concurrency_ratio.

Q: Will increasing max_bytes_to_merge_at_max_space_in_pool help with Too many parts? A: Indirectly. A larger ceiling lets ClickHouse merge bigger parts together when the pool has free slots, reducing the long-term part count. It does not help when the pool is already saturated. Raise background_pool_size and number_of_free_entries_in_pool_to_lower_max_size_of_merge first.

Q: Should I set background_merges_mutations_scheduling_policy to shortest_task_first? A: Use shortest_task_first when you want small merges to drain quickly and reduce part count under pressure. Use round_robin (the default) when you want fairness across tables and predictable merge latency. The former helps with Too many parts; the latter avoids starving large merges.

Q: How do I make merges run on only one replica? A: Set execute_merges_on_single_replica_time_threshold to a value in seconds (for example, 600). Replicas wait that long for one replica to complete the merge before falling back to merging locally. This cuts cluster-wide merge CPU significantly on replicated tables.

Q: Can I change merge settings without restarting ClickHouse? A: MergeTree-level settings on individual tables are changeable via ALTER TABLE ... MODIFY SETTING. Server-level settings under <background_*> and the global <merge_tree> block require a restart. Use table-level overrides for hot fixes on specific tables.