NEW

Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

ClickHouse Aggressive Merges Tuning Guide

Aggressive merge tuning is the right move when ClickHouse keeps accumulating parts faster than the default background scheduler can merge them. Symptoms include rising part counts in system.parts, Too many parts rejections, slow SELECT FINAL queries, and large system.replication_queue backlogs. The fix is to grant the merge subsystem more threads, more concurrent tasks, and larger per-merge ceilings, then verify the pool is actually busy. This guide lists the knobs that matter, sensible starting values, and the trade-offs you accept when you push them up.

When to Tune for Aggressive Merges

Default ClickHouse settings target balanced read/write workloads on modest hardware. They are conservative because aggressive merging consumes CPU, RAM, and disk bandwidth that would otherwise serve queries. Push the merge subsystem harder when one or more of the following hold:

  • Ingest is bursty or sustained at high rate, producing many small parts per partition.
  • The node has spare CPU cores and disk bandwidth during ingest windows.
  • Latency-sensitive reads do not run on the same node, or run on a separate replica.
  • Queries depend on collapsing engines (ReplacingMergeTree, CollapsingMergeTree, AggregatingMergeTree) and FINAL performance suffers from too many unmerged parts.

If the node must sustain low-latency reads and writes on the same hardware, raise these values cautiously, since every thread you give to merges is a thread you take away from queries.

Server-Level Settings

These go in config.xml (or a drop-in under config.d/). They require a server restart, except where noted.

Setting Default Purpose
background_pool_size 16 Threads available for merges and mutations.
background_merges_mutations_concurrency_ratio 2 Maximum tasks per thread in the pool. Effective task ceiling is background_pool_size * ratio.
background_merges_mutations_scheduling_policy round_robin Choose shortest_task_first to drain small merges quickly, or round_robin for fairness.
background_schedule_pool_size 512 Scheduler threads for replicated table coordination, distributed sends, materialized views.
background_common_pool_size 8 Threads for non-merge background tasks (TTL moves, async loads).

A common aggressive baseline on a 32-core ingest node:

<clickhouse>
  <background_pool_size>36</background_pool_size>
  <background_common_pool_size>8</background_common_pool_size>
  <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
</clickhouse>

Setting concurrency_ratio to 1 keeps one task per thread; this reduces context switching when merges are large and CPU-bound. Raise it back to 2 if your merges are I/O-bound and threads sit idle.

MergeTree-Level Settings

These go under <merge_tree> in config.xml, or as SETTINGS on individual tables. They can also be altered per table:

ALTER TABLE events MODIFY SETTING max_bytes_to_merge_at_max_space_in_pool = 161061273600;
Setting Default Purpose
max_bytes_to_merge_at_max_space_in_pool 161061273600 (150 GiB) Largest target part size when the merge pool has free capacity. Raise on big-disk nodes to allow fewer, larger parts.
number_of_free_entries_in_pool_to_lower_max_size_of_merge 8 When the pool drops below this many free slots, the maximum mergeable part size shrinks. Increase to keep merges small under pressure.
max_replicated_merges_in_queue 1000 Maximum merge entries the replication queue accepts. Raise alongside background_pool_size on busy replicated tables.
min_merge_bytes_to_use_direct_io 10737418240 (10 GiB) Merges larger than this skip the page cache and use direct I/O. Lowers cache pollution from huge merges.
execute_merges_on_single_replica_time_threshold 0 When set (in seconds), only one replica executes the merge; the others fetch the result. Cuts cluster-wide CPU.
min_bytes_for_wide_part / min_rows_for_wide_part 10485760 / 0 Threshold for switching to wide part format. Wide parts merge faster on large data.

Full aggressive example combining server and merge_tree sections:

<clickhouse>
  <background_pool_size>36</background_pool_size>
  <background_common_pool_size>8</background_common_pool_size>
  <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
  <merge_tree>
    <number_of_free_entries_in_pool_to_lower_max_size_of_merge>32</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
    <max_replicated_merges_in_queue>36</max_replicated_merges_in_queue>
    <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
    <min_merge_bytes_to_use_direct_io>10737418240</min_merge_bytes_to_use_direct_io>
  </merge_tree>
</clickhouse>

Verifying the Pool Is Actually Busy

Tuning is pointless if the pool sits idle. Check thread utilization:

SELECT * FROM system.metrics WHERE metric LIKE '%PoolTask';

Key rows:

  • BackgroundMergesAndMutationsPoolTask: currently running merges and mutations.
  • BackgroundSchedulePoolTask: replicated coordination work in progress.
  • BackgroundFetchesPoolTask: replica fetches.

Compare against BackgroundMergesAndMutationsPoolSize. If running tasks plateau well below pool size while parts pile up, the bottleneck is elsewhere (disk, network, ZooKeeper/Keeper, or scheduling policy). If tasks sit at the ceiling and the queue keeps growing, raise the pool size.

Inspect active merges directly:

SELECT database, table, elapsed, progress, num_parts, total_size_bytes_compressed
FROM system.merges
ORDER BY elapsed DESC;

And the replicated queue:

SELECT type, count() FROM system.replication_queue GROUP BY type;

Common Pitfalls

  • Raising background_pool_size without raising max_replicated_merges_in_queue. The replication queue stays the bottleneck and new threads sit idle on Replicated tables.
  • Pushing max_bytes_to_merge_at_max_space_in_pool above the free disk available for a single merge. A merge needs temporary space roughly equal to the sum of source part sizes; running out of disk fails the merge and rolls back.
  • Aggressive merges on the same node serving queries. Expect spiky query latency. Pin merges to a dedicated replica using execute_merges_on_single_replica_time_threshold.
  • Forgetting background_merges_mutations_concurrency_ratio. The effective task count is pool_size times this ratio. A ratio of 2 with pool_size 36 gives 72 concurrent tasks, which can swamp small disks.
  • Tuning without measuring. Always capture system.parts count, system.merges throughput, and disk I/O before and after a change.

Frequently Asked Questions

Q: What is the difference between background_pool_size and background_merges_mutations_concurrency_ratio? A: background_pool_size sets the number of OS threads dedicated to merges and mutations. The concurrency ratio caps how many task slots each thread can hold open. Effective concurrent task ceiling is background_pool_size * background_merges_mutations_concurrency_ratio.

Q: Will increasing max_bytes_to_merge_at_max_space_in_pool help with Too many parts? A: Indirectly. A larger ceiling lets ClickHouse merge bigger parts together when the pool has free slots, reducing the long-term part count. It does not help when the pool is already saturated. Raise background_pool_size and number_of_free_entries_in_pool_to_lower_max_size_of_merge first.

Q: Should I set background_merges_mutations_scheduling_policy to shortest_task_first? A: Use shortest_task_first when you want small merges to drain quickly and reduce part count under pressure. Use round_robin (the default) when you want fairness across tables and predictable merge latency. The former helps with Too many parts; the latter avoids starving large merges.

Q: How do I make merges run on only one replica? A: Set execute_merges_on_single_replica_time_threshold to a value in seconds (for example, 600). Replicas wait that long for one replica to complete the merge before falling back to merging locally. This cuts cluster-wide merge CPU significantly on replicated tables.

Q: Can I change merge settings without restarting ClickHouse? A: MergeTree-level settings on individual tables are changeable via ALTER TABLE ... MODIFY SETTING. Server-level settings under <background_*> and the global <merge_tree> block require a restart. Use table-level overrides for hot fixes on specific tables.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.