ClickHouse Background Merges Dominate CPU and IO

You open top on a ClickHouse server during a quiet period and find 200–400% CPU sustained across multiple cores. No complex queries are running. The culprit is almost always background merges: the continuous compaction work that MergeTree engines perform to consolidate the many small parts produced by inserts into fewer, larger parts. This is expected behavior, but it can saturate resources and compete with query workloads when it is not sized or tuned correctly.

What This Error Means

ClickHouse's MergeTree engine uses an LSM-inspired write path. Every INSERT writes one or more immutable, sorted data parts directly to disk — one part per partition key value touched in that insert. Left unmerged, these accumulate rapidly. Background merges consolidate smaller parts into larger ones continuously, until parts approach the max_bytes_to_merge_at_max_space_in_pool ceiling (default 150 GiB compressed).

Each merge reads the source parts, decompresses and deserializes the columnar data, sorts and merges rows, recompresses the result, and writes it back to disk. This is fundamentally CPU- and IO-intensive work. A server ingesting at any meaningful rate will have background merges running at all times.

The default thread pool is controlled by background_pool_size (default 16 threads) multiplied by background_merges_mutations_concurrency_ratio (default 2.0), giving up to 32 concurrent merge tasks. On a server with 8 vCPUs, this means merges can legitimately occupy all cores for extended periods.

Common Causes

  1. Small or frequent inserts. Each INSERT writes at least one new part per touched partition. Hundreds of small inserts per second each create their own parts, which accumulate faster than the default pool can merge them. The recommended insert batch size is at least 1,000 rows per INSERT per partition, ideally 10,000–100,000.

  2. Over-partitioned tables. A partition key like toDateTime(event_time) (partitioned by second) creates a new partition — and thus a new part — for every second of data. Even large batches then produce many parts across many partitions, overwhelming the merge scheduler.

  3. Background pool undersized for the ingest rate. The default background_pool_size = 16 is conservative. High-ingest nodes can exhaust the pool, leaving parts accumulating while threads are all occupied with in-flight merges.

  4. Disk IO saturation. Merges larger than min_merge_bytes_to_use_direct_io (default 10 GiB) use O_DIRECT to bypass the page cache. If disk throughput is already near capacity, individual merges slow down, the pool stays full, and new parts back up behind them.

  5. Mutations running concurrently. ALTER TABLE ... UPDATE/DELETE mutations run through the same background merge pool as regular merges. A large mutation can monopolize pool slots for hours, starving regular part consolidation. Mutations will not be scheduled unless the pool has at least number_of_free_entries_in_pool_to_execute_mutation (default 20) free slots.

  6. OPTIMIZE TABLE ... FINAL called in production. This command forces an immediate merge of all parts to one per partition, bypassing the normal scheduler. It is blocking and resource-intensive and will dominate CPU and IO for the duration. Avoid it in production except for specific use cases like forcing deduplication with ReplacingMergeTree.

How to Fix

1. Increase insert batch size

The most impactful change is reducing the number of parts created per unit of time. Batch inserts from the application layer to at least 10,000 rows per call per partition. Where the client cannot batch (e.g., event-driven architectures with many small producers), enable asynchronous inserts on the server:

-- Per-session or per-user setting
SET async_insert = 1;
SET async_insert_max_data_size = 10485760;       -- flush when buffer hits 10 MiB
SET async_insert_busy_timeout_max_ms = 200;       -- or after 200 ms

Asynchronous inserts buffer multiple small client inserts server-side and flush them as a single batched part, achieving the same effect as client-side batching without changing application code.

2. Fix the partition key

A partition key should produce partitions in the range of months or weeks — not days or hours unless you have a very high write rate. A partition per second or per hour will produce an unmanageable number of parts even with large batches:

-- Avoid: partitions by day on a busy table with many short time windows
-- PARTITION BY toYYYYMMDD(event_time)

-- Better: partition by month on most tables
-- PARTITION BY toYYYYMM(event_time)

Changing the partition key requires recreating the table. Evaluate this early; it is difficult to change later.

3. Increase background pool size

If the pool is consistently full and disk has headroom, raise background_pool_size. This is a server-level setting that goes in config.xml or a file under config.d/. Increasing it takes effect immediately; decreasing it requires a restart:

<clickhouse>
  <background_pool_size>32</background_pool_size>
  <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
</clickhouse>

Setting background_merges_mutations_concurrency_ratio to 1 keeps one task per thread, reducing context switching when merges are CPU-bound. Use 2 when merges are IO-bound and threads often idle waiting on disk.

4. Change the scheduling policy

The default round_robin policy distributes merge work fairly across tables. When part counts are climbing and you need small parts merged quickly, switch to shortest_task_first:

<background_merges_mutations_scheduling_policy>shortest_task_first</background_merges_mutations_scheduling_policy>

This prioritizes small merges, which reduces part debt faster. The trade-off is that large merges can be starved if small ones keep arriving — switch back to round_robin if you see large parts never completing.

5. Adjust per-table merge pressure thresholds

If INSERT throttling or rejection begins before the merge pool can catch up, tune the delay and throw thresholds at the table level. These defaults (since version 23.6) are generous:

ALTER TABLE my_table MODIFY SETTING
    parts_to_delay_insert = 500,
    parts_to_throw_insert = 1000,
    number_of_free_entries_in_pool_to_lower_max_size_of_merge = 4;

Lowering parts_to_delay_insert and parts_to_throw_insert applies INSERT back-pressure earlier, giving the merge pool more time to catch up before rejection thresholds are hit. Lowering number_of_free_entries_in_pool_to_lower_max_size_of_merge (default 8) reduces how aggressively ClickHouse shrinks the maximum merge size under pool pressure.

6. Pause merges during bulk loads

When performing a large one-time bulk load, stop merges for the target table to prevent the background pool from thrashing on thousands of tiny parts. Resume merges after the load completes and let the pool work through the backlog at full capacity:

SYSTEM STOP MERGES my_database.my_table;

-- perform bulk INSERT operations here

SYSTEM START MERGES my_database.my_table;

Root-Cause Analysis

Check what is currently merging

SELECT
    database,
    table,
    round(elapsed, 0)                                               AS elapsed_sec,
    round(progress, 4)                                             AS progress,
    formatReadableTimeDelta((elapsed / progress) - elapsed)        AS eta,
    num_parts,
    formatReadableSize(total_size_bytes_compressed)                AS total_compressed,
    formatReadableSize(memory_usage)                               AS memory_usage,
    merge_algorithm,
    result_part_name
FROM system.merges
ORDER BY elapsed DESC;

If system.merges consistently shows the pool near capacity (close to background_pool_size * ratio rows), the pool is the bottleneck. If individual merges show slow progress despite few concurrent merges, IO is the bottleneck.

Check merge pool utilization

SELECT metric, value
FROM system.metrics
WHERE metric IN (
    'BackgroundMergesAndMutationsPoolTask',
    'BackgroundMergesAndMutationsPoolSize'
);

BackgroundMergesAndMutationsPoolTask at or near BackgroundMergesAndMutationsPoolSize means the pool is saturated.

Find which partitions have the most parts

SELECT
    database,
    table,
    partition_id,
    count()                              AS active_parts,
    formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
HAVING active_parts > 100
ORDER BY active_parts DESC
LIMIT 50;

Track the MaxPartCountForPartition metric over time

MaxPartCountForPartition is the key merge health indicator: the highest active part count in any single partition across all MergeTree tables. Values consistently above 100 indicate merge lag. Values approaching parts_to_delay_insert (default 1000) mean INSERT throttling is beginning:

-- Current value
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric = 'MaxPartCountForPartition';

-- Trend over the last 24 hours
SELECT
    toStartOfMinute(event_time) AS minute,
    avg(value)                  AS avg_max_parts
FROM system.asynchronous_metric_log
WHERE metric = 'MaxPartCountForPartition'
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY minute
ORDER BY minute ASC;

Measure the insert rate as new parts per minute

SELECT
    toStartOfMinute(event_time) AS minute,
    table,
    count()                               AS new_parts,
    sum(rows)                             AS total_rows,
    formatReadableSize(sum(size_in_bytes)) AS total_bytes
FROM system.part_log
WHERE event_type = 'NewPart'
  AND event_time > now() - INTERVAL 2 HOUR
GROUP BY minute, table
ORDER BY minute ASC, table;

Compare the new-parts-per-minute rate against merge completions from event_type = 'MergeParts' in the same window. If new parts arrive faster than merges complete, the backlog will grow.

Check INSERT latency for throttling symptoms

SELECT
    toStartOfMinute(event_time) AS minute,
    avg(query_duration_ms)      AS avg_insert_duration_ms,
    count()                     AS insert_count
FROM system.query_log
WHERE query_kind = 'Insert'
  AND type = 'QueryFinish'
  AND event_time > now() - INTERVAL 3 HOUR
GROUP BY minute
ORDER BY minute ASC;

A spike in average INSERT duration without a spike in data volume written indicates ClickHouse is throttling inserts with a computed delay, which means MaxPartCountForPartition is approaching parts_to_delay_insert.

Identify tables with the most total active parts

SELECT
    concat(database, '.', table)          AS full_table,
    countIf(active = 1)                   AS active_parts,
    countIf(active = 0)                   AS inactive_parts,
    formatReadableSize(sumIf(bytes_on_disk, active = 1)) AS active_size
FROM system.parts
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 20;

Preventive Measures

  • Insert in batches of at least 10,000 rows per partition per INSERT, or enable async_insert for client-side aggregation.
  • Design partition keys to produce hundreds of partitions per table — not thousands or tens of thousands.
  • Monitor MaxPartCountForPartition continuously and alert at 200. By the time it reaches 800 (near the default parts_to_delay_insert of 1000), you are already in trouble.
  • Size nodes with 30–50% of cores budgeted for background merges at sustained ingest rates. Merges are not optional overhead — they are the write path.
  • Avoid OPTIMIZE TABLE ... FINAL in production. It bypasses the scheduler and can saturate IO for large tables.
  • On replicated tables, consider setting execute_merges_on_single_replica_time_threshold to pin merge work to a dedicated replica, leaving other replicas free for query traffic.

Resolve Merge Pressure Automatically with Pulse

Pulse monitors MaxPartCountForPartition, background pool utilization, and INSERT throttling metrics continuously across your ClickHouse clusters. When merge lag builds, Pulse surfaces the affected tables and partitions, the insert rate producing the backlog, and the specific settings changes most likely to resolve it — without requiring manual queries against system tables. Visit pulse.support to learn more.

Frequently Asked Questions

Q: Is it normal for ClickHouse to use high CPU with no queries running?
A: Yes. Background merges are continuous and CPU-intensive. On a server with active ingestion, sustained multi-core CPU usage from merges is expected and correct. The concern is when merges are so aggressive that query latency suffers — that is a sizing or tuning problem, not a sign that merges should be disabled.

Q: How do I know if merges are competing with my queries?
A: Correlate query latency from system.query_log with merge pool utilization from system.metrics over the same time window. If BackgroundMergesAndMutationsPoolTask is at ceiling when p99 query latency spikes, the pool is crowding out query threads. Reduce background_pool_size or move write-heavy tables to a dedicated replica.

Q: Should I set background_pool_size to a small value to protect query performance?
A: Only with caution. Reducing the pool protects query CPU at the cost of letting parts accumulate. If parts accumulate past parts_to_delay_insert (default 1000 per partition since version 23.6), INSERTs begin incurring forced delays. Past parts_to_throw_insert (default 3000), INSERTs are rejected entirely. The right fix is to reduce the part creation rate (larger batches, fixed partition key) rather than starving the merge pool.

Q: What is the difference between parts_to_delay_insert and parts_to_throw_insert?
A: parts_to_delay_insert (default 1000) is the active-parts-per-partition threshold at which ClickHouse starts injecting an artificial delay into each INSERT, computed as a fraction of max_delay_to_insert. parts_to_throw_insert (default 3000) is the threshold at which INSERTs are rejected outright with Too many parts (N). Merges are processing significantly slower than inserts. Both defaults were raised from 150 and 300 respectively in version 23.6.

Q: Can I see which merge algorithm was used?
A: Yes. system.merges has a merge_algorithm column showing Horizontal or Vertical for currently running merges. For historical merges, check system.part_log where event_type = 'MergeParts' — the merge_algorithm column is present there too. The vertical algorithm activates only when the merging parts have at least 131,072 total rows and the table has at least 11 non-primary-key columns; it is more memory-efficient for wide tables because it processes columns sequentially rather than all at once.

Q: Will enabling async inserts hide insert failures?
A: By default, async_insert with wait_for_async_insert = 1 (the default) blocks the client until the buffer is flushed and the data is written, so the client receives errors normally. Setting wait_for_async_insert = 0 makes inserts fire-and-forget, which increases throughput but means failures are only visible in server logs and system.query_log. Choose based on whether your application can tolerate silent insert failures.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.