ClickHouse Too Many Parts and Merge Backlog

Q: How do I know if the merge pool is the bottleneck vs. the insert rate?

Query system.metrics for BackgroundMergesAndMutationsPoolTask and BackgroundMergesAndMutationsPoolSize . If the task count equals the pool size, the pool is saturated and adding merge capacity (larger background_pool_size ) will help. If the pool is not saturated but parts are still accumulating, the bottleneck is on the insert side: inserts are creating parts faster than the (unsaturated) pool chooses to merge them, typically because many small parts are being created across many partitions simultaneously.

When inserts outpace background merges in ClickHouse, the server raises error code 252 with the message:

DB::Exception: Too many parts (N). Merges are processing significantly slower than inserts.

Before reaching the hard rejection threshold, ClickHouse silently throttles inserts with artificial delays. By the time inserts are rejected, the merge backlog is already large enough to affect query performance. This article explains the mechanics behind part accumulation, how to diagnose the severity, and how to resolve the root cause rather than just raising thresholds.

What This Error Means

ClickHouse's MergeTree engine writes every INSERT statement as a new immutable data part on disk. Background threads then merge smaller parts into larger ones to improve compression and query efficiency. The error fires when active parts in a single partition exceed the parts_to_throw_insert threshold.

Two distinct thresholds govern this process at the partition level:

parts_to_delay_insert -- When active parts in a partition exceed this value, ClickHouse introduces an artificial delay before each insert completes. The delay grows linearly as part count approaches the throw threshold. With the default settings (max_delay_to_insert = 1 second, min_delay_to_insert_ms = 10 ms), the formula is: delay_ms = max(10, 1000 * (parts_count - parts_to_delay_insert + 1) / (parts_to_throw_insert - parts_to_delay_insert)).
parts_to_throw_insert -- When active parts in a partition exceed this value, the insert is rejected outright with error code 252.

These thresholds changed significantly in ClickHouse 23.6. Before 23.6, the defaults were 150 (delay) and 300 (throw). From 23.6 onward, the defaults are 1000 (delay) and 3000 (throw). Many community resources and older blog posts still cite the pre-23.6 values, so always check your ClickHouse version before interpreting threshold-related advice.

A separate setting, max_parts_in_total (default: 100,000), applies across all partitions of a table combined. Either limit can trigger the error.

Common Causes

High-frequency small inserts. Each INSERT statement creates at least one new part per affected partition. Applications that insert row by row, in tiny micro-batches, or that call INSERT once per event will generate parts faster than any merge pool can consolidate them. This is the most common cause.
Over-partitioning. ClickHouse never merges parts across partition boundaries. A table partitioned by hour or by tenant ID may have hundreds or thousands of distinct partitions. Because each partition accumulates its own parts independently, the 3,000-part throw limit is effectively multiplied by the number of partitions. A table with 500 hourly partitions can hold up to 1.5 million unmerged parts before any single partition throws. At that scale, merge throughput cannot keep up. Daily partitioning (toYYYYMMDD) and especially hourly partitioning are leading causes of too-many-parts incidents. Monthly partitioning (toYYYYMM) is the safer default for most time-series tables.
Insufficient merge throughput. The background merge pool is controlled by background_pool_size (default: 16 threads) and background_merges_mutations_concurrency_ratio (default: 2), which together allow up to 32 concurrent merge/mutation tasks. On servers with many tables or heavy mutation workloads, the pool can become fully saturated, allowing part counts to grow unchecked.
Long-running queries holding inactive parts. After a merge completes, the source parts become inactive and are eventually deleted. Long-running queries can hold references to inactive parts, delaying deletion. When inactive part counts grow, inserts are slowed or blocked if inactive_parts_to_delay_insert or inactive_parts_to_throw_insert are set (both default to 0, meaning this check is disabled unless explicitly configured).
Large mutations blocking merge resources. ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE mutations consume merge pool threads. A wide-scope mutation on a large table can saturate the pool and prevent normal part merging for the duration of the mutation.
Replication lag. For ReplicatedMergeTree tables, merges are coordinated through ClickHouse Keeper (or ZooKeeper). If a replica falls behind or Keeper is under load, merge tasks queue up in system.replication_queue without executing, allowing parts to accumulate on lagging replicas.

How to Fix

1. Batch inserts more aggressively

Target at least 10,000 rows per insert, and ideally 100,000 rows or 10 MB of data. Fewer, larger inserts directly reduce the part creation rate. This is the highest-leverage fix and should always be the first change applied.

2. Enable async inserts for high-frequency write workloads

If your application sends many small inserts and cannot be easily re-batched, async_insert = 1 tells ClickHouse to buffer incoming data server-side and flush it as a single part when the buffer reaches async_insert_max_data_size (default: 100 MiB) or async_insert_busy_timeout_ms (default: 200 ms self-hosted, 1000 ms on ClickHouse Cloud). Since ClickHouse 24.2, adaptive timeouts are enabled by default (async_insert_use_adaptive_busy_timeout = 1), ranging from 50 ms to 200 ms based on incoming rate.

SET async_insert = 1;
SET wait_for_async_insert = 1;
SET async_insert_max_data_size = 104857600;  -- 100 MiB

Keep wait_for_async_insert = 1 (the default). Setting it to 0 means data buffered in memory is lost if the server crashes before the flush completes.

Note: ClickHouse Cloud enables async_insert = 1 by default for all users. Self-hosted ClickHouse has it disabled by default.

Also note that async inserts do not fully insulate against over-partitioning. A single buffer flush that touches multiple partition key values still creates one part per partition, so a high-cardinality partitioning key degrades async insert efficiency as well.

3. Coarsen the partition key

Switching from daily to monthly partitioning, or removing a high-cardinality secondary partition dimension, can eliminate the partition-multiplication effect entirely. This requires creating a new table with the revised schema and migrating data.

-- Instead of: PARTITION BY toYYYYMMDD(event_time)
-- Use:        PARTITION BY toYYYYMM(event_time)

Aim for a total number of distinct partitions in the low hundreds at most. Thousands of distinct partition values is a clear indicator of over-partitioning.

4. Increase the background merge pool

If CPU and I/O headroom exists, increase background_pool_size in config.xml or config.d/:

<clickhouse>
    <background_pool_size>32</background_pool_size>
</clickhouse>

This can be changed at runtime (the change takes effect without restart), but reducing it below the current value requires a restart.

Also consider increasing the maximum part size eligible for merging to help clear a large backlog:

ALTER TABLE my_database.my_table
MODIFY SETTING max_bytes_to_merge_at_max_space_in_pool = 322122547200;  -- 300 GiB

5. Force a targeted merge as an emergency measure

If a specific partition has accumulated too many parts and inserts are already failing, OPTIMIZE TABLE ... PARTITION ... FINAL forces an immediate full merge of all parts in that partition:

OPTIMIZE TABLE my_database.my_table PARTITION '202501' FINAL;

Use this with caution. OPTIMIZE TABLE ... FINAL merges all parts in a partition into a single part, which is CPU- and disk-intensive. It blocks other merges for the affected table during its execution and should be treated as an emergency intervention, not a routine operation.

6. Temporarily raise thresholds during a backlog catch-up

Raising parts_to_throw_insert gives the background merge pool time to catch up without rejecting new inserts. This is a stopgap only and does not address the root cause:

ALTER TABLE my_database.my_table
MODIFY SETTING
    parts_to_delay_insert = 2000,
    parts_to_throw_insert = 5000;

Once the backlog clears and the root cause is fixed, revert these settings.

Root-Cause Analysis

Use the following queries to identify which tables and partitions are affected and why.

Find the worst-offending partitions:

SELECT
    database,
    table,
    partition,
    count() AS part_count,
    sum(rows) AS total_rows,
    formatReadableSize(sum(data_compressed_bytes)) AS total_compressed
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
HAVING part_count > 50
ORDER BY part_count DESC
LIMIT 20;

Check whether the merge pool is fully saturated:

SELECT
    metric,
    value
FROM system.metrics
WHERE metric IN (
    'BackgroundMergesAndMutationsPoolTask',
    'BackgroundMergesAndMutationsPoolSize'
);

When BackgroundMergesAndMutationsPoolTask equals BackgroundMergesAndMutationsPoolSize, the pool is at capacity and new merges are queuing.

View currently running merges:

SELECT
    database,
    table,
    partition_id,
    round(progress * 100, 1) AS progress_pct,
    elapsed,
    num_parts,
    is_mutation,
    formatReadableSize(total_size_bytes_compressed) AS compressed_size
FROM system.merges
ORDER BY elapsed DESC;

A merge that has been running for many minutes with the same progress value indicates a stuck or very large merge consuming pool resources.

Diagnose insert batch patterns from query log:

SELECT
    tables[1] AS table,
    count() AS insert_count,
    round(avg(written_rows)) AS avg_rows_per_insert,
    min(written_rows) AS min_rows,
    max(written_rows) AS max_rows
FROM system.query_log
WHERE event_date >= today()
  AND type = 'QueryFinish'
  AND query_kind = 'Insert'
GROUP BY table
ORDER BY insert_count DESC
LIMIT 20;

avg_rows_per_insert values below 1,000 combined with high insert_count is the classic signature of a too-small batching problem.

Count new parts created per table in the last hour:

SELECT
    table,
    count() AS new_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(size_in_bytes)) AS total_size
FROM system.part_log
WHERE event_type = 'NewPart'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY table
ORDER BY new_parts DESC;

Track the maximum parts-per-partition metric over time:

SELECT
    toStartOfMinute(event_time) AS ts,
    max(value) AS max_parts_per_partition
FROM system.asynchronous_metric_log
WHERE metric = 'MaxPartCountForPartition'
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY ts
ORDER BY ts ASC;

This query shows when part accumulation began and how fast it is growing. MaxPartCountForPartition from system.asynchronous_metrics is also the right metric to use for alerting.

For replicated tables, check for stuck merge tasks:

SELECT
    database,
    table,
    type,
    num_tries,
    last_attempt_time,
    last_exception
FROM system.replication_queue
WHERE type = 'MERGE_PARTS'
ORDER BY num_tries DESC
LIMIT 20;

High num_tries values or entries with last_exception populated indicate merges that are failing to execute on this replica.

Check the level distribution of parts (merge depth):

SELECT
    database,
    table,
    partition,
    level,
    count() AS part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition, level
ORDER BY database, table, partition, level ASC;

Parts at level = 0 were created directly by an INSERT and have never been merged. A partition with many level = 0 parts is not being merged. Higher levels indicate healthy merge activity.

Preventive Measures

Target 10,000 to 100,000 rows per INSERT statement or 1 MB to 100 MB of uncompressed data per insert.
Keep the number of distinct partition values in the low hundreds. Avoid partitioning by tenant ID, user ID, or any other high-cardinality field unless combined with a lower-cardinality key.
Prefer toYYYYMM over toYYYYMMDD for time-series tables with frequent small inserts.
Alert on MaxPartCountForPartition from system.asynchronous_metrics. Set a warning threshold at 300 and a critical threshold at 1,000 (on ClickHouse 23.6+) to catch accumulation before inserts are rejected.
Use system.part_log to track the NewPart event rate over time and correlate spikes with deployment or traffic events.
On self-hosted deployments, enable async_insert at the user profile level for applications that cannot batch client-side.
Keep background_pool_size proportional to available CPU cores. A 32-core server running heavy ingest workloads may benefit from background_pool_size = 32 or higher.
Avoid wide-scope mutations during peak ingestion hours. Schedule ALTER TABLE ... UPDATE/DELETE operations during low-traffic windows.

Resolve Too Many Parts Automatically with Pulse

Pulse monitors MaxPartCountForPartition, insert throttling events, and merge pool saturation in real time, and alerts before part counts reach rejection thresholds. Pulse surfaces the specific tables and partitions accumulating parts, identifies whether the cause is insert batch size, over-partitioning, or a saturated merge pool, and recommends the correct remediation step. Rather than waiting for inserts to fail, Pulse gives you early warning and actionable context so you can fix the root cause on your schedule.

Frequently Asked Questions

Q: The error says "N parts" but my parts_to_throw_insert is set to 3000. Why am I seeing this at a lower count?
A: Check your ClickHouse version. Before 23.6, the defaults were 150 (delay) and 300 (throw). If you are on an older version, or if the table was created with explicit settings inherited from an older configuration, the lower pre-23.6 thresholds may still apply. Query SELECT engine_full FROM system.tables WHERE name = 'your_table' to see the table's active settings.

Q: Can I just raise parts_to_throw_insert to a very high value to stop the errors?
A: This stops the error but makes the underlying problem worse. More active parts means slower queries (each query must open and read more files), higher memory usage during queries, and more work for the background merge pool. The correct fix is to reduce the part creation rate. Raise the threshold only as a temporary measure while you fix the insert pattern.

Q: My table has only one partition but still hits too many parts. How?
A: A single-partition table still accumulates parts. If your insert rate is high enough, even one partition can exceed 3,000 active parts. This is a pure insert-rate vs. merge-rate problem. Increase batch size, increase background_pool_size, or enable async_insert.

Q: Does async insert eliminate the too-many-parts problem?
A: It reduces it significantly for workloads that send many small inserts, because the server buffers data and creates fewer, larger parts. But it does not eliminate the problem if the table is over-partitioned. A single async buffer flush that writes to 500 partitions still creates 500 parts. Fix the partitioning key alongside enabling async inserts.

Q: Is OPTIMIZE TABLE FINAL safe to run on a production table?
A: It is safe in the sense that it does not corrupt data, but it is expensive. It merges all parts in each partition into a single large part, consuming significant CPU, disk read/write bandwidth, and memory. On large tables it can run for hours and compete with normal merge activity. Use OPTIMIZE TABLE ... PARTITION 'id' FINAL to target only the affected partition, and run it during a low-traffic window.

Q: How do I know if the merge pool is the bottleneck vs. the insert rate?
A: Query system.metrics for BackgroundMergesAndMutationsPoolTask and BackgroundMergesAndMutationsPoolSize. If the task count equals the pool size, the pool is saturated and adding merge capacity (larger background_pool_size) will help. If the pool is not saturated but parts are still accumulating, the bottleneck is on the insert side: inserts are creating parts faster than the (unsaturated) pool chooses to merge them, typically because many small parts are being created across many partitions simultaneously.