ClickHouse S3 Object Storage Merge Performance

Background merges slow to a crawl, part counts climb toward the Too many parts threshold, and query latency becomes unpredictable. These are the hallmarks of a ClickHouse deployment where merge operations are contending with S3 latency. Unlike local NVMe, S3 adds per-request overhead of tens to hundreds of milliseconds, and the default merge scheduler was designed for local disks. Understanding what drives merge cost on object storage — and how to reduce it — is essential for stable production deployments.

What This Error Means

ClickHouse MergeTree merges work by reading all source parts, merging the rows, and writing a new result part. On local storage this is a sequential read/write bound by disk bandwidth. On S3 each read and write is an HTTP request that carries its own connection and request overhead. A merge of ten 100 MB parts that takes a few seconds on NVMe may take several minutes on S3 because of this per-operation latency.

The downstream effects are:

  • Part counts accumulate faster than merges can reduce them, eventually triggering Too many parts insert throttling or rejection.
  • Replication lag grows because replica part fetches are also bound by S3 bandwidth.
  • Query latency becomes inconsistent: parts that have been cached on local SSD respond in milliseconds, while S3 reads hit the 99.99th-percentile latency of ~500 ms per range GET.
  • Background merge threads spend a disproportionate share of their time waiting on network I/O rather than doing compute work, reducing effective parallelism.

Common Causes

  1. All data lives on S3 with no local cache. Without a filesystem cache disk sitting in front of S3, every merge reads cold objects from the bucket. Repeated access to the same granules pays the full S3 latency each time.

  2. prefer_not_to_merge is enabled on a cold volume. The ClickHouse documentation explicitly states that enabling prefer_not_to_merge = 1 on a volume causes performance degradation. Parts accumulate on that volume indefinitely because merges are suppressed.

  3. Zero-copy replication is enabled in an older ClickHouse version. allow_remote_fs_zero_copy_replication was true by default in ClickHouse 22.7 and earlier. It is experimental, carries known data-loss bugs (GitHub issues #39560, #45346, #33643), and was changed to false by default in 22.8. Clusters running it in production may see CORRUPTED_DATA errors, missing count.txt files in S3, and stalled merges caused by metadata synchronization failures in ZooKeeper/Keeper.

  4. Hot data lands directly on S3 at insert time. When the storage policy places new parts directly onto S3 — rather than a local hot volume — perform_ttl_move_on_insert triggers immediate S3 writes during inserts. This increases insert latency and creates small parts on S3 that are expensive to merge.

  5. S3 request timeouts and retries are too aggressive. The default request_timeout_ms is 5000 ms. Under load or high-latency conditions, requests hit this timeout and consume all retry_attempts (default 10) before failing. The resulting backpressure stalls background merge threads waiting on I/O.

  6. Background pool starvation. The default background_pool_size of 16 threads is shared across all tables. If many tables have S3-backed parts, a small number of slow merges can occupy all threads, blocking merges for other tables.

How to Fix

1. Disable prefer_not_to_merge on all volumes

Check your storage policy configuration and remove or set to 0 any volume that has this flag. There is no valid production reason to disable merges on a volume.

SELECT
    policy_name,
    volume_name,
    prefer_not_to_merge
FROM system.storage_policies
WHERE prefer_not_to_merge = 1;

If any rows appear, update the XML configuration to remove the setting and reload the configuration without restarting:

SYSTEM RELOAD CONFIG;

2. Add a filesystem cache in front of the S3 disk

A local SSD cache intercepts repeated reads to the same granules, reducing S3 GET requests to near zero for warm data. Configure a cache type disk that wraps your existing S3 disk:

<clickhouse>
  <storage_configuration>
    <disks>
      <s3>
        <type>s3</type>
        <endpoint>https://s3.us-east-1.amazonaws.com/my-bucket/data/</endpoint>
        <access_key_id>ACCESS_KEY_ID</access_key_id>
        <secret_access_key>SECRET_ACCESS_KEY</secret_access_key>
      </s3>
      <s3_cached>
        <type>cache</type>
        <disk>s3</disk>
        <path>/var/lib/clickhouse/s3_cache/</path>
        <max_size>107374182400</max_size>
        <cache_on_write_operations>true</cache_on_write_operations>
      </s3_cached>
    </disks>
    <policies>
      <s3_with_cache>
        <volumes>
          <main>
            <disk>s3_cached</disk>
          </main>
        </volumes>
      </s3_with_cache>
    </policies>
  </storage_configuration>
</clickhouse>

Set max_size to the available SSD capacity you want to dedicate to caching. Monitor cache effectiveness with:

SHOW FILESYSTEM CACHES;

SELECT
    cache_name,
    formatReadableSize(sum(file_size)) AS cached_size,
    count() AS segment_count
FROM system.filesystem_cache
GROUP BY cache_name;

3. Keep recent data on a local hot volume, move to S3 only after merges complete

Structure your storage policy so new inserts land on local NVMe, and data moves to S3 after it has been fully merged and is no longer likely to participate in another merge. Use max_data_part_size_bytes on the local volume and TTL rules to control movement:

<policies>
  <hot_to_cold>
    <volumes>
      <hot>
        <disk>local_nvme</disk>
        <max_data_part_size_bytes>5368709120</max_data_part_size_bytes>
      </hot>
      <cold>
        <disk>s3</disk>
      </cold>
    </volumes>
    <move_factor>0.2</move_factor>
  </hot_to_cold>
</policies>

With the corresponding TTL rule:

CREATE TABLE events
(
    event_date DateTime,
    user_id UInt64,
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold',
    event_date + INTERVAL 365 DAY DELETE
SETTINGS storage_policy = 'hot_to_cold';

This avoids the expensive read-from-S3 / write-back-to-S3 cycle during background merges on recent data.

4. Disable zero-copy replication if it is enabled

Check whether any table has zero-copy replication active:

SELECT name, value, changed
FROM system.replicated_merge_tree_settings
WHERE name LIKE '%zero_copy%';

If allow_remote_fs_zero_copy_replication is 1 and you are not running ClickHouse Cloud with SharedMergeTree, disable it. SharedMergeTree (ClickHouse Cloud only) is the production-grade replacement for zero-copy replication; it stores metadata in ClickHouse Keeper and avoids the known stability issues. For self-hosted clusters, set the flag to 0 in the server merge_tree config section and also at the table level for any existing tables:

<merge_tree>
    <allow_remote_fs_zero_copy_replication>false</allow_remote_fs_zero_copy_replication>
</merge_tree>
ALTER TABLE my_replicated_table
    MODIFY SETTING allow_remote_fs_zero_copy_replication = 0;

5. Reduce duplicate merge work on replicas using the single-replica merge threshold

When multiple replicas share an S3 bucket with allow_remote_fs_zero_copy_replication = 1 (experimental), set remote_fs_execute_merges_on_single_replica_time_threshold to a positive value so only one replica performs the merge immediately and others wait before independently merging the same parts:

ALTER TABLE my_table
    MODIFY SETTING remote_fs_execute_merges_on_single_replica_time_threshold = 10800;

This setting only has effect when zero-copy replication is enabled and shared storage is detected.

6. Tune S3 disk timeouts for high-latency conditions

If merges are frequently stalling with I/O errors, increase the S3 request timeout and reduce retry aggressiveness to avoid thread starvation:

<s3>
    <connect_timeout_ms>10000</connect_timeout_ms>
    <request_timeout_ms>30000</request_timeout_ms>
    <retry_attempts>5</retry_attempts>
    <single_read_retries>4</single_read_retries>
    <min_bytes_for_seek>1048576</min_bytes_for_seek>
</s3>

Increasing request_timeout_ms beyond the default 5000 ms allows large S3 transfers to complete before the timeout fires.

Root-Cause Analysis

Use these queries to identify which tier is causing problems before changing configuration.

Check part distribution across storage tiers:

SELECT
    disk_name,
    count() AS part_count,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
    sum(rows) AS total_rows
FROM system.parts
WHERE active = 1
  AND database = 'my_db'
  AND table = 'my_table'
GROUP BY disk_name
ORDER BY compressed_size DESC;

Monitor active background merges and their progress:

SELECT
    database,
    table,
    round(elapsed, 1) AS elapsed_s,
    progress,
    num_parts,
    result_part_name,
    formatReadableSize(total_size_bytes_compressed) AS compressed_size
FROM system.merges
ORDER BY elapsed DESC
LIMIT 20;

Merges with elapsed_s in the hundreds of seconds on S3-backed parts confirm S3 latency is the bottleneck.

Check active part moves triggered by TTL or move_factor:

SELECT
    database,
    table,
    part_name,
    formatReadableSize(part_size) AS size,
    target_disk_name,
    elapsed
FROM system.moves
ORDER BY elapsed DESC;

Inspect storage policy settings:

SELECT
    policy_name,
    volume_name,
    volume_priority,
    disks,
    move_factor,
    max_data_part_size,
    prefer_not_to_merge,
    perform_ttl_move_on_insert
FROM system.storage_policies
ORDER BY policy_name, volume_priority;

Check S3-related events and errors:

SELECT event, value
FROM system.events
WHERE event LIKE '%S3%' OR event LIKE '%RemoteFS%'
ORDER BY value DESC
LIMIT 30;

Look for S3WriteRequestsErrors, S3ReadRequestsErrors, and RemoteFSReadBytes growing faster than expected.

Preventive Measures

  • Use the hot_to_cold tiered storage pattern so merges on active data always run against local disk.
  • Size the SSD filesystem cache to cover at least 10-20% of the total dataset; adjust based on cache hit rates observed in system.filesystem_cache.
  • Never enable prefer_not_to_merge on any production volume.
  • Do not enable allow_remote_fs_zero_copy_replication on self-hosted clusters running ClickHouse 22.8 or later without first understanding the documented stability issues and testing thoroughly in staging.
  • Batch inserts to produce parts of at least 50-100 MB before they land on S3; small parts create a disproportionate number of S3 API calls per merged byte.
  • Use s3_plain_rewritable (available since 24.4) only for non-replicated tables such as system tables or single-node deployments; it does not support mutations or replication.
  • Monitor system.merges in production and alert when any single merge has been running for more than 30 minutes — this is a reliable early warning of S3 contention.

Resolve S3 Merge Performance Issues Automatically with Pulse

Pulse continuously monitors your ClickHouse background merge activity, S3 request error rates, and part accumulation trends. When merge throughput falls behind insert rate or S3 error counters spike, Pulse surfaces the root cause — whether it is a misconfigured prefer_not_to_merge, an undersized filesystem cache, or zero-copy replication instability — along with actionable remediation steps. Rather than discovering the problem when insert throttling starts, you get an alert with context the moment merge health degrades.

Frequently Asked Questions

Q: Is zero-copy replication (allow_remote_fs_zero_copy_replication) safe to use in production?
A: No. It is explicitly marked experimental in ClickHouse documentation and has documented data-loss and corruption issues tracked in multiple open GitHub issues. It was enabled by default only in ClickHouse 22.7 and earlier; since 22.8 the default is false. For production S3 replication on self-hosted clusters, use standard ReplicatedMergeTree with a hot-to-cold tiered storage policy. On ClickHouse Cloud, SharedMergeTree provides the production-grade equivalent.

Q: Why do merges complete quickly on local disk but take many minutes on S3?
A: Each read or write operation during a merge is an HTTP request to S3 with its own TCP connection and request latency. S3 99.99th-percentile read latency is roughly 500 ms, versus roughly 1 ms for local NVMe. A merge involving dozens of S3 GET and PUT requests multiplies that overhead. Keeping active/recent data on local storage and only storing cold, fully-merged data on S3 eliminates this bottleneck for the majority of merge work.

Q: What is prefer_not_to_merge and when should I use it?
A: prefer_not_to_merge is a volume-level setting that disables background merges for parts stored on that volume. The ClickHouse documentation states it causes performance degradation and there is effectively no production use case where it should be enabled. It was sometimes recommended in older guides to avoid expensive S3 re-merges on cold data; the correct solution is the hot-to-cold tiered storage pattern instead.

Q: How do I know whether the filesystem cache is actually being used?
A: Run SHOW FILESYSTEM CACHES; to see configured caches, then query system.filesystem_cache to see current size and segment count. To get per-query cache hit/miss detail, set enable_filesystem_cache_log = 1 at the session level and inspect system.filesystem_cache_log after running a representative query.

Q: What is the difference between s3, s3_plain, and s3_plain_rewritable disk types?
A: s3 is the standard type: metadata is stored locally, full MergeTree functionality is supported including merges, mutations, and replication. s3_plain is write-once with no local metadata — it does not support merges or mutations and is suitable for static datasets. s3_plain_rewritable (introduced in 24.4) stores metadata in S3 rather than locally, supports merges and inserts, but does not support mutations or replication, making it appropriate only for non-replicated single-node tables.

Q: Should I increase background_pool_size to speed up S3 merges?
A: Increasing the thread count (background_pool_size, default 16) gives more parallelism but each thread still blocks on S3 I/O. A better investment is reducing the per-merge S3 cost via local caching and the hot-to-cold tiering pattern. If after those changes merge backlog persists, then increasing background_pool_size can help at the cost of higher memory and CPU usage. Separately, background_move_pool_size (default 8) controls threads for TTL-triggered part moves between volumes and can be increased if TTL moves are a bottleneck.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.