ClickHouse S3 Cache: Configuration and Performance

A local cache in front of an S3 disk can change a ClickHouse cluster's performance profile from "S3 is too slow for interactive queries" to "S3 is fine for the working set." This article walks through the cache configuration, shows measured performance against EBS and uncached S3, and covers the migration patterns and startup tuning you need when moving real data to S3.

Storage Configuration

The cache is itself a disk type that wraps an underlying S3 disk. A complete configuration with a tiered policy:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3disk>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/mybucket/test/s3cached/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>
            </s3disk>
            <cache>
                <type>cache</type>
                <disk>s3disk</disk>
                <path>/var/lib/clickhouse/disks/s3_cache/</path>
                <max_size>50Gi</max_size>
            </cache>
        </disks>
        <policies>
            <s3tiered>
                <volumes>
                    <default>
                        <disk>default</disk>
                        <max_data_part_size_bytes>50000000000</max_data_part_size_bytes>
                    </default>
                    <s3cached>
                        <disk>cache</disk>
                    </s3cached>
                </volumes>
            </s3tiered>
        </policies>
    </storage_configuration>
</clickhouse>

Key fields:

  • path is the local directory the cache uses. Place it on fast local storage (NVMe).
  • max_size is the cache budget. Beyond this, evictions occur on a least-recently-used basis.
  • The cache is configured per disk, so different S3 disks can have different cache budgets.
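Once the server has loaded the configuration, a quick check from SQL can confirm that the disks and the policy are visible. This is a minimal sketch that assumes the disk and policy names used in the configuration above (s3disk, cache, s3tiered); the exact value shown in the type column varies between ClickHouse versions.

```sql
-- Disks known to the server; expect default, s3disk and cache.
SELECT name, type, path
FROM system.disks
WHERE name IN ('default', 's3disk', 'cache');

-- Volumes of the tiered policy, in priority order.
-- max_data_part_size = 0 means "no limit" for that volume.
SELECT
    policy_name,
    volume_name,
    disks,
    formatReadableSize(max_data_part_size) AS max_part_size
FROM system.storage_policies
WHERE policy_name = 's3tiered';
```

If either query returns no rows, the configuration was probably not picked up and the server log is the next place to look.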

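Tables that use the s3tiered policy place new parts on the default (local) volume. Parts move to the s3cached volume when a TTL rule or an explicit MOVE applies to them, and any part larger than max_data_part_size_bytes (50 GB here) is written to s3cached directly.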
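As an illustration, a table can opt into the policy at creation time. The table name, columns, and the 30-day threshold below are hypothetical and not taken from the benchmark; only the policy and volume names come from the configuration above.

```sql
-- Hypothetical example table: new parts land on the local volume,
-- and parts older than 30 days become eligible to move to S3.
CREATE TABLE events_tiered
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 30 DAY TO VOLUME 's3cached'
SETTINGS storage_policy = 's3tiered';
```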

Measured Performance

Comparing scan throughput in a controlled benchmark:

Configuration                     Throughput
Local EBS                         87.42 million rows/s
S3, first access (cold cache)     37.59 million rows/s
S3, second access (warm cache)    115.27 million rows/s

The first S3 query is the slowest because data flows from S3 over the network. Once cached locally, subsequent queries can outperform plain EBS because the cache lives on NVMe instance storage. This is the pattern that makes S3 viable for interactive analytics: pay the slow first read, then enjoy fast repeats.
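One way to reproduce a cold-versus-warm comparison on your own data is sketched below. It assumes the article's example table big_table already lives on the cached S3 volume, that query logging is enabled, and that dropping the cache is acceptable (it is disruptive on a production server). The figures you get will depend on hardware and data, and will not necessarily match the table above.

```sql
-- Empty the filesystem cache so the next read is cold.
SYSTEM DROP FILESYSTEM CACHE;

-- Cold read: NOT ignore(*) forces every column to be read.
SELECT count() FROM big_table WHERE NOT ignore(*);

-- Warm read: the same scan, now served from the local cache.
SELECT count() FROM big_table WHERE NOT ignore(*);

-- Make sure both queries are visible in the query log.
SYSTEM FLUSH LOGS;

-- Compare the two most recent runs (newest first).
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    round(read_rows / (query_duration_ms / 1000)) AS rows_per_sec
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE 'SELECT count() FROM big_table WHERE NOT ignore(*)%'
ORDER BY event_time DESC
LIMIT 2;
```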

Migrating an Existing Table to S3

Two strategies are common when moving data from local storage to S3-backed storage.

Heavy: TTL Rewrite All At Once

Add a TTL rule whose expression is the row's own date. Any date in the past has already expired, so every existing part becomes eligible to move to the S3 volume:

ALTER TABLE big_table
MODIFY TTL toDate(event_date) TO VOLUME 's3cached';

ClickHouse begins moving parts immediately. On the reference dataset the move completed in about 140 seconds, but it produced a burst of S3 PUTs and saturated the network. Use this only when the cluster has capacity to spare.
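Progress of a migration like this can be followed from system.parts. The query below is a sketch that assumes the table is big_table in the current database and that the disks are named as in the configuration above:

```sql
-- Active parts and bytes per disk; rows shift from 'default' to 'cache'
-- as the move proceeds.
SELECT
    disk_name,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'big_table'
  AND active
GROUP BY disk_name
ORDER BY disk_name;
```

The migration is finished when no active parts remain on the default disk.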

Gentle: Partition by Partition

Move one partition at a time, optionally optimizing first:

ALTER TABLE big_table MOVE PARTITION 202401 TO VOLUME 's3cached';
ALTER TABLE big_table MOVE PARTITION 202402 TO VOLUME 's3cached';
-- ...continue at your own pace

This spreads the network load and lets you pause between partitions. It is the right approach for clusters that serve production traffic during the migration.
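For tables with many partitions, the statements can be generated rather than typed. The sketch below assumes big_table in the current database and a local disk named default; it only produces the statements as text, so you can review them and run them at whatever pace suits the cluster.

```sql
-- One MOVE statement per partition that still has parts on local disk.
SELECT DISTINCT
    partition,
    concat(
        'ALTER TABLE big_table MOVE PARTITION ID ''',
        partition_id,
        ''' TO VOLUME ''s3cached'';'
    ) AS stmt
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'big_table'
  AND active
  AND disk_name = 'default'
ORDER BY partition;

-- Optional: merge a partition into fewer, larger parts before moving it,
-- which reduces the number of objects uploaded to S3.
OPTIMIZE TABLE big_table PARTITION 202401 FINAL;
ALTER TABLE big_table MOVE PARTITION 202401 TO VOLUME 's3cached';
```

Note that OPTIMIZE ... FINAL rewrites the partition and is itself I/O-heavy, so it is worth scheduling with the same care as the move.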

Startup Performance Tuning

When a table has many parts on S3, server startup can become slow because ClickHouse must read each part's metadata from S3. With default settings and 1,000 parts on S3, a startup took 4 minutes 26 seconds. With max_part_loading_threads increased to 256, the same startup completed in 8.1 seconds.

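max_part_loading_threads is a MergeTree-level setting, not a per-user profile setting. Set it inside the merge_tree section of config.xml (or a drop-in under config.d/). Recent ClickHouse releases mark this setting as obsolete, so confirm in the documentation for your version that it still takes effect:

<clickhouse>
  <merge_tree>
    <max_part_loading_threads>256</max_part_loading_threads>
  </merge_tree>
</clickhouse>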

The right value depends on the number of parts, available CPU, and S3 request limits. Start at 64 and increase if startup is still slow.
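Before tuning, it helps to know how many parts the server actually has to load from S3 and what value is currently in effect. This sketch assumes the cache disk is named cache, as in the configuration above, and that your version exposes the setting through system.merge_tree_settings (it may be absent or marked obsolete on newer releases).

```sql
-- Tables with the most active parts on the cached S3 disk.
SELECT
    database,
    table,
    count() AS active_parts
FROM system.parts
WHERE active
  AND disk_name = 'cache'
GROUP BY database, table
ORDER BY active_parts DESC;

-- Current value of the setting, and whether it was changed from the default.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name = 'max_part_loading_threads';
```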

Sizing the Cache

The cache budget should match the size of the working set you expect to be repeatedly queried, not the size of the table. For a 10 TB table where most queries hit the last 100 GB, a 200 GB cache will give near-local performance for typical queries. Oversizing the cache wastes local disk; undersizing causes eviction churn that defeats the purpose.
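A rough estimate of the working set can be taken from system.parts. The sketch below assumes that "recent" means the last 30 days (an illustrative threshold, not a recommendation) and that the table's partition key includes a Date column, which is what populates max_date; bytes_on_disk is the compressed size, which is what the cache stores.

```sql
-- Compressed size of parts that contain data from the last 30 days,
-- compared with the whole table.
SELECT
    formatReadableSize(sumIf(bytes_on_disk, max_date >= today() - 30)) AS recent_size,
    formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'big_table'
  AND active;
```

Leaving headroom above recent_size, as in the 100 GB working set and 200 GB cache example, gives the cache room to absorb occasional reads of older data without evicting the hot set.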

Inspect cache configuration via system.filesystem_cache_settings:

SELECT
    cache_name,
    path,
    formatReadableSize(max_size)     AS max,
    formatReadableSize(current_size) AS current,
    current_elements_num
FROM system.filesystem_cache_settings;

For runtime stats, system.events exposes cumulative counters such as CachedReadBufferReadFromCacheBytes and CachedReadBufferReadFromSourceBytes, and system.metrics reports current gauges such as FilesystemCacheSize.
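The two event counters can be combined into an approximate byte hit ratio. This is a sketch: the counters are cumulative since server start and cover all caches on the server, so the ratio describes the whole uptime rather than a single query.

```sql
-- Share of cached-disk reads that were served from the local cache.
SELECT
    sumIf(value, event = 'CachedReadBufferReadFromCacheBytes')  AS from_cache,
    sumIf(value, event = 'CachedReadBufferReadFromSourceBytes') AS from_source,
    round(from_cache / (from_cache + from_source), 3)           AS hit_ratio
FROM system.events
WHERE event IN (
    'CachedReadBufferReadFromCacheBytes',
    'CachedReadBufferReadFromSourceBytes'
);
```

A ratio that stays low while the cache is full suggests the working set is larger than max_size.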

Common Pitfalls

  • Placing the cache directory on the same disk as the OS root. Cache evictions can cause IO contention with system processes.
  • Setting max_size too small. A cache that thrashes is worse than no cache because of the bookkeeping overhead.
  • Forgetting that the cache is per-server. In a multi-replica setup, each replica warms its own cache independently.
  • Triggering the heavy TTL migration during peak hours. The S3 write burst will saturate the network and impact queries.
  • Leaving max_part_loading_threads at the default value on a cluster with thousands of S3 parts. Startup may take longer than the failure-recovery SLA you promised.

Frequently Asked Questions

Q: Does the cache survive a restart? A: Yes, the cache is persistent. After restart ClickHouse rehydrates cache metadata, so warm queries stay warm.

Q: Is the cache shared between tables? A: Yes, a cache disk is shared by every MergeTree table whose storage policy includes it. The eviction policy is LRU across the shared pool.

Q: Can I disable the cache for specific queries? A: Yes, with the enable_filesystem_cache=0 query setting. Useful for one-off scans you do not want to populate the cache with.
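For example, a one-off full scan of the article's example table could bypass the cache like this (a sketch; the query itself is illustrative):

```sql
-- Full scan that neither reads from nor populates the filesystem cache.
SELECT count()
FROM big_table
WHERE NOT ignore(*)
SETTINGS enable_filesystem_cache = 0;
```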

Q: How does the cache interact with mutations? A: Mutations write new parts, which become candidates for caching on next read. Old part data in the cache is invalidated when the part is replaced.

Q: Is the cache safe with multiple ClickHouse processes? A: Each process must use its own cache directory. Sharing the same path across processes is unsupported and can corrupt the cache.
