ClickHouse S3Disk: Storage Policy and Metadata Restore

The S3Disk type in ClickHouse stores MergeTree parts in an S3-compatible bucket while keeping small metadata files on local disk. The metadata maps logical part files to S3 object keys, so the local filesystem stays compact while data lives in the bucket. This article covers the storage policy configuration, the two settings that have the largest operational impact (skip_access_check and send_metadata), and how to recover when local metadata is lost.

Core Configuration

A working S3Disk definition that includes the settings most users care about:

<clickhouse>
  <storage_configuration>
    <disks>
      <s3>
        <type>s3</type>
        <endpoint>https://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
        <access_key_id>ACCESS_KEY_ID</access_key_id>
        <secret_access_key>SECRET_ACCESS_KEY</secret_access_key>
        <skip_access_check>true</skip_access_check>
        <send_metadata>true</send_metadata>
      </s3>
    </disks>
  </storage_configuration>
</clickhouse>

Define a storage policy that uses the disk. The policies element sits inside storage_configuration, next to disks:

<policies>
  <s3_only>
    <volumes>
      <main>
        <disk>s3</disk>
      </main>
    </volumes>
  </s3_only>
</policies>

A table that uses the policy:

CREATE TABLE events (ts DateTime, payload String)
ENGINE = MergeTree()
ORDER BY ts
SETTINGS storage_policy = 's3_only';

For AWS, prefer use_environment_credentials over inline keys so the IAM role attached to the instance or pod handles authentication.
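
As a minimal sketch of that setup, the function below writes a disk definition that relies on use_environment_credentials instead of inline keys. The function name write_s3_disk_config and the output file location are illustrative assumptions; the disk name and endpoint layout follow the example above.

```shell
#!/bin/sh
# Hypothetical helper: write an S3 disk definition without inline keys.
# $1 = output file (for example a file under /etc/clickhouse-server/config.d/)
# $2 = endpoint URL, including bucket, prefix, and trailing slash
write_s3_disk_config() {
    out_file=$1
    endpoint=$2
    cat > "$out_file" <<EOF
<clickhouse>
  <storage_configuration>
    <disks>
      <s3>
        <type>s3</type>
        <endpoint>${endpoint}</endpoint>
        <use_environment_credentials>true</use_environment_credentials>
      </s3>
    </disks>
  </storage_configuration>
</clickhouse>
EOF
}
```

A call might look like write_s3_disk_config /etc/clickhouse-server/config.d/s3_disk.xml "https://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/", where the file name s3_disk.xml is an assumption. The generated file contains no access_key_id or secret_access_key, so nothing sensitive ends up in the config.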

skip_access_check

By default, ClickHouse verifies at startup that it can write to and delete from the bucket. Setting skip_access_check=true bypasses this probe. Two situations call for it:

Read-only credentials. If the IAM policy grants only s3:GetObject and s3:ListBucket, the write probe fails and the disk fails to load. With skip_access_check=true the disk loads and serves read-only queries.
Slow bucket policies. On buckets with complex policies the probe can take noticeable time at startup.

When using a read-only disk, also set prefer_not_to_merge on the volume so that background merges do not attempt writes:

<volumes>
  <main>
    <disk>s3</disk>
    <prefer_not_to_merge>true</prefer_not_to_merge>
  </main>
</volumes>

send_metadata

send_metadata=true causes ClickHouse to embed the local metadata path inside each S3 object as user metadata. This costs a small extra header on every PUT but gives you a way to reconstruct the local metadata directory from the bucket alone.

Enable this from the start if you ever want the option to recover from a wiped local disk. If you enable it later, only objects written afterwards carry the metadata, so a restore cannot rebuild the older parts.

Metadata Recovery

When local metadata is lost (a wiped disk, a corrupted volume, or a new replica), ClickHouse can rebuild it from the bucket if send_metadata was active when the objects were written.

Standard Restoration

Create an empty restore file in the disk metadata directory:

touch /var/lib/clickhouse/disks/s3/restore

Restart ClickHouse. The server reads object metadata from the bucket, recreates the local mapping files, and the tables come back online. This path requires read-write access to the bucket because metadata files in S3 are updated during the restore.

Custom Restoration

To restore into a different bucket or path, populate the restore file with the source configuration:

source_bucket=s3disk
source_path=vol1/

ClickHouse reads objects from source_bucket/source_path and copies them into the disk's configured endpoint. This needs read-only access to the source and read-write access to the destination. It is useful for promoting a backup bucket to live or for migrating between buckets.
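
Both restoration modes come down to the contents of one file, so the step can be scripted. The sketch below is one possible helper, not part of ClickHouse: the name write_restore_file and its argument order are assumptions, and it writes only the two keys shown above.

```shell
#!/bin/sh
# Hypothetical helper: create the restore file for an S3 disk.
# $1 = local metadata directory of the disk (e.g. /var/lib/clickhouse/disks/s3)
# $2 = source bucket (optional)
# $3 = source path inside that bucket (optional)
# With only $1 given, the file is left empty (standard restoration).
write_restore_file() {
    metadata_dir=$1
    source_bucket=${2:-}
    source_path=${3:-}
    restore_file="$metadata_dir/restore"

    # Create or truncate the file; an empty file requests a standard restore.
    : > "$restore_file" || return 1

    if [ -n "$source_bucket" ]; then
        printf 'source_bucket=%s\n' "$source_bucket" >> "$restore_file"
    fi
    if [ -n "$source_path" ]; then
        printf 'source_path=%s\n' "$source_path" >> "$restore_file"
    fi
}
```

For example, write_restore_file /var/lib/clickhouse/disks/s3 produces the empty file used for a standard restore, while write_restore_file /var/lib/clickhouse/disks/s3 s3disk vol1/ produces the two-line file shown above. The file must be readable by the user the server runs as, and the server still has to be restarted afterwards.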

Validating the Setup

After defining the disk, confirm ClickHouse loaded it:

SELECT name, type, path, free_space, total_space
FROM system.disks
WHERE name = 's3';

And that the policy is visible:

SELECT policy_name, volume_name, disks
FROM system.storage_policies
WHERE policy_name = 's3_only';

A round-trip test:

CREATE TABLE t (n UInt64)
ENGINE = MergeTree() ORDER BY n
SETTINGS storage_policy = 's3_only';

INSERT INTO t SELECT * FROM numbers(1000000);
SELECT count(), sum(n) FROM t;
DROP TABLE t;

If any step fails, the ClickHouse log shows the underlying error returned by S3 (HTTP 403, 404, etc.), which usually points at IAM or endpoint configuration.
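
To narrow down a failed step, it can help to filter the log for S3 error names. The sketch below assumes the log lines include the standard S3 error code names (AccessDenied, NoSuchBucket, and so on); the function name find_s3_errors is hypothetical, and the exact wording of log lines varies by ClickHouse version.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention common S3 error codes.
# $1 = log file (the server error log is commonly
#      /var/log/clickhouse-server/clickhouse-server.err.log)
find_s3_errors() {
    log_file=$1
    # AccessDenied, InvalidAccessKeyId, SignatureDoesNotMatch:
    #   credentials or IAM policy (typically HTTP 403).
    # NoSuchBucket, NoSuchKey:
    #   wrong bucket, prefix, or endpoint (typically HTTP 404).
    grep -E 'AccessDenied|InvalidAccessKeyId|SignatureDoesNotMatch|NoSuchBucket|NoSuchKey' "$log_file"
}
```

The function exits with a non-zero status when nothing matches, which is grep's normal behavior and makes it easy to use in a conditional.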

Common Pitfalls

Setting send_metadata=true only after writing data and expecting a full restore to work. Only objects written while the flag was active carry the metadata.
Using skip_access_check=true to mask real permission problems. The write probe exists for a reason; if it fails, fix the IAM policy unless you genuinely want a read-only disk.
Restoring into the same bucket the live cluster is using. The restore process writes metadata back to S3 and can race with active reads on a different replica. Restore into an isolated environment first.
Wrong endpoint format. AWS S3 expects https://s3.<region>.amazonaws.com/<bucket>/<prefix>/. The trailing slash matters.
Putting credentials in the disk config and committing them to source control. Use IAM roles or external secret stores.
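
The endpoint pitfall is easy to catch before the config is deployed. The sketch below checks a URL against the AWS form given above; the function name check_s3_endpoint is hypothetical, and the pattern deliberately covers only that AWS form, so endpoints of other S3-compatible stores will not match.

```shell
#!/bin/sh
# Hypothetical helper: check that an endpoint has the form
#   https://s3.<region>.amazonaws.com/<bucket>/<prefix>/
# Returns 0 when the form matches, 1 otherwise.
check_s3_endpoint() {
    printf '%s\n' "$1" |
        grep -Eq '^https://s3\.[a-z0-9-]+\.amazonaws\.com/[^/]+/([^/]+/)+$'
}
```

With this pattern, https://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/ passes, while the same URL without the trailing slash is rejected.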

Frequently Asked Questions

Q: Do I need to enable send_metadata? A: Only if you want to be able to rebuild local metadata from the bucket. The cost is a small per-object overhead. Most operators turn it on for any production cluster.

Q: Can the same bucket back multiple ClickHouse clusters? A: Yes, but give each cluster its own prefix. Sharing a prefix across clusters causes overlapping metadata and inconsistent reads.

Q: How do I size the local disk that holds S3 metadata? A: Metadata files are small but proportional to the number of parts. Plan for tens of KB per part. For most clusters a few hundred GB of local space is plenty.

Q: Can I move data between two S3 disks within the same cluster? A: Yes, via ALTER TABLE ... MOVE PARTITION ... TO DISK 'other_s3', provided the target disk belongs to the table's storage policy. The move copies objects between buckets, so its duration depends on S3 throughput.

Q: What happens if I delete the local metadata while the table is live? A: The table breaks immediately. If send_metadata was enabled when the data was written, place a restore file in the disk metadata directory and restart to rebuild it. Without send_metadata, the data in the bucket becomes orphaned and the table cannot be recovered.