ClickHouse supports object storage like Amazon S3 in two distinct ways: as a table engine that reads and writes files in a bucket, and as a disk that stores MergeTree parts behind a storage policy. Both modes are widely used in production, but they solve different problems. This guide focuses on the operational side of running ClickHouse against S3: configuration, performance characteristics, and the trade-offs that matter when designing a cluster.
Two Ways to Use S3
The S3 table engine and the s3() table function let you query files in a bucket directly. They are convenient for ingestion, ad hoc analysis, and integrating with data lakes. For background on this mode see the article on the ClickHouse S3 engine.
The s3 disk type, configured under storage_configuration, makes a bucket act as a backing store for a regular MergeTree table. From SQL the table looks the same as a local table, but parts live in S3 and the local filesystem holds only small metadata files. This is the right tool for tiered storage and for clusters where data volume exceeds local capacity.
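For contrast with the disk mode, here is a minimal sketch of the first mode using the s3() table function. The bucket URL, the Parquet file layout, and the ts column are placeholders assumed for illustration; credentials are omitted, and how they are resolved depends on your server configuration.

```sql
-- Hypothetical bucket and path: replace with your own.
-- Assumes the Parquet files contain a DateTime column named ts.
SELECT
    toDate(ts) AS day,
    count() AS events
FROM s3('https://BUCKET.s3.us-east-1.amazonaws.com/path/*.parquet', 'Parquet')
GROUP BY day
ORDER BY day;
```

Nothing is stored in ClickHouse here: every run reads the matching files from the bucket again.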
Configuring an S3 Disk
A minimal disk and storage policy definition:
<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/BUCKET/path/</endpoint>
                <use_environment_credentials>true</use_environment_credentials>
            </s3>
        </disks>
        <policies>
            <s3_only>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3_only>
        </policies>
    </storage_configuration>
</clickhouse>
Create a table that uses the policy:
CREATE TABLE events_s3 (id UInt64, ts DateTime, payload String)
ENGINE = MergeTree()
ORDER BY (ts, id)
SETTINGS storage_policy = 's3_only';
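To confirm that the server loaded the disk and the policy, you can query the system tables. This is a minimal sketch that assumes the names s3 and s3_only from the configuration above; the value reported in the type column varies between ClickHouse versions.

```sql
-- Disks known to the server; the s3 disk should be listed here.
SELECT name, path, type
FROM system.disks;

-- Volumes and disks that make up the s3_only policy.
SELECT policy_name, volume_name, disks
FROM system.storage_policies
WHERE policy_name = 's3_only';
```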
For AWS deployments, prefer IAM roles over static access keys. With use_environment_credentials=true ClickHouse picks up the credentials from the EC2 instance profile, EKS IRSA service account, or environment variables.
Tiered Storage Patterns
A common production layout keeps recent data on local NVMe and moves older parts to S3 with a move_factor or TTL rule:
<policies>
    <hot_cold>
        <volumes>
            <hot>
                <disk>default</disk>
                <max_data_part_size_bytes>50000000000</max_data_part_size_bytes>
            </hot>
            <cold>
                <disk>s3</disk>
            </cold>
        </volumes>
        <move_factor>0.2</move_factor>
    </hot_cold>
</policies>
A TTL rule on the table then moves parts to the cold volume once their rows are more than seven days old:
ALTER TABLE events
MODIFY TTL ts + INTERVAL 7 DAY TO VOLUME 'cold';
ClickHouse moves whole parts between volumes, so the granularity is the part, not the row. Pick a partition key that aligns with your aging policy.
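As an illustration of aligning the partition key with the aging policy, the sketch below defines a table partitioned by day and then checks which disk each partition's parts live on. The table name events_tiered and the daily partitioning are assumptions made for this example; daily partitions can become too numerous with long retention, so a coarser key may suit your data better.

```sql
-- Hypothetical table: each part covers a single day, so a part
-- becomes eligible for the move as a whole once that day ages out.
CREATE TABLE events_tiered (id UInt64, ts DateTime, payload String)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (ts, id)
TTL ts + INTERVAL 7 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold';

-- Where do the active parts of each partition currently live?
SELECT
    partition,
    disk_name,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = currentDatabase() AND table = 'events_tiered' AND active
GROUP BY partition, disk_name
ORDER BY partition;
```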
Performance Characteristics
S3 request latency is roughly two orders of magnitude higher than local SSD, and throughput per request is bounded. The implications:
- Point queries that hit many small files perform poorly. Keep parts large by tuning merges and avoiding tiny inserts.
- Range scans benefit from prefetch and parallel reads. Increase max_threads and check s3_max_get_rps and s3_max_get_burst if you hit throttling.
- The first scan over an S3-resident part is slow; subsequent scans hit the page cache. Adding a local cache disk in front of S3 (see the S3 cache configuration article) significantly improves repeat queries.
- Mutations that touch non-indexed columns rewrite full columns to S3, which causes large data transfer. See the article on S3 mutation behavior.
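Before tuning, it helps to see what the settings named above currently resolve to. This is a sketch only: which of these settings appear in system.settings depends on your ClickHouse version, and the max_threads value in the second statement is an arbitrary illustration, not a recommendation.

```sql
-- Current values; "changed" is 1 when a setting differs from its default.
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_threads', 's3_max_get_rps', 's3_max_get_burst');

-- Override max_threads for one range scan over the S3-backed table.
SELECT count()
FROM events_s3
WHERE ts >= now() - INTERVAL 1 DAY
SETTINGS max_threads = 16;
```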
Cost Trade-offs
Three line items dominate the bill: storage (per GB-month), request charges (per 1,000 GET, PUT, LIST), and data transfer out of the region. The request component often surprises operators because background merges issue many small operations.
Mitigations that work in practice:
- Use a single-region setup. Cross-region transfer dwarfs storage cost.
- Keep parts large via TTL-aware partitioning and parts_to_throw_insert tuning.
- Enable a local cache for hot queries to cut GETs.
- Disable redundant lifecycle rules; let ClickHouse manage object lifetime.
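To decide which queries would benefit most from a local cache, one option is to rank recent queries by the S3 requests they issued. The sketch below assumes that system.query_log is enabled and that your version exposes ProfileEvents as a map; background merges are not recorded in this log, so it covers only the query-driven share of the request bill.

```sql
-- Top query shapes by S3 read requests over the last day or so.
SELECT
    normalized_query_hash,
    any(query) AS sample_query,
    count() AS runs,
    sum(ProfileEvents['S3ReadRequestsCount']) AS s3_read_requests,
    formatReadableSize(sum(ProfileEvents['ReadBufferFromS3Bytes'])) AS s3_bytes_read
FROM system.query_log
WHERE type = 'QueryFinish' AND event_date >= today() - 1
GROUP BY normalized_query_hash
ORDER BY s3_read_requests DESC
LIMIT 10;
```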
Cluster and Replication Concerns
Pointing two replicas at the same S3 path is supported, but each replica still tracks its own metadata. With zero-copy replication (when enabled), replicas share the underlying S3 objects and only one copy of the data is stored. Zero-copy has had stability caveats across versions; test thoroughly on your ClickHouse version before relying on it.
If a replica is rebuilt, it can download data from S3 directly rather than streaming from a peer, which is faster than full replication over the network.
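To see whether zero-copy replication is enabled on a server, the MergeTree settings can be inspected. The setting name has changed across releases, so this sketch matches on a substring instead of assuming one exact name.

```sql
-- Lists zero-copy related MergeTree settings and whether they were changed.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name LIKE '%zero_copy%';
```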
Common Pitfalls
- Using path-style URLs against a bucket that requires virtual-hosted-style addressing, or the reverse.
- Forgetting to grant s3:ListBucket in addition to GetObject and PutObject. Listing is required for many operations.
- Setting an aggressive TTL that moves data to S3 faster than merges can keep parts large, producing many small objects.
- Dropping or truncating tables and assuming the S3 objects are gone. Cleanup is asynchronous, and bugs or interrupted operations can leave orphans.
- Running mutations on S3-backed tables without understanding the write amplification. A simple ALTER TABLE DELETE can rewrite gigabytes.
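For the mutation pitfall, the sketch below shows one way to watch how much work an in-flight mutation still has to do. It uses the events_s3 table from the earlier example; parts_to_do counts the parts that still need to be rewritten.

```sql
-- Unfinished mutations on the S3-backed table and their remaining work.
SELECT
    mutation_id,
    command,
    create_time,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE database = currentDatabase() AND table = 'events_s3' AND NOT is_done
ORDER BY create_time;
```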
Frequently Asked Questions
Q: Should I use the S3 engine or an S3 disk?
A: Use the S3 engine for ad hoc queries over files you already have in a bucket, ingestion pipelines, and data lake integration. Use an S3 disk when you want a regular MergeTree table whose parts happen to live in S3, particularly for tiered storage.
Q: Can ClickHouse use S3-compatible storage like MinIO, GCS, or Wasabi?
A: Yes. Set the endpoint to the provider URL and supply credentials as you would for AWS. Some providers require region or signature version overrides.
Q: Does S3 storage support all MergeTree features?
A: Most features work, including TTL, projections, and materialized views. Mutations work but are expensive. Some features around zero-copy replication and backups have version-specific caveats.
Q: How do I monitor S3 usage from ClickHouse?
A: Query system.events for S3ReadRequestsCount, S3WriteRequestsCount, ReadBufferFromS3Bytes, and WriteBufferFromS3Bytes. Pair these with CloudWatch S3 metrics for the bucket.
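A minimal sketch of that query follows. The counters are cumulative since server start, and an event that has not occurred yet does not appear in the result.

```sql
-- Server-wide S3 request and byte counters since startup.
SELECT event, value, description
FROM system.events
WHERE event IN (
    'S3ReadRequestsCount',
    'S3WriteRequestsCount',
    'ReadBufferFromS3Bytes',
    'WriteBufferFromS3Bytes'
);
```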
Q: What happens when I drop a table backed by S3?
A: ClickHouse removes the metadata and queues the S3 objects for deletion. Deletion is asynchronous and can be delayed by ongoing merges or other replicas holding references. Plan for orphan cleanup as a maintenance task.