ClickHouse tables backed by S3 can accumulate objects that no live table references. This happens after drops, truncates, interrupted operations, and certain replication failure modes. The orphaned data continues to cost money and complicates audits. This article explains why orphans appear and how to find and remove them safely.
Why Orphans Appear
A ClickHouse table on an S3 disk consists of two pieces: local metadata under /var/lib/clickhouse/disks/<disk>/, which records the object keys that make up each part file, and the actual data parts in the bucket. When you DROP or TRUNCATE, the metadata goes immediately, but the S3 objects do not. ClickHouse defers the actual delete because long-running queries, ongoing merges, or other replicas may still hold references to those objects.
Under normal conditions the background cleanup eventually catches up. Orphans appear when:
- The server crashes or is forcefully stopped between metadata removal and S3 deletion.
- Zero-copy replication metadata in ZooKeeper becomes inconsistent with the actual references.
- A bucket-level lifecycle policy interacts badly with in-flight operations.
- Bugs in specific ClickHouse versions skip cleanup paths.
- Operators manually edit metadata or restore from partial backups.
The result is data in S3 that no table accounts for and no automatic process will remove.
Detecting Orphans
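Start by reconciling what the cluster knows about against what S3 holds. From SQL, list every S3-backed part the server is tracking. The disk name 's3' is an example; substitute the name from your storage configuration. Inactive parts are included on purpose, because their objects stay referenced until ClickHouse removes them, and treating them as orphans would be a mistake:
SELECT
    database,
    table,
    name AS part_name,
    active,
    disk_name,
    path
FROM system.parts
WHERE disk_name = 's3';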
For the bucket side, list objects under the disk's prefix:
aws s3 ls s3://BUCKET/path/ --recursive --summarize
A diff between the two lists is your candidate orphan set. In practice this requires correlating the obfuscated object keys with the part metadata files, since ClickHouse does not store data under human-readable paths by default. Tools that automate this correlation are much safer than manual scripts.
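Once both sides are expressed as object keys, the diff reduces to a set difference. The following is a minimal sketch, assuming the referenced keys were exported beforehand (for example from system.remote_data_paths) and that both lists use the same key form, either with or without the disk prefix. The function name and the sample keys are hypothetical.

```python
from typing import Iterable, Set


def find_orphan_candidates(referenced: Iterable[str], listed: Iterable[str]) -> Set[str]:
    """Return keys present in the bucket listing but absent from ClickHouse metadata."""
    # Strip whitespace and skip blank lines so file exports can be passed in directly.
    referenced_keys = {key.strip() for key in referenced if key.strip()}
    listed_keys = {key.strip() for key in listed if key.strip()}
    return listed_keys - referenced_keys


if __name__ == "__main__":
    # Hypothetical keys; real ClickHouse object keys are opaque random strings.
    referenced = ["path/abc/defghijk", "path/lmn/opqrstuv"]
    listed = ["path/abc/defghijk", "path/lmn/opqrstuv", "path/xyz/orphaned1"]
    for key in sorted(find_orphan_candidates(referenced, listed)):
        print(key)
```

The result is only a candidate set. Keys written after the metadata export was taken will also show up here, which is why the age filter and dry-run review still matter.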
Cleanup Strategies
Use a Garbage Collection Utility
Community garbage collection utilities such as s3gc automate the reconciliation and deletion. The general approach is to walk the local metadata, build the set of referenced S3 keys, list the bucket, and remove anything not referenced. Run any such tool during a maintenance window with the cluster idle, or with a sufficient grace period to avoid racing in-flight operations.
Typical invocation pattern (flag names vary by tool and version, so check the tool's own help output before running it):
s3gc --metadata-path /var/lib/clickhouse/disks/s3/ \
    --bucket BUCKET --prefix path/ \
    --age 24h --dry-run
Always start with --dry-run and review the proposed deletes before allowing the tool to actually remove objects.
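The grace period amounts to an age filter over the candidate set: an object qualifies for deletion only if it was last modified long enough ago that no in-flight write could still be about to register it. This sketch illustrates the idea and is not how any particular tool implements it. The function name and the input shape (a mapping from key to last-modified time) are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def older_than(candidates: Dict[str, datetime], min_age: timedelta, now: datetime) -> List[str]:
    """Keep only candidate keys whose last-modified time is at least min_age before now."""
    cutoff = now - min_age
    return sorted(key for key, modified in candidates.items() if modified <= cutoff)


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    # Hypothetical candidates with their last-modified timestamps.
    candidates = {
        "path/abc/old_object": datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
        "path/abc/fresh_object": datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc),
    }
    print(older_than(candidates, timedelta(hours=24), now))
```

Passing the current time in explicitly keeps the filter deterministic, so a dry run and the later real run can be compared against the same cutoff.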
Partition By Path Per Table or Replica
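A preventive pattern: give each table or replica its own prefix inside the bucket. Because the endpoint is a property of the disk, this means defining a separate disk (and a storage policy that uses it) for each prefix. When you drop the table, you can delete the entire prefix from the AWS console or CLI without touching anything else.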
<endpoint>https://s3.us-east-1.amazonaws.com/BUCKET/cluster_a/replica_1/table_x/</endpoint>
This trades a small amount of configuration complexity for a much simpler cleanup story.
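Deleting a whole prefix is only safe if no other table's prefix sits inside it. A small check over the configured prefixes can catch that before anything is removed. This is a sketch under the assumption that you keep a list of the prefixes in use; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def overlapping_prefixes(prefixes: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (outer, inner) pairs where deleting outer would also delete inner."""
    # Normalize to a trailing slash so "table_x/" does not match "table_xy/".
    normalized = sorted(p.strip("/") + "/" for p in prefixes)
    clashes = []
    for i, outer in enumerate(normalized):
        # A containing prefix always sorts before the prefixes it contains.
        for inner in normalized[i + 1:]:
            if inner.startswith(outer):
                clashes.append((outer, inner))
    return clashes


if __name__ == "__main__":
    in_use = [
        "cluster_a/replica_1/table_x/",
        "cluster_a/replica_1/table_xy/",
        "cluster_a/replica_1/",
    ]
    for outer, inner in overlapping_prefixes(in_use):
        print(f"{outer} contains {inner}")
```

An empty result means each prefix can be removed independently. Two tables configured with the identical prefix are reported as a clash as well.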
Use clickhouse-disks
The clickhouse-disks utility lets you operate directly on a disk through ClickHouse's storage abstraction. Removing a path on the S3 disk:
clickhouse-disks --disk s3 \
    --query "remove /cluster/database/table/replica1"
This uses ClickHouse's own code paths to delete, which is safer than crafting raw aws s3 rm commands because it works through the disk's metadata layer and path conventions instead of guessing object keys. Run it only when you are certain nothing else references the path.
Bucket Lifecycle Rules as a Safety Net
Configure an S3 lifecycle rule that expires objects under a temporary prefix after a defined retention. This is not a substitute for proper cleanup, but it bounds the worst case if your application logic ever puts orphans into a known location.
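Such a rule is a small JSON document. The sketch below builds one for a given prefix and retention. The "trash/" prefix, the 7-day retention, and the function name are illustrative assumptions; the field names follow the S3 lifecycle configuration format.

```python
import json
from typing import Any, Dict


def trash_expiration_rule(prefix: str, days: int) -> Dict[str, Any]:
    """Build a lifecycle configuration that expires objects under prefix after days."""
    if not prefix or days < 1:
        # An empty prefix would apply the rule to the whole bucket.
        raise ValueError("prefix must be non-empty and days must be >= 1")
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Assumed values for illustration only.
    print(json.dumps(trash_expiration_rule("trash/", 7), indent=2))
```

The printed JSON can be saved to a file and applied with aws s3api put-bucket-lifecycle-configuration. That call replaces the bucket's entire lifecycle configuration, so merge the new rule with any existing ones first.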
Safe Operating Procedure
A reliable orphan cleanup procedure looks like this:
- Quiesce the cluster: stop merges if possible (for example with SYSTEM STOP MERGES), or run during low activity.
- Snapshot the local metadata directory so you can recover if something goes wrong.
- Run the reconciliation in dry-run mode and review the candidate orphans.
- Move candidates to a "trash" prefix first (S3 copy, then delete the original), or rely on bucket versioning so accidental deletes can be recovered.
- Wait long enough to cover your longest-running queries and merges and to confirm that nothing reports missing parts, then permanently remove the trash prefix.
- Validate by re-running the reconciliation.
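The "trash" step in this procedure can be planned before any object is touched. The sketch below only computes source and destination keys; executing the copy and delete with an S3 client is left out on purpose. The function name and the default "trash/" prefix are assumptions.

```python
from typing import Iterable, List, Tuple


def plan_trash_moves(keys: Iterable[str], trash_prefix: str = "trash/") -> List[Tuple[str, str]]:
    """Map each orphan key to a destination under trash_prefix, preserving the original key."""
    base = trash_prefix.strip("/") + "/"
    moves = []
    for key in keys:
        if key.startswith(base):
            # Already in the trash; moving it again would nest it deeper.
            continue
        moves.append((key, base + key))
    return moves


if __name__ == "__main__":
    # Hypothetical candidate keys from a dry-run reconciliation.
    for source, destination in plan_trash_moves(["path/abc/orphan1", "trash/path/old"]):
        print(f"copy {source} -> {destination}, then delete {source}")
```

Keeping the full original key under the trash prefix makes recovery a matter of reversing each pair.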
Common Pitfalls
- Deleting objects from S3 directly while ClickHouse still has references in metadata results in "missing part" errors on the next query or merge.
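- Lifecycle rules that delete based on age alone can remove objects that ClickHouse still uses. S3 expiration counts from object creation, not last access, so old parts of a long-lived table are expired even though live metadata still references them.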
- Multi-replica setups with zero-copy replication share objects. Cleaning up on one replica can break another. Coordinate cluster-wide.
- S3 LIST results do not show in-flight multipart uploads: an object appears only once its upload completes, and the parts of an aborted or abandoned upload never appear in an object listing even though they are still billed. Wait until ongoing writes settle before reconciling. (AWS S3 has provided strong read-after-write consistency for all operations since December 2020, so stale reads on already-completed PUTs are not a concern.)
- LIST requests against very large prefixes are expensive. Paginate and budget for the request cost.
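The LIST budget from the last pitfall is simple arithmetic, since each S3 LIST call returns at most 1,000 keys. The sketch below estimates the request count and cost. The price is a parameter because it depends on region and storage class; the functions and the example figures are illustrative assumptions.

```python
import math


def list_request_count(object_count: int, page_size: int = 1000) -> int:
    """Number of LIST calls needed to page through object_count keys."""
    if object_count < 0 or page_size < 1:
        raise ValueError("object_count must be >= 0 and page_size must be >= 1")
    # Even an empty prefix costs one request to discover that it is empty.
    return max(1, math.ceil(object_count / page_size))


def list_cost(object_count: int, price_per_1000_requests: float) -> float:
    """Estimated cost of one full listing at the given request price."""
    return list_request_count(object_count) * price_per_1000_requests / 1000


if __name__ == "__main__":
    # Assumed figures: 250 million objects and a placeholder price per 1,000 requests.
    print(list_request_count(250_000_000))
    print(list_cost(250_000_000, price_per_1000_requests=0.005))
```

A reconciliation that runs the listing twice (before and after cleanup) doubles this figure.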
Frequently Asked Questions
Q: How long does ClickHouse normally wait before deleting S3 objects?
A: Background deletion runs continuously and usually catches up within minutes for healthy operations. Long-running queries can extend that window to hours. Persistent orphans past a day usually indicate a real bug or operational issue.
Q: Will dropping a table delete its S3 data?
A: Yes, eventually. The drop returns immediately, but the S3 deletion happens in the background. With zero-copy replication, deletion happens only after all replicas confirm.
Q: Can I just run aws s3 rm --recursive on the bucket?
A: Only if you have already dropped every ClickHouse table that uses that bucket and the cluster has fully reconciled. Otherwise you will break live tables.
Q: Is there a system table that lists S3 objects ClickHouse knows about?
A: system.remote_data_paths exposes the mapping between local metadata and remote storage paths. Querying it is the basis for any reconciliation tool.
Q: How do I prevent orphans in the first place?
A: Use per-table or per-cluster prefixes, avoid hard-killing ClickHouse, keep ZooKeeper healthy in replicated setups, and avoid manual edits to disk metadata.