Mutations on S3-backed MergeTree tables in ClickHouse have very different cost profiles depending on whether the mutation can use the primary index. This article walks through measured behavior on a 700 million row table, explains why the difference is so large, and gives practical guidance on table design and operations.
Why Mutations Matter on S3
A ClickHouse mutation is not an in-place update. The engine rewrites the affected parts to apply the change and atomically swaps the new parts in. On local disk this is expensive but bounded by your SSD throughput. On S3 the same operation translates into network traffic that you pay for, both in time and in bucket request charges. Understanding the read and write volumes of a mutation lets you make informed decisions about table design and DML strategy.
Test Setup
The reference benchmark uses a table with 700 million rows distributed across 70 partitions on an S3 disk, with merges disabled to isolate mutation behavior from background activity. The storage policy includes both a default local disk and an S3 disk with prefer_not_to_merge enabled.
Schema:
- Column
A(Int64): ~2.22 GiB compressed, included in the primary key - Column
S(String): ~2.33 GiB compressed, not indexed - Column
D(Date): ~5.09 MiB compressed - Total table size on S3: ~4.58 GiB
Mutation by Primary Key
A delete that filters on the primary key column A runs efficiently:
ALTER TABLE t DELETE WHERE A = 42;
Measured behavior:
| Metric | Value |
|---|---|
| Duration | ~17 to 19 seconds |
| S3 reads | ~23 MiB |
| S3 writes | ~27 KiB |
The primary index lets ClickHouse skip most granules. Only the granules that potentially contain matching rows are read and rewritten. Because the matching rows are a small fraction of the data, the output is tiny.
Mutation on a Non-Indexed Column
A delete that filters on the unindexed string column S cannot skip granules:
ALTER TABLE t DELETE WHERE S = 'some-value';
Measured behavior:
| Metric | Value |
|---|---|
| Duration | ~29 to 31 seconds |
| S3 reads | ~2.39 to 2.42 GiB |
| S3 writes | ~41 MiB |
The engine must read the entire S column to evaluate the predicate. The write volume is larger than the primary key case because more parts are touched and rewritten end to end, including columns the predicate did not reference.
Why Writes Grow Faster Than the Logical Change
Even a delete that removes a handful of rows can produce tens of MiB of writes. The reason is structural: ClickHouse writes whole parts, not row-level changes. Every part that contains at least one affected row is rewritten in full. Columns the mutation did not touch are still copied into the new part because the part is the atomic unit.
The amplification is most painful when:
- Partitions are large and contain a few scattered rows to change.
- The mutation touches many partitions but few rows per partition.
- The table has wide rows with many large columns.
Design Guidance
These patterns reduce mutation cost on S3-backed tables:
- Align mutations with the primary key. If you regularly delete by user_id, put user_id in the
ORDER BY. This is the single biggest lever. - Use lightweight deletes where supported.
DELETE FROM t WHERE ...(the lightweight DELETE syntax) marks rows as deleted via a mask file instead of rewriting parts. Read cost increases slightly but mutation cost drops dramatically. - Batch mutations. Combining multiple delete predicates into one
ALTER TABLE ... DELETE WHERE A IN (...)is cheaper than running many small mutations because each mutation iteration rewrites parts. - Partition with mutations in mind. If you delete by month, partition by month so a delete drops or rewrites only the relevant partition.
- Avoid mutations on cold tiers. Move data to S3 only after the table has stabilized. Mutations on S3 cost much more than mutations on local disk.
- Use
TTLfor time-based deletion. TTL-driven drops of whole parts or partitions are much cheaper than equivalent ALTER DELETEs.
Monitoring Mutation Cost
system.mutations shows in-flight mutations with their progress. To measure S3 traffic specifically, pair it with system.events:
SELECT event, value
FROM system.events
WHERE event LIKE 'S3%';
Pre and post snapshots of these counters around a mutation give a precise read and write volume for the operation. Combine with CloudWatch S3 metrics for the bucket if you want billing-side validation.
Common Pitfalls
- Running
ALTER TABLE DELETEon a large S3-backed table during business hours, then watching costs spike from request charges and bandwidth. - Assuming a delete that touches a single row writes only a few bytes. It rewrites every part that contains that row.
- Forgetting that mutations on S3 contend with normal query traffic for the same S3 request budget.
- Issuing many tiny mutations instead of one larger one. Each mutation has overhead.
- Mutating against a non-indexed predicate in production when an equivalent indexed predicate exists.
Frequently Asked Questions
Q: Are S3 mutations slower than local mutations? A: Yes, often substantially. The difference is dominated by network round trips to read and write parts. CPU work is similar.
Q: Can I cancel a running mutation on S3?
A: Yes, with KILL MUTATION. The work already done is not rolled back, but no further part rewrites occur. Partial outputs are cleaned up.
Q: Are lightweight deletes safe for S3-backed tables? A: Yes, and they are typically the right choice. Verify the feature is enabled in your version and account for the small read overhead from applying the delete mask at query time.
Q: How do mutations interact with TTL moves? A: TTL moves and mutations both rewrite parts. If a part is mid-mutation when a TTL move triggers, the move waits. Sequential serialization can extend mutation completion times.
Q: Does write amplification depend on column compression? A: Yes. Wider, less compressible columns produce more bytes written even when the row count change is identical.