ClickHouse Desharding: Merging Many Shards Into Fewer

Q: How do I verify nothing was lost?

Compare SELECT count() FROM table and SELECT sum(rows) FROM system.parts WHERE active between the source cluster (summed across shards) and the target. Also review system.detached_parts for anything left behind with a non-empty reason .

Desharding is the operation of consolidating a ClickHouse cluster with N shards into a cluster with fewer shards (often one). It comes up when an over-provisioned cluster needs to shrink, when a multi-shard prototype graduates to a simpler single-shard design, or when storage and CPU growth on a per-node basis makes fewer shards economical. There is no single command; the workable approach is a backup-driven migration that restores schema once and reattaches data shard by shard. This guide walks through that pattern using clickhouse-backup, with the SQL needed to inventory, attach, and verify.

The Core Approach

Three rules drive the migration:

Restore the schema only once. Every source shard contains the same DDL. Restoring N times duplicates nothing useful and risks divergence.
Restore each shard's data into detached/ and attach incrementally. This isolates the work per shard and lets you verify before committing.
Recreate Distributed tables for the new topology. The cluster definition changes; routing tables must be rebuilt to match.

The migration consolidates N old shards into M new shards (M < N). Most commonly M = 1, which is the simplest case and the one shown here. For M > 1, run the same procedure on each target shard for the subset of source shards mapped to it.

Step 1: Take Backups of Every Source Shard

Run clickhouse-backup on each source shard separately. Name backups so you can tell them apart:

# On source shard 1
clickhouse-backup create_remote shard01_20260319_full

# On source shard 2
clickhouse-backup create_remote shard02_20260319_full

# ... and so on

Use remote storage (S3, GCS, or a shared filesystem) so the target node can pull each backup independently.

Step 2: Restore Schema Once

On the new (consolidated) shard, restore schema only from one of the backups. Pick the most recent and most representative source:

clickhouse-backup restore_remote --schema shard01_20260319_full

--schema restores DDL (databases, tables, dictionaries) but no data. Verify with:

SHOW DATABASES;
SHOW TABLES FROM db1;

Critically, the restored tables must be local tables (the underlying *_local MergeTree or ReplicatedMergeTree). Distributed tables will be recreated at the end against the new cluster topology.

Step 3: Restore Each Shard's Data Into `detached/`

For each source shard, restore data into the detached/ directory rather than directly attaching:

clickhouse-backup restore_remote \
  --data \
  --tables="db1.*_local,db2.*_local" \
  --replicated-copy-to-detached \
  shard01_20260319_full

--replicated-copy-to-detached stages parts under each table's detached/ directory without attaching them. ClickHouse will not see them until you explicitly attach.

Repeat for every source shard:

clickhouse-backup restore_remote --data --tables="db1.*_local,db2.*_local" \
  --replicated-copy-to-detached shard02_20260319_full

clickhouse-backup restore_remote --data --tables="db1.*_local,db2.*_local" \
  --replicated-copy-to-detached shard03_20260319_full

Do not restore all shards then attach in bulk. Restore one shard, attach its parts, verify, then move on.

Step 4: Generate and Run ATTACH PART Statements

Inventory the detached parts and emit the ATTACH SQL:

SELECT concat(
    'ALTER TABLE `', database, '`.`', table,
    '` ATTACH PART ', quoteString(name), ';'
) AS attach_sql
FROM system.detached_parts
WHERE database IN ('db1', 'db2')
  AND ifNull(reason, '') = ''
ORDER BY database, table, partition_id, min_block_number;

Pipe this into clickhouse-client or run it interactively. The ifNull(reason, '') = '' filter excludes parts that ClickHouse already flagged as broken or otherwise unattachable.

After attach, verify part counts:

SELECT database, table, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE active AND database IN ('db1', 'db2')
GROUP BY database, table;

Cross-check against the source totals you captured before migration.

Step 5: Recreate Distributed Tables

The old cluster had multiple shards. The new cluster has fewer. Update remote_servers in config.xml to describe the new topology, then recreate Distributed tables:

DROP TABLE IF EXISTS db1.events;

CREATE TABLE db1.events AS db1.events_local
ENGINE = Distributed('cluster_1shard', 'db1', 'events_local',
                     cityHash64(user_id));

cluster_1shard is the new cluster definition with the single target shard. The sharding key (cityHash64(user_id)) is preserved for forward compatibility if you ever re-shard outward later.

Step 6: Verify Keeper Paths and Macros

If the source was using ReplicatedMergeTree, every restored table has Keeper paths baked in. The new shard must have:

A clean ZooKeeper/Keeper path that does not collide with old shards.
macros in config.xml (shard, replica) reflecting the new topology.

A common mistake is restoring with the old {shard} macro pointing at, say, shard01. New replicas joining the consolidated shard then trip over conflicting metadata.

Inspect the Keeper paths after attach:

SELECT database, table, zookeeper_path, replica_path
FROM system.replicas;

If a path embeds an old shard identifier, rename via ALTER TABLE ... MODIFY SETTING or recreate with explicit paths.

Step 7: Final Verification

Re-check detached parts for anything that did not attach:

SELECT database, table, partition_id, name, reason
FROM system.detached_parts
WHERE database IN ('db1', 'db2')
ORDER BY database, table, partition_id;

Anything with a non-empty reason needs manual inspection. Common reasons include broken, clone, or covered-by-broken.

Compare row counts to the source totals:

SELECT database, table, sum(rows) AS total
FROM system.parts
WHERE active AND database = 'db1'
GROUP BY database, table
ORDER BY table;

Common Pitfalls

Restoring schema multiple times. Wastes time and risks divergence if any source shard had drift. Always --schema once.
Attaching parts from multiple shards in bulk before verifying. A bad shard contaminates the merge of inventory. One shard at a time.
Forgetting to update macros and remote_servers. The new shard inherits Keeper paths that no longer make sense.
Skipping the detached-parts review. Any reason value other than empty is a part you have not actually loaded. Address them before declaring the migration done.
Recreating Distributed tables before all local data is attached. Queries against the Distributed table will return partial results until every shard's parts are in place.

Frequently Asked Questions

Q: Can ClickHouse merge shards automatically? A: No. There is no built-in "deshard" command. The supported pattern is a controlled migration: back up each source shard, restore schema once on the new shard, restore data into detached/, and attach incrementally.

Q: Is clickhouse-copier an alternative? A: It can move data between clusters, but it does not handle metadata, replicas, or schema in one pass. For consolidation specifically, the clickhouse-backup + ATTACH workflow is more predictable and easier to verify.

Q: Do I need to stop writes during the migration? A: For a clean cutover, yes. Take each source backup at a known-quiet moment, then redirect writes to the new shard after the final ATTACH. A live migration is possible but requires capturing deltas and replaying them.

Q: What happens to ReplicatedMergeTree Keeper paths during desharding? A: They must change. Restoring metadata that points to old Keeper paths breaks replication on the new shard. Either rewrite the DDL before restoring, or use macros so each environment computes its own path.

Q: How do I verify nothing was lost? A: Compare SELECT count() FROM table and SELECT sum(rows) FROM system.parts WHERE active between the source cluster (summed across shards) and the target. Also review system.detached_parts for anything left behind with a non-empty reason.

Q: Can I deshard from N to M where M > 1? A: Yes. Map each source shard to a target shard (typically by modulo or range on the sharding key), then run the procedure on each target shard for its assigned subset of source backups. The schema restore is still once-per-target.