Adding and Removing Replicas in a Live ClickHouse Cluster

Adding or removing a replica in a running ClickHouse cluster is a routine operational task, but the order of steps matters. Replicas coordinate through ClickHouse Keeper (or ZooKeeper), so the data plane (table parts on disk) and the control plane (cluster topology in remote_servers) have to be changed in the right sequence to avoid clients hitting a half-initialized node or a ghost replica lingering in Keeper.

This guide covers the full lifecycle: provisioning a new node, applying schema, letting it sync, wiring it into the cluster, and later draining and dropping a replica cleanly. For background on how replication works internally and how to read replica state, see ClickHouse replication. For the static cluster topology this guide modifies, see production cluster configuration.

How a Replica Is Identified

A ReplicatedMergeTree table is bound to a Keeper path and a replica name, normally through macros:

CREATE TABLE events ON CLUSTER '{cluster}'
(
    event_date Date,
    user_id UInt64,
    amount Decimal(10, 2)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/events', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

The {shard} and {replica} placeholders are resolved per node from that node's macros configuration. Two replicas of the same shard share the same Keeper path (same {shard}) but must have distinct {replica} values. Getting this wrong is a common cause of a botched scale-up: if a new node reuses an existing replica name, it tries to register under a replica path that already exists in Keeper, and the CREATE TABLE fails with REPLICA_ALREADY_EXISTS.

<!-- /etc/clickhouse-server/config.d/macros.xml on the NEW node -->
<clickhouse>
    <macros>
        <cluster>my_cluster</cluster>
        <shard>01</shard>
        <replica>chnode3</replica>
    </macros>
</clickhouse>
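
To make the path-and-name rule concrete, the following Python sketch expands the macros for a few nodes and flags any replica name claimed twice under the same Keeper path. The node inventory, the `analytics` database name, and the `expand` and `find_conflicts` helpers are illustrative assumptions, not part of ClickHouse.

```python
from collections import defaultdict

# Keeper path template from the CREATE TABLE above. {database} is filled in
# from the table's database, assumed here to be "analytics".
PATH_TEMPLATE = "/clickhouse/tables/{shard}/{database}/events"


def expand(template, macros):
    """Substitute {name} placeholders with per-node macro values."""
    return template.format(**macros)


def find_conflicts(nodes):
    """Return (keeper_path, replica) pairs claimed by more than one host.

    `nodes` maps a host name to that host's macros dict.
    """
    claimed = defaultdict(list)
    for host, macros in nodes.items():
        key = (expand(PATH_TEMPLATE, macros), macros["replica"])
        claimed[key].append(host)
    return {key: hosts for key, hosts in claimed.items() if len(hosts) > 1}


# Hypothetical inventory: chnode3 was cloned from chnode2 and kept its macro.
nodes = {
    "chnode1": {"shard": "01", "replica": "chnode1", "database": "analytics"},
    "chnode2": {"shard": "01", "replica": "chnode2", "database": "analytics"},
    "chnode3": {"shard": "01", "replica": "chnode2", "database": "analytics"},
}

conflicts = find_conflicts(nodes)
for (path, replica), hosts in conflicts.items():
    print(f"replica {replica!r} under {path} is claimed by {hosts}")
```

All three nodes expand to the same Keeper path, which is correct for one shard; the conflict is only that two hosts present the same replica name. Running a check like this against your inventory before starting the new server is cheaper than debugging a failed CREATE TABLE.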

Adding a Replica

The guiding principle: bring the node fully into sync before you advertise it to clients. Do not add the node to remote_servers until its data has caught up, otherwise the Distributed engine and load balancers will route queries to a node that is still fetching parts and may return partial results.

Step 1 — Provision the node and configure macros

Install the same ClickHouse version as the rest of the cluster, point it at the same Keeper ensemble (the <zookeeper> section must match the existing nodes), and set the macros as shown above with a unique {replica} value. Do not yet add it to the cluster definition.

For large datasets, throttle the initial fetch so the sync does not saturate the network. Set a fetch limit on the new node and, optionally, a send limit on the donor nodes:

<!-- On the new (destination) node -->
<clickhouse>
    <max_replicated_fetches_network_bandwidth_for_server>50000000</max_replicated_fetches_network_bandwidth_for_server>
</clickhouse>

<!-- On existing (source) nodes, to cap how fast they push -->
<clickhouse>
    <max_replicated_sends_network_bandwidth_for_server>50000000</max_replicated_sends_network_bandwidth_for_server>
</clickhouse>

These values are bytes per second per server. Remove or raise them once the bulk sync is done.
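
Because the limit is expressed in bytes per second, it also sets a floor on how long the bulk sync takes. A rough Python estimate, assuming the throttle is the only bottleneck and using a hypothetical 2 TB of compressed parts:

```python
def estimate_sync_hours(total_bytes, limit_bytes_per_sec):
    """Lower bound on bulk-sync duration when the fetch limit is the bottleneck."""
    if limit_bytes_per_sec <= 0:
        # In ClickHouse a value of 0 means "unlimited", so no bound applies.
        raise ValueError("limit must be positive")
    return total_bytes / limit_bytes_per_sec / 3600


# Hypothetical shard size (2 TB compressed) with the 50 MB/s limit shown above.
hours = estimate_sync_hours(2 * 10**12, 50_000_000)
print(f"{hours:.1f} h")
```

If the resulting window is longer than you can tolerate, raise the limit rather than removing it outright, and re-check the impact on production traffic.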

Step 2 — Apply the schema

The new replica needs identical table DDL before it can join. Run the CREATE TABLE ... ReplicatedMergeTree(...) statements on the new node. Because the table is replicated, once a node creates the table with the correct Keeper path and a new replica name, ClickHouse automatically registers it in Keeper and begins fetching existing parts from the other replicas — you do not copy data files manually.

To replicate an entire server's schema, dump the DDL from an existing replica and apply it. Generate the statements idempotently:

-- Run on an existing replica to export table DDL
SELECT replaceRegexpOne(
    concat(create_table_query, ';'),
    'CREATE (TABLE|DICTIONARY|VIEW|MATERIALIZED VIEW)',
    'CREATE \\1 IF NOT EXISTS')
FROM system.tables
WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA')
  AND create_table_query != ''
INTO OUTFILE '/tmp/schema.sql'
FORMAT TSVRaw;

Apply DDL in dependency order: databases first, then base tables, then materialized views and dictionaries that depend on them. A materialized view created before its source table will fail. If you use clickhouse-backup, restoring with --schema only (no data) reproduces the full structure in the correct order, after which replication fills in the data — see ClickHouse backup and restore.
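
The ordering rule can be sketched as a stable sort over the exported statements. This Python example is a simplification that ranks by statement kind only; the `rank` and `order_ddl` helpers and the sample statements are hypothetical, and chains such as a materialized view reading from another view's target table may need a real dependency sort.

```python
import re

# Lower rank is applied first. Materialized views and dictionaries go last
# because they may read from tables created earlier.
RANKS = [
    (re.compile(r"^\s*CREATE\s+DATABASE\b", re.I), 0),
    (re.compile(r"^\s*CREATE\s+TABLE\b", re.I), 1),
    (re.compile(r"^\s*CREATE\s+VIEW\b", re.I), 2),
    (re.compile(r"^\s*CREATE\s+MATERIALIZED\s+VIEW\b", re.I), 3),
    (re.compile(r"^\s*CREATE\s+DICTIONARY\b", re.I), 3),
]


def rank(statement):
    """Map a DDL statement to its position in the apply order."""
    for pattern, value in RANKS:
        if pattern.search(statement):
            return value
    return 4  # anything unrecognized is applied last


def order_ddl(statements):
    """Stable sort: statements of the same kind keep their original order."""
    return sorted(statements, key=rank)


# Hypothetical statements in the arbitrary order system.tables returned them.
statements = [
    "CREATE MATERIALIZED VIEW IF NOT EXISTS db.events_mv TO db.daily AS SELECT 1",
    "CREATE TABLE IF NOT EXISTS db.events (id UInt64) ENGINE = MergeTree ORDER BY id",
    "CREATE DATABASE IF NOT EXISTS db",
    "CREATE TABLE IF NOT EXISTS db.daily (id UInt64) ENGINE = MergeTree ORDER BY id",
]

ordered = order_ddl(statements)
```

Combined with IF NOT EXISTS, the sorted list can be re-applied safely if a run is interrupted partway.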

Step 3 — Let the replica sync and monitor progress

Once the tables exist, the new replica pulls parts in the background. Watch the work in flight:

-- Parts currently being fetched, with progress and elapsed time
SELECT
    database,
    table,
    source_replica_hostname,
    round(progress * 100, 1) AS pct,
    elapsed,
    total_size_bytes_compressed
FROM system.replicated_fetches
ORDER BY elapsed DESC;

Check how far behind each table still is via the replication queue and absolute delay:

SELECT
    database,
    table,
    queue_size,
    inserts_in_queue,
    absolute_delay,
    is_readonly
FROM system.replicas
WHERE absolute_delay > 0 OR queue_size > 0
ORDER BY absolute_delay DESC;

The node is caught up when queue_size and absolute_delay reach (and stay near) zero. For a deeper read on what the queue entries mean and how to unstick a stalled one, see the replication queue guide. To block until a specific table is fully synced before proceeding, use:

SYSTEM SYNC REPLICA db.events;

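SYSTEM SYNC REPLICA pulls pending entries from the shared replication log into the current replica's queue and waits until those entries have been processed. Add STRICT to wait until the queue is completely empty; on a busy table where new entries keep arriving, that may never happen. Add LIGHTWEIGHT to wait only for GET_PART, ATTACH_PART, DROP_RANGE, REPLACE_RANGE, and DROP_PART entries rather than every queue entry, which is the more practical choice on busy tables where the queue never reaches exactly zero.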
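
The "reach and stay near zero" condition is easy to get wrong with a single reading, since the queue can touch zero briefly between merges. A minimal Python sketch of a stability gate over repeated polls of system.replicas follows; the `caught_up` helper, the three-poll window, and the sample history are assumptions for illustration.

```python
def caught_up(samples, stable_polls=3, max_queue=0, max_delay=0):
    """True when the last `stable_polls` samples are all within the thresholds.

    Each sample is a list of (queue_size, absolute_delay) tuples, one per
    replicated table, read from system.replicas at one polling instant.
    """
    if len(samples) < stable_polls:
        return False
    recent = samples[-stable_polls:]
    return all(
        queue <= max_queue and delay <= max_delay
        for sample in recent
        for queue, delay in sample
    )


# Hypothetical polling history for a node with two replicated tables.
history = [
    [(120, 340), (4, 10)],  # still fetching
    [(0, 0), (0, 0)],
    [(0, 0), (0, 0)],
    [(0, 0), (0, 0)],
]

ready = caught_up(history)          # last three polls are clean
too_early = caught_up(history[:3])  # window still includes the busy poll
```

On tables with constant inserts, relaxing `max_queue` and `max_delay` to small non-zero values is a reasonable trade-off.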

Step 4 — Add the node to the cluster topology

Only after the replica is in sync, add it to the <remote_servers> definition on all nodes so the Distributed engine and clients start routing to it:

<remote_servers>
    <my_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica><host>chnode1</host><port>9000</port></replica>
            <replica><host>chnode2</host><port>9000</port></replica>
            <!-- newly added -->
            <replica><host>chnode3</host><port>9000</port></replica>
        </shard>
    </my_cluster>
</remote_servers>

remote_servers changes are picked up without a restart when supplied via a config.d file. Keep internal_replication set to true for replicated tables so a Distributed insert writes to one replica and lets ClickHouse replicate it, rather than inserting the same data to every replica.
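
If the cluster definition is managed by a script rather than by hand, the edit can be made idempotent. The Python sketch below uses the standard library's ElementTree; the surrounding `<clickhouse>` wrapper, the `add_replica` helper, and the in-memory config string are assumptions standing in for a real config.d file.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of a config.d file holding the cluster definition.
CONFIG = """<clickhouse>
  <remote_servers>
    <my_cluster>
      <shard>
        <internal_replication>true</internal_replication>
        <replica><host>chnode1</host><port>9000</port></replica>
        <replica><host>chnode2</host><port>9000</port></replica>
      </shard>
    </my_cluster>
  </remote_servers>
</clickhouse>"""


def add_replica(root, cluster, shard_index, host, port=9000):
    """Append a <replica> to the given shard unless the host is already listed."""
    shards = root.findall(f"./remote_servers/{cluster}/shard")
    shard = shards[shard_index]
    existing = [r.findtext("host") for r in shard.findall("replica")]
    if host in existing:
        return False
    replica = ET.SubElement(shard, "replica")
    ET.SubElement(replica, "host").text = host
    ET.SubElement(replica, "port").text = str(port)
    return True


root = ET.fromstring(CONFIG)
added = add_replica(root, "my_cluster", 0, "chnode3")
hosts = [h.text for h in root.iter("host")]
print(hosts)
```

The same file has to be rolled out to every node, so an idempotent edit lets the rollout be retried without producing duplicate entries.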

Removing a Replica

Removing a replica is the reverse: stop sending it traffic first, let it finish any pending work, then evict its metadata from Keeper. Connect to a different replica for the drop step — you cannot drop the local replica with SYSTEM DROP REPLICA.

Step 1 — Drain client traffic

Remove the target replica from <remote_servers> on all nodes so the Distributed engine stops routing queries and inserts to it. This is the "draining" step. The node still holds its data and stays in sync via Keeper; it is simply no longer advertised to clients.

Step 2 — Let it catch up (for a healthy node)

If you are removing a healthy replica (for example, scaling a shard back down), let replication settle in both directions so that no data inserted directly into it is lost before the other replicas have a copy. The queue on the replica being removed only shows what it still has to fetch; what protects its data is that the surviving replicas have finished fetching from it. Check their queues:

-- Run on each surviving replica of the shard
SELECT database, table, queue_size, absolute_delay
FROM system.replicas
WHERE queue_size > 0 OR absolute_delay > 0;

When this returns no rows on every surviving replica, the parts held by the outgoing node have also been fetched by the survivors.

Step 3 — Detach tables and stop the node

On the node being removed, you can DETACH the replicated tables (or simply stop the server). Stopping the clickhouse-server process marks the replica inactive in Keeper. The data on disk is untouched.

Step 4 — Drop the replica from Keeper

From a surviving replica, evict the now-inactive replica's metadata so Keeper does not keep tracking it (and so its block numbers and queue entries get cleaned up). List the replica names first:

SELECT DISTINCT arrayJoin(mapKeys(replica_is_active)) AS replica_name
FROM system.replicas;

Then drop, choosing the scope you need:

-- One table
SYSTEM DROP REPLICA 'chnode3' FROM TABLE db.events;

-- All replicated tables in a database
SYSTEM DROP REPLICA 'chnode3' FROM DATABASE db;

-- All replicated tables known to this server
SYSTEM DROP REPLICA 'chnode3';

-- By explicit Keeper path (when the table no longer exists locally)
SYSTEM DROP REPLICA 'chnode3' FROM ZKPATH '/clickhouse/tables/01/db/events';

SYSTEM DROP REPLICA only removes the inactive replica's metadata from Keeper. It does not delete any data or metadata from disk, and it refuses to drop an active replica or the local one. If the replica still shows as active, make sure its server is stopped (or its table detached) before dropping. The FROM ZKPATH form is the escape hatch for a dead node whose table you can no longer reference because it was already removed.
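
A runbook script can apply the same guards before it ever sends the statement. The following Python sketch builds the statement text for each scope and refuses the two cases ClickHouse itself rejects; the `drop_replica_sql` helper and its arguments are hypothetical, and `is_active` is assumed to be the replica_is_active map read from system.replicas.

```python
def drop_replica_sql(replica, local_replica, is_active,
                     table=None, database=None, zk_path=None):
    """Build a SYSTEM DROP REPLICA statement for one scope, with safety checks."""
    if replica == local_replica:
        raise ValueError("cannot drop the local replica; run this from another node")
    if is_active.get(replica):
        raise ValueError(f"replica {replica!r} is still active; stop it first")
    if "'" in replica or (zk_path is not None and "'" in zk_path):
        raise ValueError("quotes are not escaped in this sketch")
    scopes = [s for s in (table, database, zk_path) if s is not None]
    if len(scopes) > 1:
        raise ValueError("pass at most one of table, database, zk_path")

    statement = f"SYSTEM DROP REPLICA '{replica}'"
    if table is not None:
        statement += f" FROM TABLE {table}"
    elif database is not None:
        statement += f" FROM DATABASE {database}"
    elif zk_path is not None:
        statement += f" FROM ZKPATH '{zk_path}'"
    return statement + ";"


# Hypothetical state as seen from chnode1 after chnode3 was stopped.
active = {"chnode1": 1, "chnode2": 1, "chnode3": 0}
print(drop_replica_sql("chnode3", "chnode1", active, table="db.events"))
```

Failing early in the script gives a clearer message than the server error and avoids issuing the command from the wrong node.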

Step 5 — Verify

Re-run the listing query and confirm the dropped name is gone:

SELECT DISTINCT arrayJoin(mapKeys(replica_is_active)) AS replica_name
FROM system.replicas;
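
If the drop was scoped to a single table or database, other tables may still track the old replica. A small Python sketch of that check, assuming rows of (database, table, replica_is_active) fetched from system.replicas; the `tables_still_tracking` helper and the sample rows are hypothetical.

```python
def tables_still_tracking(rows, replica):
    """Return 'db.table' names whose replica_is_active map still lists `replica`."""
    return [
        f"{database}.{table}"
        for database, table, replica_is_active in rows
        if replica in replica_is_active
    ]


# Hypothetical result after dropping chnode3 from db.events only.
rows = [
    ("db", "events", {"chnode1": 1, "chnode2": 1}),
    ("db", "sessions", {"chnode1": 1, "chnode2": 1, "chnode3": 0}),
]

leftover = tables_still_tracking(rows, "chnode3")
print(leftover)
```

An empty list means no table visible from this server still references the dropped name.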

Handling a Lagging Replica During Removal

If the replica you want to remove is far behind and is being decommissioned anyway, you do not need to wait for it to catch up: whatever it has not yet fetched already lives on the other replicas. This holds as long as it has not accepted inserts that the others have not fetched, which is normally the case once it has been removed from remote_servers and no client writes to it directly. Drain it, stop it, and drop it.

The dangerous case is a replica that received direct inserts (clients writing to it specifically) that have not yet replicated out. Removing it then loses those rows. Before dropping such a node, keep it running, run SYSTEM SYNC REPLICA on a surviving replica, and confirm that absolute_delay is zero there, so the survivor has fetched the outstanding parts. When a queue is genuinely stuck and blocking progress, the replication queue guide and replication problems diagnosis cover how to inspect and clear individual entries.

Best Practices

Sync before you advertise. Never add a node to remote_servers until queue_size and absolute_delay are at zero. A half-synced node in the cluster definition returns partial results.
Unique replica macro per node. Each replica of a shard must have a distinct {replica} value but share the same {shard}. Reusing a name causes REPLICA_ALREADY_EXISTS.
Throttle large initial syncs. Set max_replicated_fetches_network_bandwidth_for_server on the new node before it starts pulling, then relax it afterward, so the sync does not starve production traffic.
Apply DDL in dependency order. Databases, then tables, then materialized views and dictionaries. Use IF NOT EXISTS so a partially-applied schema can be re-run safely.
Always drop the Keeper metadata when removing a node. Stopping the server is not enough — an undropped replica leaves stale entries in Keeper that hold references to block numbers and can bloat the queue. Run SYSTEM DROP REPLICA from a surviving node.
Keep internal_replication = true for replicated tables so Distributed inserts replicate via Keeper instead of duplicating writes to every replica.

Common Issues

REPLICA_ALREADY_EXISTS on table creation — the Keeper path still holds a replica with that name from a previous attempt. Run SYSTEM DROP REPLICA '<name>' FROM ZKPATH '<path>' from another node, then retry the CREATE TABLE.
New replica stuck in read-only mode — the node cannot reach Keeper or its <zookeeper> config does not match the rest of the cluster. Check is_readonly in system.replicas and verify connectivity; see ZooKeeper session expired and the ZooKeeper configuration guide.
SYSTEM DROP REPLICA reports the replica is still active — the target server is still running or the table is still attached. Stop the node (or DETACH the table) first; the command only evicts inactive replicas.
Data exists on disk but Keeper metadata was lost — this is a recovery scenario, not a normal add. Use SYSTEM RESTORE REPLICA db.table on the read-only table to re-register it from its local parts rather than re-fetching everything.

How Pulse Helps

Adding and removing replicas touches the riskiest seams of a ClickHouse cluster: Keeper coordination, replication lag, and cluster topology. Pulse continuously monitors system.replicas and system.replicated_fetches across every node, so during a scale-up you can see exactly when a new replica has truly caught up — queue_size and absolute_delay at zero — before you wire it into remote_servers. During a scale-down, Pulse flags replicas with unreplicated local writes or non-empty queues so you do not drop a node that still holds the only copy of some data, and it surfaces stale Keeper entries left behind by an incomplete removal. The same expert team that builds Pulse can review your add/remove runbooks and help size shards and replicas for your workload.

Frequently Asked Questions

Q: Do I need to copy data files to the new replica manually?

No. Once the new node creates the ReplicatedMergeTree table with the correct Keeper path and a unique replica name, ClickHouse registers it in Keeper and fetches all existing parts from the other replicas automatically. You only copy the DDL, not the data.

Q: In what order should I add the node to remote_servers?

Last. Configure macros and Keeper, create the tables, wait for the replica to fully sync (queue_size and absolute_delay at zero), and only then add it to remote_servers on all nodes. Adding it earlier routes client queries to a node that is still fetching data.

Q: Why won't SYSTEM DROP REPLICA remove my replica?

It only drops inactive replicas, and it cannot drop the local one. Run it from a different replica, and make sure the target server is stopped (or its tables detached) so it shows as inactive. To drop the local replica's table use DROP TABLE instead.

Q: Does SYSTEM DROP REPLICA delete data?

No. It removes the replica's metadata from ClickHouse Keeper / ZooKeeper only. Data and table metadata on the dropped node's disk are left untouched — clean those up separately if you are reclaiming the host.

Q: Can I throttle the initial sync so it doesn't overload the cluster?

Yes. Set max_replicated_fetches_network_bandwidth_for_server (bytes/sec) on the new node before it starts pulling, and optionally max_replicated_sends_network_bandwidth_for_server on the donor nodes to cap how fast they push. Relax both once the bulk sync completes.

Q: How do I remove a replica whose node is already dead and gone?

Use the path-based form, run from a surviving replica: SYSTEM DROP REPLICA '<replica_name>' FROM ZKPATH '/clickhouse/tables/<shard>/<db>/<table>'. This works even when the table no longer exists locally. Afterwards, remove the dead host from remote_servers.