ClickHouse Disaster Recovery Across Two Data Centers

A two-data-center setup for ClickHouse is not the same as classical active-passive database failover. The recommended topology is a single logical cluster that spans both sites, where replicas in the secondary DC stay live but receive no application traffic. The trick is placing the coordination layer, Keeper or ZooKeeper, so that one site holds quorum, the other holds a warm copy, and a documented procedure exists for promoting the secondary site when the primary becomes unreachable.

Topology

Both DCs participate in the same remote_servers block. The application points at DC A replicas; DC B replicas exist but are not used for routine queries.

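<remote_servers>
    <company_cluster>
        <!-- One shard, four replicas: two per data center -->
        <shard>
            <!-- Replication is handled by ReplicatedMergeTree, so a Distributed
                 table should write each insert to a single replica only -->
            <internal_replication>true</internal_replication>
            <replica>
                <host>ch1.dc-a.company.com</host>
                <!-- 9000 is the default native TCP port; adjust if yours differs -->
                <port>9000</port>
            </replica>
            <replica>
                <host>ch2.dc-a.company.com</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch1.dc-b.company.com</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch2.dc-b.company.com</host>
                <port>9000</port>
            </replica>
        </shard>
    </company_cluster>
</remote_servers>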

All four nodes hold the same data via ReplicatedMergeTree. Inter-DC bandwidth is consumed continuously, which is the price of a hot standby.

Keeper placement

Coordination is the part that needs careful planning. ClickHouse Keeper (or ZooKeeper) drives DDL distribution, replication coordination, and replicated RBAC. Quorum requires a strict majority of voting members, so a symmetric three-and-three split is fragile: with six voters a majority is four, and if either DC is cut off, neither side can reach it and coordination stops in both DCs.

The pragmatic layout:

DC A (active): 3 voting Keeper nodes. They form quorum on their own.
DC B (passive): 1 Keeper node configured as an observer. Observers receive state but do not vote, so losing DC B has no effect on quorum.

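This means DC B can fail or be partitioned away without affecting quorum in DC A, and no split-brain can form because DC B holds no voters. The observer in DC B keeps a current copy of the coordination state (not the table data, which lives on the ClickHouse replicas), ready to be promoted.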
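
For ClickHouse Keeper, the layout above might look roughly like the raft_configuration below. This is a sketch rather than a verified production config: the keeper*.dc-a and keeper*.dc-b hostnames, the server ids, and the ports are placeholders, the storage paths are omitted, and it assumes the observer role is expressed with can_become_leader set to false, which the Keeper documentation describes as a learner. With ZooKeeper, the equivalent would be an observer peer.

```xml
<!-- Sketch of the Keeper ensemble during normal operation.
     Hostnames, ids, and ports are illustrative assumptions. -->
<keeper_server>
    <tcp_port>9181</tcp_port>
    <!-- server_id is unique per node and must match one <id> below -->
    <server_id>1</server_id>
    <raft_configuration>
        <!-- DC A: three voting members, enough for quorum on their own -->
        <server>
            <id>1</id>
            <hostname>keeper1.dc-a.company.com</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>2</id>
            <hostname>keeper2.dc-a.company.com</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>3</id>
            <hostname>keeper3.dc-a.company.com</hostname>
            <port>9234</port>
        </server>
        <!-- DC B: learner that receives the replicated log but is not
             expected to vote or count toward quorum -->
        <server>
            <id>4</id>
            <hostname>keeper1.dc-b.company.com</hostname>
            <port>9234</port>
            <can_become_leader>false</can_become_leader>
        </server>
    </raft_configuration>
</keeper_server>
```

The same raft_configuration would be deployed to every Keeper node, with only server_id changing per host.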

Failover when DC A is lost

Promotion is manual and deliberate. Skipping a step leaves the cluster wedged.

Confirm DC A is fully down. Half-down primaries cause split-brain.
Shut down any remaining ClickHouse and Keeper processes in DC A so they cannot rejoin later with stale state.
Reconfigure the Keeper observer in DC B as a voting member and restart it.
Bring up two additional Keeper nodes in DC B so the new ensemble has three voting members.
Update the keeper_server section on the DC B Keeper nodes and the zookeeper section on all DC B ClickHouse nodes to point at the new ensemble, then perform a rolling restart of ClickHouse.
Repoint application traffic to DC B.
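
The configuration side of the promotion steps might look like the sketch below, assuming ClickHouse Keeper and the same placeholder hostnames, ids, and ports (keeper1 to keeper3 in DC B are hypothetical names). The former observer keeps its id and simply drops can_become_leader, and two new members are added. Because the old quorum is gone, the Keeper documentation describes a recovery mode (the rcvr four-letter command or the --force-recovery start argument) for starting a node with a changed membership; check that procedure for your version before relying on this sketch.

```xml
<!-- Sketch of DC B configuration after promotion.
     With a standalone Keeper, keeper_server and zookeeper live in
     separate config files; they are shown together for brevity. -->
<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>4</server_id>
        <raft_configuration>
            <!-- Former observer, now a regular voting member -->
            <server>
                <id>4</id>
                <hostname>keeper1.dc-b.company.com</hostname>
                <port>9234</port>
            </server>
            <!-- Two newly provisioned voting members -->
            <server>
                <id>5</id>
                <hostname>keeper2.dc-b.company.com</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>6</id>
                <hostname>keeper3.dc-b.company.com</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>

    <!-- Client side: every DC B ClickHouse node points at the new ensemble -->
    <zookeeper>
        <node>
            <host>keeper1.dc-b.company.com</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper2.dc-b.company.com</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper3.dc-b.company.com</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>
```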

When DC A returns, treat its nodes as fresh replicas: clear local data on the ones that have diverged, let ReplicatedMergeTree re-sync from DC B, and reverse roles before failing back.

Configuration management

Drift between sites is the most common failure mode in two-DC setups. Keep the configuration source of truth outside the servers:

Use ON CLUSTER for every DDL change so both sites apply the same statement.
Store RBAC in Keeper using a <replicated> user directory so users and grants propagate automatically. See the RBAC replication article for the exact config.
Manage XML and macros through Ansible, Puppet, or another config management tool. Apply changes to both DCs in the same run.
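
As an illustration of what the config management tool would template onto each node, a minimal sketch follows. The Keeper path /clickhouse/access/ and the macro values are assumptions chosen for the example, not values taken from this setup; the RBAC replication article remains the reference for the exact user directory config.

```xml
<!-- Sketch of per-node settings rendered by config management.
     Only the <replica> macro should differ between hosts. -->
<clickhouse>
    <user_directories>
        <!-- Local users.xml stays available for bootstrap accounts -->
        <users_xml>
            <path>users.xml</path>
        </users_xml>
        <!-- Users, roles, and grants created via SQL are stored in Keeper -->
        <replicated>
            <zookeeper_path>/clickhouse/access/</zookeeper_path>
        </replicated>
    </user_directories>

    <!-- Macros referenced by ReplicatedMergeTree paths, e.g. {shard} and {replica} -->
    <macros>
        <shard>01</shard>
        <replica>ch1.dc-b.company.com</replica>
    </macros>
</clickhouse>
```

Rendering both sites from the same template in one run keeps the only per-host difference, the replica macro, explicit and reviewable.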

Keeper vs. ZooKeeper

ClickHouse Keeper is optimized for ClickHouse workloads, uses less memory, and avoids the JVM operational footprint. ZooKeeper is still a valid choice when an existing operations team already runs it. The DR procedure is the same either way: observer in the standby DC, ensemble in the active DC, manual promotion at failover time.

Common Pitfalls

Splitting Keeper 3 and 3 across DCs. Loss of either site collapses quorum.
Forgetting to shut down DC A completely before promoting DC B. With two writable sites, the ReplicatedMergeTree znodes diverge.
Letting DC A rejoin without resync. Stale parts conflict with the now-authoritative DC B history.
Manual XML edits on individual nodes. Drift accumulates and surfaces during failover.
Sizing inter-DC bandwidth for steady state only. Backfills and large merges spike well above baseline.
Using asynchronous Distributed inserts between DCs and assuming durability. Pending data is queued on the local disk of the node that received the insert and is lost if that node dies before forwarding it.
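
For the last pitfall, one possible mitigation is to make Distributed inserts synchronous so the client only gets an acknowledgement after the data has reached the target shard. The profile below is a sketch under that assumption. It uses the insert_distributed_sync setting, which recent releases appear to document under the name distributed_foreground_insert, so check which name your version expects. Synchronous inserts trade latency for safety, so it may be preferable to enable this per query or per user rather than globally.

```xml
<!-- Sketch of a users.xml-style profile; applying it to the default
     profile is an assumption made for brevity. -->
<clickhouse>
    <profiles>
        <default>
            <!-- Wait until the insert is written on the target shard
                 instead of queueing it on the initiator's local disk -->
            <insert_distributed_sync>1</insert_distributed_sync>
        </default>
    </profiles>
</clickhouse>
```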

Frequently Asked Questions

Q: Can I run two independent clusters, one per DC, and replicate between them? A: That is a different design (cross-cluster copy or MaterializedView shipping). It is more complex to operate than a single stretched cluster and offers similar RPO. Prefer the stretched cluster unless inter-DC latency is too high.

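Q: What are the typical RPO and RTO? A: RPO is near zero as long as replication lag is low: ReplicatedMergeTree ships parts continuously but asynchronously, so parts that DC B has not yet fetched when DC A fails can be lost. RTO is bounded by how long the Keeper promotion procedure takes, typically minutes if scripted.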

Q: Does the observer need to be the same Keeper version as the voters? A: Yes. Run identical Keeper versions across all members. Mismatches cause subtle wire-protocol issues during promotion.

Q: How do I prevent the application from reading from DC B during normal operation? A: Use load balancer or DNS-level routing rather than ClickHouse-level controls. The cluster definition lists all four replicas because they all need to be reachable for replication.

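Q: Is SYSTEM RESTART REPLICA enough to recover a DC A replica before failback? A: Only if its local data is consistent with DC B. If parts have diverged, remove the stale replica's metadata from Keeper with SYSTEM DROP REPLICA, issued from a healthy DC B replica because the statement cannot drop the local replica, then let the recreated replica re-clone from DC B.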