ClickHouse replication operates at the table level, not the node level. Each replicated table independently tracks its own replication state, which means a single ClickHouse server can host some replicated tables and some non-replicated tables side by side. Replication is only available for the MergeTree engine family - ReplicatedMergeTree, ReplicatedSummingMergeTree, ReplicatedReplacingMergeTree, and so on. When a table uses one of these engines, ClickHouse coordinates all replication through an external distributed coordination service, either Apache ZooKeeper or the built-in ClickHouse Keeper.
The coordination layer stores replication metadata only - it does not hold actual row data. What ZooKeeper or Keeper tracks is the log of operations: which data parts exist, which block numbers have been assigned to prevent duplicate inserts, pending mutations, and the current state of each replica. The actual columnar data lives on the ClickHouse server's local filesystem. When a replica falls behind, it fetches missing parts directly from a peer replica over HTTP, not through the coordination service. This design keeps Keeper's data volume small and its latency out of the read path entirely, though Keeper remains on the critical path for writes: an insert into a replicated table commits metadata to Keeper before it is acknowledged.
How ReplicatedMergeTree Coordinates Writes
When a client inserts data into a replicated table, the receiving replica writes the data part to its local filesystem and then records the part in ZooKeeper's replication log. Other replicas poll this log, see the new entry, and fetch the part from whatever replica already has it - typically the one that first received the insert. If multiple replicas are active and healthy, any of them can serve as the source for a part fetch. Block numbers are allocated monotonically per partition, and the replica that writes to Keeper first "wins" a given block number. Deduplication, however, is based on block content: ClickHouse stores a hash of each inserted block in Keeper (a window of recent blocks, 100 by default, controlled by the replicated_deduplication_window table setting), so if a client retries an insert after a network timeout, the identical block hashes to the same value and the retry is silently dropped. Note that deduplication is scoped to a single shard - it does not prevent duplicates if retries are routed to different shards.
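A minimal illustration of that retry behavior, assuming the replicated events table created later in this section and the default insert_deduplicate = 1:

INSERT INTO events VALUES ('2024-03-01 12:00:00', 42, 'click', '{}');

-- A byte-identical retry within the deduplication window hashes to the same
-- block id and is silently dropped instead of creating a duplicate part.
INSERT INTO events VALUES ('2024-03-01 12:00:00', 42, 'click', '{}');

SELECT count() FROM events WHERE user_id = 42;  -- returns 1, not 2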
Merges are a separate coordination challenge. Any replica acting as a leader can decide which parts to merge and when — and multiple replicas can hold leader status simultaneously, each independently scheduling background merges. A leader writes a merge plan to the replication log, and all replicas execute the same merge independently on their local data — producing bit-for-bit identical parts. This avoids shipping merged parts across the network; every replica computes the merge itself. The trade-off is that merges require all replicas to have all the source parts before they can execute. A lagging replica that is missing parts will have its merge operations stuck in queue, waiting for a fetch to complete first.
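Background merges that are currently executing are visible in system.merges on each replica, which is a quick way to confirm that a scheduled merge is actually running locally rather than waiting on a fetch:

SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    result_part_name,
    is_mutation
FROM system.merges;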
Mutations (ALTER TABLE ... UPDATE, ALTER TABLE ... DELETE) follow the same log-based pattern. A mutation entry is written to ZooKeeper, and each replica applies it independently. Mutations in ClickHouse are not transactional in the traditional sense - they rewrite parts on disk asynchronously. You can track mutation progress in system.mutations.
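A short sketch, assuming a replicated table named events with a user_id column (matching the table created later in this section):

ALTER TABLE events DELETE WHERE user_id = 42;

-- Mutations rewrite parts in the background; each row here is one mutation,
-- and parts_to_do counts the parts that still need rewriting.
SELECT database, table, mutation_id, command, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE NOT is_done;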
ClickHouse Keeper: The Built-In Coordination Layer
ClickHouse Keeper is a from-scratch C++ implementation of ZooKeeper's data model and client protocol, using the Raft consensus algorithm instead of ZooKeeper's ZAB. It was introduced as an experimental feature and has been production-ready since ClickHouse 22.3 (the March 2022 LTS release). For any new deployment, Keeper is the right default. It ships inside the ClickHouse server binary itself, which means you can run Keeper on your ClickHouse nodes without a separate JVM process, though it can also be deployed standalone on dedicated nodes.
Keeper requires an odd number of nodes to maintain quorum - 3 nodes tolerate one failure, 5 nodes tolerate two. The quorum requirement is strict: if you lose enough Keeper nodes to drop below quorum, all replicated table writes in the cluster will block until quorum is restored. Reads can still proceed. A 3-node Keeper ensemble is the minimum practical setup for any production cluster where you care about write availability during a single node loss.
Compared to ZooKeeper, Keeper uses compressed snapshots and logs, which reduces disk space significantly. Its memory footprint is substantially lower than ZooKeeper for the same workload, which matters for large clusters with many replicated tables where ZooKeeper's heap usage can become a tuning problem.
Configure Keeper in your config.xml (or a file in config.d/):
<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>30000</session_timeout_ms>
    </coordination_settings>
    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>ch-node-1</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>2</id>
            <hostname>ch-node-2</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>3</id>
            <hostname>ch-node-3</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>
Then point ClickHouse at Keeper using the zookeeper config section (the client protocol is identical):
<zookeeper>
    <node>
        <host>ch-node-1</host>
        <port>9181</port>
    </node>
    <node>
        <host>ch-node-2</host>
        <port>9181</port>
    </node>
    <node>
        <host>ch-node-3</host>
        <port>9181</port>
    </node>
</zookeeper>
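After restarting the servers, you can confirm that ClickHouse can actually reach the ensemble by reading Keeper's namespace through the system.zookeeper table (it requires an explicit path filter):

-- Lists the children of the Keeper root; an error here means the connection is not working.
SELECT name, value
FROM system.zookeeper
WHERE path = '/';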
Creating Replicated Tables
The ReplicatedMergeTree engine takes two parameters: the ZooKeeper path where this table's replication metadata lives, and a replica name unique within that path. The ZooKeeper path must be identical across all replicas of the same table shard. The replica name must differ.
The idiomatic approach uses macros defined per-node in the server config:
<!-- On node ch-node-1 -->
<macros>
    <cluster>production</cluster>
    <shard>01</shard>
    <replica>ch-node-1</replica>
</macros>
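The second replica of the same shard gets an identical shard value and its own replica name; a sketch for the peer node, assuming the hostnames used throughout this section:

<!-- On node ch-node-2 -->
<macros>
    <cluster>production</cluster>
    <shard>01</shard>
    <replica>ch-node-2</replica>
</macros>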
Then the CREATE TABLE statement uses those macros:
CREATE TABLE events
(
    event_time DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/{cluster}/tables/{shard}/{database}/{table}',
    '{replica}'
)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, user_id, event_time);
Run this exact CREATE TABLE statement on every replica node. ClickHouse substitutes the macros at DDL execution time, so {shard} expands to 01 on both nodes (same shard), while {replica} expands to ch-node-1 and ch-node-2 respectively. If you have multiple shards, each shard gets a distinct {shard} macro value, which produces different ZooKeeper paths - meaning the two shards do not replicate to each other, which is the correct behavior.
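If the cluster topology is also defined under remote_servers, you can avoid running the statement by hand on each node: an ON CLUSTER DDL executes it everywhere for you. A sketch, assuming the cluster is named production to match the {cluster} macro:

CREATE TABLE events ON CLUSTER production
(
    event_time DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/{cluster}/tables/{shard}/{database}/{table}',
    '{replica}'
)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, user_id, event_time);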
Avoid reusing ZooKeeper paths across tables with different schemas. If you drop and recreate a table with the same path, ClickHouse may attempt to reconcile the old replication state with the new schema, producing confusing errors. Use a versioned path segment or append the table name explicitly.
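To see what ClickHouse actually keeps under a table's path, you can browse it through the system.zookeeper table; a sketch assuming the events table above, the production cluster macro, and the default database:

SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/production/tables/01/default/events';
-- Typical children include: metadata, columns, log, blocks, block_numbers,
-- mutations, replicas, and leader_election.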
Monitoring Replication Health and Lag
system.replicas is the primary view into replication state for every replicated table on the local server:
SELECT
    database,
    table,
    is_leader,
    is_readonly,
    absolute_delay,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    last_queue_update_exception,
    zookeeper_exception
FROM system.replicas
WHERE absolute_delay > 0
   OR is_readonly = 1
   OR last_queue_update_exception != ''
   OR zookeeper_exception != ''
ORDER BY absolute_delay DESC;
absolute_delay reports, in seconds, how far behind this replica is on its own replication queue (the related relative_delay column compares that lag against the most up-to-date replica). A non-zero value is normal during heavy ingestion; a value that keeps growing points to a fetch or merge backlog. is_readonly being 1 means the replica cannot accept writes - this typically happens when the Keeper connection is lost or when the replica's local metadata has diverged from what Keeper expects.
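If you need to wait for a lagging replica to work through its queue before querying it (for example after a restart, or before shifting traffic to it), SYSTEM SYNC REPLICA blocks until the local replication queue has been processed:

SYSTEM SYNC REPLICA events;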
system.replication_queue shows the pending operations each replica needs to execute:
SELECT
    database,
    table,
    type,
    num_tries,
    last_exception,
    last_attempt_time,
    postpone_reason
FROM system.replication_queue
WHERE last_exception != ''
ORDER BY last_attempt_time DESC
LIMIT 20;
A growing queue with repeated exceptions in last_exception is the clearest signal of a stuck replica. Common causes are missing source parts (the replica that held the part is down and no other has it), network issues preventing HTTP fetches between replicas, or a Keeper node lagging behind the leader causing stale reads.
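Part fetches that are currently in flight show up in system.replicated_fetches, which helps distinguish a slow transfer from one that is not happening at all:

SELECT
    database,
    table,
    elapsed,
    progress,
    result_part_name,
    source_replica_hostname,
    total_size_bytes_compressed
FROM system.replicated_fetches;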
DETACHED parts — stored under /var/lib/clickhouse/data/<database>/<table>/detached/ (for Atomic databases, this is a symlink into the store/ directory) — appear when ClickHouse cannot attach a part during startup or when it detects a part as broken. Parts with a prefix of ignored_ were superseded by a larger merged part and are safe to delete. Parts with broken_ or unexpected_ prefixes need investigation before deletion; they may indicate disk corruption or a failed fetch. Query them via system.detached_parts.
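A cleanup sketch; 'part_name_here' is a placeholder for the name reported by system.detached_parts, and the allow_drop_detached guard is required before ClickHouse will delete anything from the detached directory:

-- Inspect what is sitting in detached/ and why
SELECT database, table, name, reason, disk
FROM system.detached_parts;

-- Deleting from detached requires an explicit opt-in setting
SET allow_drop_detached = 1;
ALTER TABLE events DROP DETACHED PART 'part_name_here';  -- use the name reported above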
Recovering a Diverged or Stuck Replica
The gentlest first step is a replica restart at the table level:
SYSTEM RESTART REPLICA events;
This detaches and re-attaches the table internally, re-reading state from Keeper without any data loss. If the queue is stuck on a part that no active replica has, the remaining option is usually to give up on that range of data: dropping the affected partition removes the covered parts and clears the queue entries that reference them:
-- Find the stuck queue entries
SELECT * FROM system.replication_queue WHERE last_exception LIKE '%No active replica%';
-- Give up on the affected range; this also clears the stuck queue entries
ALTER TABLE events DROP PARTITION 202403;
When a replica's Keeper metadata has been lost or no longer matches the data on disk - usually after prolonged Keeper unavailability or a Keeper restore - the table goes read-only, and the recovery path is to rebuild the replica's metadata in Keeper from the parts still on local disk:
-- Run on the broken replica
SYSTEM RESTORE REPLICA events;
SYSTEM RESTORE REPLICA moves the local parts to the detached directory, re-registers the replica's metadata in Keeper, then reattaches those parts without fetching them from peers; it only operates on tables that are in read-only mode. The opposite failure - local data lost (for example after a disk failure) while the Keeper metadata survived - needs no special command: the replica catches up by fetching missing parts from peers once it reconnects to Keeper and processes the replication log (if the local data differs too much from what Keeper expects, ClickHouse asks for the force_restore_data flag before it will start the table). If you want to preserve a snapshot before any recovery, use ALTER TABLE events FREEZE first - it creates a hard-linked backup under shadow/ without interrupting reads.
One failure mode worth knowing: if all replicas of a shard lose their Keeper metadata simultaneously (after a Keeper cluster is wiped and restored from an incomplete backup, for example), ClickHouse will see a ZooKeeper path that doesn't match the local parts. In this case you need to drop the replica metadata from Keeper, re-initialize it, and then run SYSTEM RESTORE REPLICA to rebuild the part list from local disk. The SYSTEM DROP REPLICA DDL command handles the Keeper cleanup step without requiring manual ZooKeeper node manipulation.
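A sketch of that cleanup step; the replica name and paths are illustrative, and the command must not target a replica that is still alive and hosting the table locally:

-- Remove a stale replica's metadata from Keeper, addressed via the table...
SYSTEM DROP REPLICA 'ch-node-2' FROM TABLE default.events;

-- ...or directly via the table's Keeper path if the table no longer exists locally
SYSTEM DROP REPLICA 'ch-node-2' FROM ZKPATH '/clickhouse/production/tables/01/default/events';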