ReplicatedMergeTree in ClickHouse: Replication via ClickHouse Keeper

What is ReplicatedMergeTree?

ReplicatedMergeTree is a MergeTree-family engine that replicates data parts across multiple ClickHouse servers using ClickHouse Keeper (or ZooKeeper 3.4.5+) as the coordination service. Each replica writes parts locally and registers them in Keeper; other replicas detect new entries in the replication queue and pull the parts from the original writer. The engine is the foundation for high-availability ClickHouse: combined with a Distributed table on top, it gives you per-shard fault tolerance and multi-replica read parallelism. Every Replicated* engine in the MergeTree family (ReplicatedReplacingMergeTree, ReplicatedAggregatingMergeTree, etc.) follows the same coordination pattern.

How ReplicatedMergeTree Works

A ReplicatedMergeTree table is created with a Keeper path and a replica name, both supporting {shard}, {replica}, {database}, {table} macros defined in config.xml:

CREATE TABLE events ON CLUSTER my_cluster
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',  -- Keeper path (same per shard)
    '{replica}'                            -- replica name (unique per replica)
)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id);

When a replica receives an INSERT, it writes the part to its own disk, registers a GET_PART log entry in Keeper, and acknowledges the client. By default the insert is acknowledged after a single replica writes; setting insert_quorum = N (or insert_quorum = 'auto') makes the insert wait until at least N replicas confirm the part. Background merges follow the same protocol: a designated leader (any replica) decides a merge, registers a MERGE_PARTS entry, and replicas execute it locally. Deduplication is automatic for the last 100 blocks per partition based on block hash.

The Keeper directory tree under '/clickhouse/tables/{shard}/events' stores replica registrations, the replication log, mutation queue, leader election, and metadata. Losing Keeper means losing the ability to coordinate new writes; existing replicas continue to serve reads.

ReplicatedMergeTree vs MergeTree

Feature MergeTree ReplicatedMergeTree
Storage Local only Local + replicated via Keeper
Coordination service None ClickHouse Keeper or ZooKeeper
Insert path Direct to disk Disk + Keeper log entry
HA on node failure None - data is local Reads continue from peer replicas
Insert acknowledgement After local write After local write (or quorum if set)
Background merges Local Coordinated through Keeper log
Deduplication on INSERT None Last 100 block hashes per partition
Required CREATE TABLE args None Keeper path + replica name

A common production layout is a Distributed table over a sharded set of ReplicatedMergeTree tables: the distributed table fans out reads/writes across shards, and each shard has 2-3 ReplicatedMergeTree replicas for failover.

Common Pitfalls

  1. Reusing the same Keeper path across two different physical tables - this corrupts replication state. Always include {shard} and a unique table name segment in the path.
  2. Forgetting ON CLUSTER when running CREATE TABLE - the table will only exist on one replica and replication will not start until you create it on every other node.
  3. Running ZooKeeper on the same hosts as ClickHouse without isolating I/O. ZK write latency directly bounds insert throughput.
  4. Setting insert_quorum = N higher than the number of healthy replicas - inserts hang until a replica recovers or you lower the quorum.
  5. Restoring a single replica from backup without SYSTEM SYNC REPLICA afterward. Replication queue lag silently grows until you trigger a re-sync.

Monitoring ReplicatedMergeTree

Two system tables are essential:

-- Per-replica health, lag, and queue depth
SELECT database, table, is_leader, is_readonly, future_parts,
       absolute_delay, queue_size, log_max_index, log_pointer
FROM system.replicas
WHERE table = 'events';

-- Detailed queue: what each replica is waiting to do
SELECT type, create_time, num_tries, last_exception
FROM system.replication_queue
WHERE table = 'events' ORDER BY create_time DESC;

Useful metrics: absolute_delay (seconds behind the leader), queue_size (pending operations), and future_parts (parts about to materialize). A non-empty last_exception on any row of system.replication_queue is a strong signal that replication is stuck - common causes include "part not found on source replica," Keeper timeouts, and disk-full errors.

Pulse monitors these tables continuously across every replica in a cluster, alerts on replication lag drift, and performs AI-driven root-cause analysis when replication stalls - distinguishing between Keeper saturation, network partition between specific replicas, and per-disk I/O bottlenecks. For repeat failures like "no active replicas" or "part not found," Pulse can auto-trigger SYSTEM RESTART REPLICA and SYSTEM SYNC REPLICA after confirming it is safe.

Frequently Asked Questions

Q: What is the difference between ReplicatedMergeTree and MergeTree?
A: MergeTree is a local-only table engine - each node's data is independent. ReplicatedMergeTree adds Keeper-coordinated replication so that data parts are automatically synchronized across replicas. The CREATE TABLE signature is the same except ReplicatedMergeTree requires a Keeper path and a replica name as the engine arguments.

Q: Does ReplicatedMergeTree require ZooKeeper?
A: It requires a coordination service - either ClickHouse Keeper (recommended, ships with ClickHouse) or ZooKeeper version 3.4.5 or newer. ClickHouse Keeper speaks the ZooKeeper wire protocol and is a drop-in replacement; new deployments should standardize on Keeper.

Q: How many replicas should I run per shard?
A: Two replicas per shard tolerates a single node failure; three replicas tolerates one failure while still being able to satisfy insert_quorum = 2. More than three replicas is rare and usually serves read-throughput goals rather than availability.

Q: How do I guarantee an INSERT is durable across replicas before acknowledging?
A: Set insert_quorum = N in the session or in a settings profile. The insert waits until N replicas confirm the part. Use insert_quorum = 'auto' to require a majority. Combine with select_sequential_consistency = 1 for read-after-write consistency.

Q: What happens to data on a replica that has been offline for a long time?
A: When the replica comes back, it reads the replication queue from Keeper and pulls missing parts from other replicas. If the offline window exceeded merge_tree_replicated_max_replicated_logs_to_keep (default 1000) the replica becomes "lost" and needs SYSTEM RESTORE REPLICA to rebuild from a peer.

Q: Can I convert a non-replicated MergeTree table to ReplicatedMergeTree?
A: Yes, with an in-place conversion via ATTACH TABLE: create a new ReplicatedMergeTree with the same schema, detach the partitions from the old table, and attach them to the new one. There is also experimental support for in-place engine swap; verify behavior on your ClickHouse version before relying on it in production.

Q: How do I monitor ReplicatedMergeTree replication lag?
A: Query system.replicas for absolute_delay and queue_size, and check system.replication_queue for stuck entries with non-empty last_exception. Lag spikes that don't recover within minutes usually indicate Keeper pressure, slow disk on the lagging replica, or network issues between specific nodes.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.