What is ReplicatedMergeTree?
ReplicatedMergeTree is a MergeTree-family engine that replicates data parts across multiple ClickHouse servers using ClickHouse Keeper (or ZooKeeper 3.4.5+) as the coordination service. Each replica writes parts locally and registers them in Keeper; other replicas detect new entries in the replication queue and pull the parts from the original writer. The engine is the foundation for high-availability ClickHouse: combined with a Distributed table on top, it gives you per-shard fault tolerance and multi-replica read parallelism. Every Replicated* engine in the MergeTree family (ReplicatedReplacingMergeTree, ReplicatedAggregatingMergeTree, etc.) follows the same coordination pattern.
How ReplicatedMergeTree Works
A ReplicatedMergeTree table is created with a Keeper path and a replica name, both supporting {shard}, {replica}, {database}, {table} macros defined in config.xml:
CREATE TABLE events ON CLUSTER my_cluster
(
event_time DateTime,
event_type LowCardinality(String),
user_id UInt64,
payload String
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events', -- Keeper path (same per shard)
'{replica}' -- replica name (unique per replica)
)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id);
When a replica receives an INSERT, it writes the part to its own disk, registers a GET_PART log entry in Keeper, and acknowledges the client. By default the insert is acknowledged after a single replica writes; setting insert_quorum = N (or insert_quorum = 'auto') makes the insert wait until at least N replicas confirm the part. Background merges follow the same protocol: a designated leader (any replica) decides a merge, registers a MERGE_PARTS entry, and replicas execute it locally. Deduplication is automatic for the last 100 blocks per partition based on block hash.
The Keeper directory tree under '/clickhouse/tables/{shard}/events' stores replica registrations, the replication log, mutation queue, leader election, and metadata. Losing Keeper means losing the ability to coordinate new writes; existing replicas continue to serve reads.
ReplicatedMergeTree vs MergeTree
| Feature | MergeTree | ReplicatedMergeTree |
|---|---|---|
| Storage | Local only | Local + replicated via Keeper |
| Coordination service | None | ClickHouse Keeper or ZooKeeper |
| Insert path | Direct to disk | Disk + Keeper log entry |
| HA on node failure | None - data is local | Reads continue from peer replicas |
| Insert acknowledgement | After local write | After local write (or quorum if set) |
| Background merges | Local | Coordinated through Keeper log |
| Deduplication on INSERT | None | Last 100 block hashes per partition |
Required CREATE TABLE args |
None | Keeper path + replica name |
A common production layout is a Distributed table over a sharded set of ReplicatedMergeTree tables: the distributed table fans out reads/writes across shards, and each shard has 2-3 ReplicatedMergeTree replicas for failover.
Common Pitfalls
- Reusing the same Keeper path across two different physical tables - this corrupts replication state. Always include
{shard}and a unique table name segment in the path. - Forgetting
ON CLUSTERwhen runningCREATE TABLE- the table will only exist on one replica and replication will not start until you create it on every other node. - Running ZooKeeper on the same hosts as ClickHouse without isolating I/O. ZK write latency directly bounds insert throughput.
- Setting
insert_quorum = Nhigher than the number of healthy replicas - inserts hang until a replica recovers or you lower the quorum. - Restoring a single replica from backup without
SYSTEM SYNC REPLICAafterward. Replication queue lag silently grows until you trigger a re-sync.
Monitoring ReplicatedMergeTree
Two system tables are essential:
-- Per-replica health, lag, and queue depth
SELECT database, table, is_leader, is_readonly, future_parts,
absolute_delay, queue_size, log_max_index, log_pointer
FROM system.replicas
WHERE table = 'events';
-- Detailed queue: what each replica is waiting to do
SELECT type, create_time, num_tries, last_exception
FROM system.replication_queue
WHERE table = 'events' ORDER BY create_time DESC;
Useful metrics: absolute_delay (seconds behind the leader), queue_size (pending operations), and future_parts (parts about to materialize). A non-empty last_exception on any row of system.replication_queue is a strong signal that replication is stuck - common causes include "part not found on source replica," Keeper timeouts, and disk-full errors.
Pulse monitors these tables continuously across every replica in a cluster, alerts on replication lag drift, and performs AI-driven root-cause analysis when replication stalls - distinguishing between Keeper saturation, network partition between specific replicas, and per-disk I/O bottlenecks. For repeat failures like "no active replicas" or "part not found," Pulse can auto-trigger SYSTEM RESTART REPLICA and SYSTEM SYNC REPLICA after confirming it is safe.
Frequently Asked Questions
Q: What is the difference between ReplicatedMergeTree and MergeTree?
A: MergeTree is a local-only table engine - each node's data is independent. ReplicatedMergeTree adds Keeper-coordinated replication so that data parts are automatically synchronized across replicas. The CREATE TABLE signature is the same except ReplicatedMergeTree requires a Keeper path and a replica name as the engine arguments.
Q: Does ReplicatedMergeTree require ZooKeeper?
A: It requires a coordination service - either ClickHouse Keeper (recommended, ships with ClickHouse) or ZooKeeper version 3.4.5 or newer. ClickHouse Keeper speaks the ZooKeeper wire protocol and is a drop-in replacement; new deployments should standardize on Keeper.
Q: How many replicas should I run per shard?
A: Two replicas per shard tolerates a single node failure; three replicas tolerates one failure while still being able to satisfy insert_quorum = 2. More than three replicas is rare and usually serves read-throughput goals rather than availability.
Q: How do I guarantee an INSERT is durable across replicas before acknowledging?
A: Set insert_quorum = N in the session or in a settings profile. The insert waits until N replicas confirm the part. Use insert_quorum = 'auto' to require a majority. Combine with select_sequential_consistency = 1 for read-after-write consistency.
Q: What happens to data on a replica that has been offline for a long time?
A: When the replica comes back, it reads the replication queue from Keeper and pulls missing parts from other replicas. If the offline window exceeded merge_tree_replicated_max_replicated_logs_to_keep (default 1000) the replica becomes "lost" and needs SYSTEM RESTORE REPLICA to rebuild from a peer.
Q: Can I convert a non-replicated MergeTree table to ReplicatedMergeTree?
A: Yes, with an in-place conversion via ATTACH TABLE: create a new ReplicatedMergeTree with the same schema, detach the partitions from the old table, and attach them to the new one. There is also experimental support for in-place engine swap; verify behavior on your ClickHouse version before relying on it in production.
Q: How do I monitor ReplicatedMergeTree replication lag?
A: Query system.replicas for absolute_delay and queue_size, and check system.replication_queue for stuck entries with non-empty last_exception. Lag spikes that don't recover within minutes usually indicate Keeper pressure, slow disk on the lagging replica, or network issues between specific nodes.
Related Reading
- MergeTree Engine - the base engine without replication
- AggregatingMergeTree - aggregating variant that also has a Replicated form
- insert_quorum Setting - durability guarantees on INSERT
- No Active Replicas Error - common Replicated* failure mode
- Cannot Create New ZooKeeper Session - Keeper connectivity errors
- ClickHouse Documentation Hub - index of all ClickHouse KB pages
- ClickHouse Settings Profile - applying per-user quorum and timeout settings