The "DB::Exception: Insert was deduplicated" error (or informational message) in ClickHouse indicates that an INSERT operation was skipped because it was identified as a duplicate of a previous insert. The error code is INSERT_WAS_DEDUPLICATED. This is a feature of ReplicatedMergeTree tables: ClickHouse stores a hash of recently inserted blocks and silently drops (deduplicates) any block that matches a previously seen hash.
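The mechanism described above can be sketched as a toy model in Python. This is only an illustration of hash-based block deduplication, not ClickHouse's implementation: the class name `ToyDedupTable`, the use of SHA-256, and the in-memory set are all assumptions made for the example.

```python
import hashlib


class ToyDedupTable:
    """Toy stand-in for a table that remembers hashes of inserted blocks."""

    def __init__(self):
        self.rows = []            # data actually stored
        self.seen_hashes = set()  # hashes of blocks already accepted

    @staticmethod
    def block_hash(block):
        # Hash the block's content in order. SHA-256 is only a stand-in
        # for whatever hash the server computes internally.
        return hashlib.sha256(repr(block).encode("utf-8")).hexdigest()

    def insert(self, block):
        """Return True if the block was written, False if it was deduplicated."""
        digest = self.block_hash(block)
        if digest in self.seen_hashes:
            return False          # same content seen before: skipped silently
        self.seen_hashes.add(digest)
        self.rows.extend(block)
        return True


table = ToyDedupTable()
block = [(1, "a"), (2, "b")]
first = table.insert(block)                             # written
second = table.insert(block)                            # identical block: skipped
third = table.insert([(1, "a"), (2, "b"), (3, "c")])    # different content: written
```

In this model the second call reports no error and stores nothing, which mirrors why a deduplicated INSERT can look like lost data to a caller that is unaware of the feature.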
Impact
The INSERT appears to succeed (or is silently ignored), but no new data is actually written. While this is often the desired behavior for exactly-once delivery semantics, it can be surprising when you intentionally want to insert the same data again. If you are not aware of deduplication, it may look like data is being lost.
Common Causes
- Retried INSERT with the same data -- A client retried an INSERT after a timeout or transient error, but the original insert had actually succeeded. The retry is deduplicated.
- Identical blocks inserted intentionally -- Inserting the same data twice on purpose (e.g., replaying from a Kafka topic) triggers deduplication.
- Small insert batches with identical content -- If multiple small inserts happen to produce blocks with the same hash, later ones are deduplicated.
- Application restart replaying the same data -- An application that does not track its insert offset may replay previously inserted data on restart.
- ETL pipeline re-execution -- Running the same ETL job twice within the deduplication window re-sends blocks that ClickHouse has already seen.
Troubleshooting and Resolution Steps
Confirm deduplication is the cause. Check the ClickHouse server log for deduplication messages:
grep -i "deduplic" /var/log/clickhouse-server/clickhouse-server.log
Check the deduplication window settings:
SELECT name, value FROM system.merge_tree_settings WHERE name = 'replicated_deduplication_window' OR name = 'replicated_deduplication_window_seconds';
If you need to insert duplicate data intentionally, disable deduplication for the session:
SET insert_deduplicate = 0;
INSERT INTO your_replicated_table VALUES (...);
Alternatively, modify the data slightly to produce a different block hash. Adding a unique column value (like a UUID or timestamp) ensures each block is unique:
INSERT INTO your_table (data_col, insert_id) SELECT data_col, generateUUIDv4() FROM input_source;
Reduce the deduplication window if you frequently need to re-insert similar data:
ALTER TABLE your_table MODIFY SETTING replicated_deduplication_window = 10;
Use a deduplication token instead of relying on block hashes. Setting a distinct insert_deduplication_token per logical batch gives you explicit control over what counts as a duplicate:
INSERT INTO your_table SETTINGS insert_deduplication_token = 'batch-2026-06-08-001' VALUES (...);
Reusing the same token deduplicates; a new token forces the insert to be treated as fresh.
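One way to get a distinct token per logical batch is to derive it from the batch's identity rather than its content. The Python sketch below assumes a batch is identified by a source name, a partition, and an offset range. The helper names `dedup_token` and `insert_with_token` are hypothetical, and the statement is only built as a string, not sent to a server.

```python
import hashlib


def dedup_token(source, partition, first_offset, last_offset):
    """Derive a stable token from the identity of a batch, not its content."""
    key = f"{source}:{partition}:{first_offset}-{last_offset}"
    # Truncated hex digest: short, and safe to place inside single quotes.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def insert_with_token(table, token):
    # Builds only the statement prefix; the client sends the rows separately.
    # The table name is assumed to come from trusted configuration.
    return f"INSERT INTO {table} SETTINGS insert_deduplication_token = '{token}' VALUES"


token_a = dedup_token("events", 0, 100, 199)
token_retry = dedup_token("events", 0, 100, 199)   # same batch: same token
token_b = dedup_token("events", 0, 200, 299)       # next batch: new token
statement = insert_with_token("your_table", token_a)
```

Because the token depends only on the batch identity, a retry of the same offsets is deduplicated, while a later batch with identical rows is still treated as fresh.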
Inspect the replication queue for the table to confirm replication is healthy and unrelated to the dedup behavior:
SELECT * FROM system.replication_queue WHERE database = 'your_db' AND table = 'your_table';
Best Practices
- Understand that deduplication in ReplicatedMergeTree is a feature, not a bug. It provides exactly-once insert semantics, which is valuable for fault-tolerant data pipelines.
- Design your insert batches to have distinct content when each batch should be stored independently. Include a unique identifier column if needed.
- Track insert progress in your application (e.g., Kafka offsets) so you know whether a retry is actually needed.
- Set insert_deduplicate = 0 only when you explicitly need to bypass deduplication, and restore it afterward.
- Monitor deduplication events in logs to detect cases where your pipeline is unnecessarily retrying inserts.
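The retry-related practices above can be sketched as a small Python wrapper that generates one token per logical batch and reuses it for every attempt, so a retry after an ambiguous failure stays safe. The `send` callable, the `TransientError` class, and `insert_with_retry` are hypothetical names for this example; a real client would raise its own exception types.

```python
import uuid


class TransientError(Exception):
    """Hypothetical error raised when the outcome of a send is unknown."""


def insert_with_retry(send, rows, max_attempts=3):
    """Call send(rows, token) until it succeeds, reusing one token for all attempts."""
    token = uuid.uuid4().hex  # generated once per logical batch, not per attempt
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            send(rows, token)
            return token, attempt
        except TransientError as exc:
            last_error = exc  # outcome unknown: retry with the same token
    raise last_error


# Demo with a fake sender that times out once, as if the server had
# accepted the batch but the acknowledgement was lost.
calls = []


def flaky_send(rows, token):
    calls.append(token)
    if len(calls) == 1:
        raise TransientError("timeout after the server accepted the batch")


token, attempts = insert_with_retry(flaky_send, [(1, "a")])
```

Both attempts carry the same token, so a server that deduplicates on it would store the batch once regardless of which attempt actually landed.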
Frequently Asked Questions
Q: Does deduplication apply to non-replicated MergeTree tables?
A: By default, deduplication is a ReplicatedMergeTree feature that uses ZooKeeper/Keeper to track block hashes. Non-replicated MergeTree tables do not deduplicate inserts unless you explicitly configure non_replicated_deduplication_window.
Q: How long does ClickHouse remember previous inserts for deduplication?
A: The deduplication window is controlled by replicated_deduplication_window (default 10000 blocks) and replicated_deduplication_window_seconds (default 3600 seconds / 1 hour). Blocks older than either limit are forgotten. (Defaults have changed across versions, so confirm them on your deployment via system.merge_tree_settings.)
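As a rough illustration of "forgotten after either limit", the Python sketch below models a window bounded by both a block count and an age. It is a simplification under stated assumptions: the class `DedupWindow` is hypothetical, time is passed in explicitly, and the real cleanup runs in the background rather than on every insert.

```python
from collections import OrderedDict


class DedupWindow:
    """Toy window that forgets block ids by count and by age."""

    def __init__(self, max_blocks, max_age_seconds):
        self.max_blocks = max_blocks
        self.max_age_seconds = max_age_seconds
        self.entries = OrderedDict()  # block id -> insert time, oldest first

    def _evict(self, now):
        # Forget entries older than the age limit.
        while self.entries:
            _, inserted_at = next(iter(self.entries.items()))
            if now - inserted_at > self.max_age_seconds:
                self.entries.popitem(last=False)
            else:
                break
        # Forget the oldest entries beyond the count limit.
        while len(self.entries) > self.max_blocks:
            self.entries.popitem(last=False)

    def insert(self, block_id, now):
        """Return True if accepted, False if deduplicated."""
        self._evict(now)
        if block_id in self.entries:
            return False
        self.entries[block_id] = now
        self._evict(now)
        return True


window = DedupWindow(max_blocks=2, max_age_seconds=60)
r1 = window.insert("A", now=0)     # accepted
r2 = window.insert("A", now=10)    # still remembered: deduplicated
r3 = window.insert("B", now=20)    # accepted
r4 = window.insert("C", now=30)    # accepted; "A" is pushed out by the count limit
r5 = window.insert("A", now=40)    # "A" was forgotten, so it is accepted again
r6 = window.insert("A", now=50)    # remembered again: deduplicated
r7 = window.insert("A", now=200)   # aged out: accepted
```

The same block can therefore be deduplicated, accepted, and deduplicated again depending on how much other traffic and time has passed in between.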
Q: Does deduplication work across different replicas?
A: Yes. Deduplication hashes are stored in ZooKeeper/Keeper and are shared across all replicas of the same table. An insert to replica A will be deduplicated on replica B if the same block is sent.
Q: Can I see which blocks were deduplicated?
A: Check the ClickHouse server logs for deduplication entries. You can also query ZooKeeper/Keeper paths for the table's dedup log, though this requires knowledge of the ZooKeeper path structure.