
ClickHouse DB::Exception: Checksum doesn't match

The "DB::Exception: Checksum doesn't match" error in ClickHouse indicates that a block of data failed its integrity check. Every compressed block in ClickHouse carries a checksum of its compressed bytes, and the CHECKSUM_DOESNT_MATCH error fires when the checksum computed at read time does not match the value stored alongside the block. This is a strong indicator of data corruption, whether caused by hardware faults, incomplete writes, or transmission errors.
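The mechanism can be sketched in a few lines of Python. This is a conceptual illustration only, not ClickHouse's on-disk format: ClickHouse uses CityHash128 over each compressed block, while SHA-256 stands in for it here, and the function names are made up for the example.

```python
import hashlib
import zlib

def write_block(raw: bytes) -> tuple[bytes, bytes]:
    """Compress data and compute the checksum stored alongside it."""
    compressed = zlib.compress(raw)
    checksum = hashlib.sha256(compressed).digest()  # stand-in for CityHash128
    return compressed, checksum

def read_block(compressed: bytes, stored_checksum: bytes) -> bytes:
    """Verify the checksum of the compressed bytes before decompressing."""
    if hashlib.sha256(compressed).digest() != stored_checksum:
        raise ValueError("Checksum doesn't match: data is corrupted")
    return zlib.decompress(compressed)

block, checksum = write_block(b"example row data")
assert read_block(block, checksum) == b"example row data"

# Flip one bit to simulate disk corruption: verification now fails,
# and decompression is never even attempted.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]
try:
    read_block(corrupted, checksum)
except ValueError as e:
    print(e)  # Checksum doesn't match: data is corrupted
```

The key point the sketch captures: the checksum is verified before decompression, so a single flipped bit anywhere in the stored block makes the read fail fast instead of silently returning garbage.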

Impact

This error causes the affected query to fail immediately. If corruption is localized to specific parts, only queries that touch those parts will be impacted. However, background merges that involve corrupted parts will also fail, which can lead to part count growth and eventual "too many parts" errors. In replicated setups, this can block replication progress for the affected partition.

Common Causes

  1. Disk hardware failure or degradation, including bad sectors, failing SSDs, or unreliable storage controllers
  2. Data corruption during network transfer in replicated environments
  3. Incomplete or interrupted writes due to power loss or process crash during a merge or insert
  4. Filesystem corruption caused by unexpected shutdowns or kernel bugs
  5. Faulty RAM causing bit flips during compression or decompression
  6. Manual tampering with data files on disk

Troubleshooting and Resolution Steps

  1. Identify the affected table and parts from the error message. The error typically includes the file path of the corrupted part.
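If the message only gives a file path, the part name from that path can be mapped back to its table via the system.parts table. A sketch, with 'corrupted_part_name' as a placeholder taken from your error message:

```sql
-- Locate the table and partition that own a suspect part
SELECT database, table, partition_id, name, path, active
FROM system.parts
WHERE name = 'corrupted_part_name';
```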

  2. Check disk health:

    smartctl -a /dev/sda
    dmesg | grep -i "error\|fault\|bad"
    

    Look for signs of hardware issues such as reallocated sectors or I/O errors.

  3. Verify the part using ClickHouse's built-in check:

    CHECK TABLE your_database.your_table PARTITION 'partition_id';
    

    This will report which parts have checksum mismatches.
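By default, CHECK TABLE may collapse the result into a single pass/fail value. Setting check_query_single_value_result = 0 returns one row per part, which makes it easier to pinpoint exactly which parts are damaged (placeholder names as above):

```sql
-- One result row per part instead of a single aggregate value
CHECK TABLE your_database.your_table
SETTINGS check_query_single_value_result = 0;
```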

  4. For replicated tables, ClickHouse's background part-check thread will often recover on its own: a part that fails its checksum is renamed aside with a broken_ prefix and a fresh copy is fetched from a healthy replica. If that does not happen, you can trigger recovery manually by moving the corrupted part into the table's detached/ directory on the affected replica and restarting the server:

    mv /var/lib/clickhouse/data/your_database/your_table/corrupted_part_name \
       /var/lib/clickhouse/data/your_database/your_table/detached/
    

    After the restart, ClickHouse detects the missing part and downloads a fresh copy from a healthy replica. Note that ALTER TABLE ... DETACH PART is replicated to all replicas, so the filesystem-level move is the way to act on a single broken replica.

  5. For non-replicated tables, restore from backup:

    # Remove the corrupted part directory
    rm -rf /var/lib/clickhouse/data/your_database/your_table/corrupted_part_name/
    # Restore from backup
    

    Then place the restored part directory in the table's detached/ folder and attach it (or restart ClickHouse) to pick up the restored data.
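Assuming the restored part directory has been placed under the table's detached/ folder, it can be re-attached without a restart (same placeholder names as above):

```sql
-- Re-attach a part previously restored into .../your_table/detached/
ALTER TABLE your_database.your_table ATTACH PART 'corrupted_part_name';
```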

  6. If no backup or replica is available, you can attempt to skip the corrupted part:

    ALTER TABLE your_table DROP PART 'corrupted_part_name';
    

    This will result in data loss for that part, but it allows the rest of the table to function normally.

  7. Run a memory test if you suspect RAM issues, for example by booting the machine into memtest86+ from rescue media, or by testing a slice of free RAM on the live system with memtester:

    memtester 1024M 1
    

Best Practices

  • Use replicated table engines (ReplicatedMergeTree) so that corrupted parts can be automatically recovered from healthy replicas.
  • Enable and monitor RAID or use storage systems with built-in redundancy.
  • Set up alerting on system.part_log for events with error > 0 to catch corruption early.
  • Maintain regular backups using ALTER TABLE ... FREEZE or ClickHouse's backup functionality.
  • Run periodic CHECK TABLE commands as part of your maintenance routine.
  • Use ECC RAM to prevent bit-flip corruption.
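The part_log alerting suggestion above can be implemented with a query like the following, scheduled by your monitoring system (column names are from system.part_log; the one-day window is an example):

```sql
-- Recent part operations that failed, e.g. with checksum errors
SELECT event_time, event_type, database, table, part_name, error, exception
FROM system.part_log
WHERE error > 0
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC;
```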

Frequently Asked Questions

Q: Does a checksum mismatch always mean disk corruption?
A: Not always. It can also be caused by network issues during replication, RAM errors, or interrupted writes. However, disk corruption is the most common cause and should be investigated first.

Q: Will ClickHouse automatically repair corrupted parts on replicated tables?
A: Often, yes. ClickHouse's background part-check thread can detect a broken part, rename it aside, and fetch a fresh copy from a healthy replica. If that does not happen, moving the corrupted part into the table's detached/ directory on the affected replica and restarting will trigger a re-fetch. You can also use SYSTEM RESTORE REPLICA in some scenarios.

Q: Can I prevent this error entirely?
A: You cannot eliminate it completely since hardware can always fail, but using ECC RAM, reliable storage, and replication significantly reduces the risk. Checksums exist precisely to detect these issues before they cause silent data corruption.

Q: Is it safe to ignore this error if it only happens occasionally?
A: No. Even occasional checksum failures suggest an underlying hardware or infrastructure issue that is likely to worsen over time. Investigate and resolve the root cause promptly.
