ClickHouse DB::Exception: ZSTD decoder failed

The "DB::Exception: ZSTD decoder failed" error in ClickHouse means that the ZSTD decompression library could not decode a compressed block of data. Reported under the error name ZSTD_DECODER_FAILED, it typically indicates that the compressed data is corrupted or was not produced by a compatible ZSTD encoder. Because ZSTD is one of the most widely used codecs in ClickHouse deployments, this error usually signals a data integrity problem that needs immediate attention.

Impact

Queries reading from affected parts will fail. This can impact both user-facing queries and background operations like merges. If the corruption is widespread across multiple parts, large portions of a table may become unreadable. Replication queues may stall if the corrupted parts are involved in pending replication tasks.
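
To see whether replication is actually stalled on a corrupted part, the replication queue can be inspected directly. The following is a minimal sketch, not a definitive check: the helper name stalled_queue_query is hypothetical, and filtering last_exception on the substring "ZSTD" is an assumption about how the error text appears on your server.

```shell
#!/bin/sh
# Print a query listing replication queue entries whose last exception
# mentions ZSTD. The '%ZSTD%' filter is an assumption; adjust it to match
# the message your server actually records.
stalled_queue_query() {
  cat <<'SQL'
SELECT database, table, type, new_part_name, num_tries, last_exception
FROM system.replication_queue
WHERE last_exception ILIKE '%ZSTD%'
ORDER BY num_tries DESC
SQL
}

# Example (requires a reachable server):
#   clickhouse-client --query "$(stalled_queue_query)"
```

Entries with a steadily growing num_tries are the ones most likely to be blocked on a part that cannot be read.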

Common Causes

  1. Data corruption on disk due to storage hardware failures
  2. Bit rot or silent data corruption on aging storage media
  3. Partial writes caused by a server crash or power failure during insert or merge operations
  4. Network corruption during replication that bypassed higher-level integrity checks
  5. Memory corruption (faulty RAM) that produced invalid compressed output during the original write
  6. Attempting to read data written by an incompatible ZSTD version (very rare with ClickHouse's bundled library)

Troubleshooting and Resolution Steps

  1. Identify the corrupted part from the error message. The log typically includes the file path:

    grep -i "ZSTD_DECODER_FAILED\|ZSTD decoder" /var/log/clickhouse-server/clickhouse-server.log
    
  2. Run a table check to identify all corrupted parts:

    CHECK TABLE your_database.your_table;
    
  3. For ReplicatedMergeTree tables, detach the corrupted part and let replication recover it:

    ALTER TABLE your_table DETACH PART 'part_name';
    

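    In most cases a replicated table detects a broken part on its own, moves it to the detached directory, and fetches a healthy copy from another replica. Note that a manual DETACH PART is documented as a replicated query, so check how it behaves on your version before running it on a multi-replica cluster.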

  4. Check disk health:

    smartctl -a /dev/sda
    dmesg | grep -i "error\|fault"
    
  5. For non-replicated tables, restore the part from backup:

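    # Remove the corrupted part directory (substitute your db, table, and part_name;
    # make sure the server is not using the part while you do this)
    rm -rf /var/lib/clickhouse/data/db/table/part_name/
    # Restore the part from backup, then restart ClickHouse or run ATTACH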
    
  6. If no backup exists, drop the corrupted part as a last resort:

    ALTER TABLE your_table DROP PART 'part_name';
    
  7. Investigate the root cause -- run memory diagnostics and check storage health to prevent recurrence:

    # Memory: memtest86+ is a bootable tool, so boot the host from it rather than running it from a shell
    fsck /dev/sda1  # filesystem check (run only while the filesystem is unmounted)
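
Steps 1 and 2 can be scripted so that the list of suspect parts does not have to be read off by eye. The sketch below makes two assumptions that are not stated in this article: part names follow the usual MergeTree pattern partition_min_max_level (optionally with a mutation suffix), and CHECK TABLE is run with check_query_single_value_result = 0 so that it returns one tab-separated row per part (part_path, is_passed, message). The function names are hypothetical.

```shell
#!/bin/sh
# Read server log lines on stdin and print the unique part-name-like tokens
# found on lines that mention the ZSTD decoder error.
# Assumption: part names look like <partition_id>_<min>_<max>_<level>[_<mutation>].
extract_zstd_parts() {
  grep -iE 'ZSTD_DECODER_FAILED|ZSTD decoder' \
    | grep -oE '[A-Za-z0-9-]+_[0-9]+_[0-9]+_[0-9]+(_[0-9]+)?' \
    | sort -u
}

# Read CHECK TABLE output on stdin (tab-separated: part_path, is_passed,
# message) and print the part paths whose is_passed column is 0.
failed_parts() {
  awk -F '\t' '$2 == "0" { print $1 }'
}

# Example (requires a running server and a readable log):
#   extract_zstd_parts < /var/log/clickhouse-server/clickhouse-server.log
#   clickhouse-client --format=TabSeparated --check_query_single_value_result=0 \
#     --query "CHECK TABLE your_database.your_table" | failed_parts
```

The log helper can produce false positives if other identifiers in the log happen to match the pattern, so treat its output as candidates to confirm with CHECK TABLE rather than as a definitive list.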
    

Best Practices

  • Always use ReplicatedMergeTree in production to enable automatic recovery from corruption.
  • Implement monitoring for disk health metrics and ClickHouse's system.part_log to catch issues early.
  • Use ECC RAM to prevent memory-related corruption during compression.
  • Maintain regular backups as a safety net for non-replicated tables.
  • Set up RAID or use storage with built-in redundancy to mitigate single-disk failures.
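
As one way to automate the disk-health monitoring suggested above, the output of smartctl -H can be reduced to a pass/fail signal. This is a rough sketch under assumptions: the helper name smart_health_ok is hypothetical, and the matched strings are the ones smartctl commonly prints for ATA/NVMe ("PASSED") and SCSI ("OK") devices; other device types or versions may word the result differently.

```shell
#!/bin/sh
# Succeed only if `smartctl -H` output (read from stdin) reports a passing
# overall health result. The matched phrases are an assumption about
# smartctl's wording for common device types.
smart_health_ok() {
  grep -Eq 'self-assessment test result: PASSED|SMART Health Status: OK'
}

# Example (requires root and smartmontools; device names are placeholders):
#   for dev in /dev/sda /dev/sdb; do
#     smartctl -H "$dev" | smart_health_ok || echo "WARNING: $dev did not report healthy"
#   done
```

A passing overall result does not rule out a failing disk, so it is still worth reviewing the full smartctl -a attributes periodically.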

Frequently Asked Questions

Q: Can a ClickHouse upgrade cause ZSTD decoder failures?
A: This is extremely unlikely. ZSTD maintains backward compatibility, and ClickHouse bundles a tested version of the library. However, if data was written with a custom or patched ClickHouse build using a non-standard ZSTD version, compatibility issues could theoretically arise.

Q: How can I tell if this is a one-time corruption or an ongoing problem?
A: Check system.part_log for a pattern of errors. If you see multiple parts failing over time, the underlying cause (disk, memory, or controller) is likely still active. A single occurrence might be a one-time event from a crash or power loss.
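
A per-day count of failed part events makes that pattern easier to spot. The sketch below assumes part_log is enabled in the server configuration; the helper name part_log_error_query and its look-back parameter are hypothetical.

```shell
#!/bin/sh
# Print a query that counts failed part events per day and table.
# $1 = look-back window in days (hypothetical parameter, default 30).
part_log_error_query() {
  days="${1:-30}"
  # Accept digits only, since the value is interpolated into the SQL text.
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT event_date, database, table,
       count() AS failed_events,
       any(exception) AS sample_exception
FROM system.part_log
WHERE error != 0
  AND event_date >= today() - ${days}
GROUP BY event_date, database, table
ORDER BY event_date
SQL
}

# Example (requires a reachable server):
#   clickhouse-client --query "$(part_log_error_query 14)"
```

Failures spread across many days or tables point to an ongoing hardware problem, while a single cluster of failures on one date is more consistent with a crash or power loss.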

Q: Should I switch away from ZSTD to prevent this?
A: Switching codecs would not prevent the root cause -- if your storage or memory is corrupting data, any codec will be affected. Address the hardware issue rather than changing the codec. ZSTD is well-tested and reliable when the underlying infrastructure is healthy.
