ClickHouse Filesystems: ext4, XFS, ZFS, BTRFS, and Network Storage

ClickHouse runs on any POSIX filesystem that supports hard links and symbolic links. It uses O_DIRECT to bypass the page cache for large reads and depends on renameat2 for atomic operations in the Atomic database engine. Beyond those requirements, the filesystem choice has a measurable effect on stability and throughput because ClickHouse writes large compressed parts and reads them back sequentially at high bandwidth. The short version: use ext4 unless you have a specific reason not to.
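The requirements above can be probed before committing a volume to ClickHouse. The following is a minimal sketch, not an official check: it assumes Linux, a libc that exports a renameat2 wrapper (glibc 2.28 or newer), and that RENAME_EXCHANGE is a representative flag to test, since the text here does not say which renameat2 flags ClickHouse relies on. The function names are illustrative.

```python
import ctypes
import os
import tempfile

AT_FDCWD = -100       # Linux constant: resolve relative paths from the cwd
RENAME_EXCHANGE = 2   # Linux constant: atomically swap two existing paths


def check_links(directory):
    """Return (hard_link_ok, symlink_ok) using scratch files under directory."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        src = os.path.join(scratch, "src")
        with open(src, "w") as f:
            f.write("x")
        try:
            os.link(src, os.path.join(scratch, "hard"))
            hard_ok = os.stat(src).st_nlink == 2
        except OSError:
            hard_ok = False
        try:
            os.symlink(src, os.path.join(scratch, "soft"))
            soft_ok = os.path.islink(os.path.join(scratch, "soft"))
        except OSError:
            soft_ok = False
    return hard_ok, soft_ok


def check_rename_exchange(directory):
    """Return True or False depending on whether renameat2(RENAME_EXCHANGE)
    works under directory, or None if libc does not export renameat2."""
    libc = ctypes.CDLL(None, use_errno=True)
    if not hasattr(libc, "renameat2"):
        return None
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        a = os.path.join(scratch, "a")
        b = os.path.join(scratch, "b")
        for path, text in ((a, "A"), (b, "B")):
            with open(path, "w") as f:
                f.write(text)
        rc = libc.renameat2(AT_FDCWD, os.fsencode(a),
                            AT_FDCWD, os.fsencode(b), RENAME_EXCHANGE)
        if rc != 0:
            return False
        # After a successful exchange, path "a" holds what "b" held before.
        with open(a) as f:
            return f.read() == "B"
```

Point both functions at the directory that will hold the data, for example /var/lib/clickhouse, and run them as a user that can write there. A False from either function is a reason to investigate before deploying.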

Workload Characteristics

ClickHouse stores data compressed (LZ4 by default, ZSTD when configured), so disk reads follow a high-throughput, moderate-IOPS profile. Indexes, primary keys, and small metadata files remain uncompressed and are read on every query. The filesystem must handle:

Large sequential reads from compressed data parts
Many small reads from mark (.mrk) files, primary index (.idx) files, and other small metadata files
Frequent file creation and rename during merges
Hard link creation during ALTER and freeze operations
Atomic rename via renameat2 for the Atomic database engine

ext4: The Default Choice

ext4 is the safe choice. There are no known compatibility issues with ClickHouse, hard links and renameat2 are supported, and performance is predictable. Kernel 3.15, the release that introduced renameat2, is the floor, but modern kernels (5.x or newer) are preferred. If you have no reason to pick something else, pick ext4.

A reasonable mount configuration:

/dev/nvme1n1 /var/lib/clickhouse ext4 defaults,noatime,nodiratime 0 0

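noatime and nodiratime stop the filesystem from updating access times on every file and directory read, which removes unnecessary metadata writes under ClickHouse's read-heavy access pattern. On Linux, noatime already implies nodiratime, so listing both is harmless but redundant.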
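Whether the option actually took effect can be confirmed from the kernel's mount table. The sketch below is an illustration under assumptions: it expects text in the /proc/mounts format (device, mount point, type, options, two numbers), ignores the octal escaping that /proc/mounts applies to spaces in paths, and the function name is made up for this example.

```python
def mount_options(mounts_text, path):
    """Return the option set of the most specific mount point containing path."""
    best_point, best_opts = "", set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        point, opts = fields[1], fields[3]
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            # The longest matching mount point is the one that serves path.
            if len(point) > len(best_point):
                best_point, best_opts = point, set(opts.split(","))
    return best_opts


# Typical use on a live host:
#   with open("/proc/mounts") as f:
#       opts = mount_options(f.read(), "/var/lib/clickhouse")
#   print("noatime" in opts)
```

If the result contains relatime instead of noatime, the data directory is most likely still mounted with the distribution defaults.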

XFS: Not Recommended

There are multiple reports of degraded ClickHouse performance on XFS under sustained load. Symptoms include kernel log entries like "task XYZ blocked for more than 120 seconds", ClickHouse becoming unresponsive while XFS daemon work runs, and kernel I/O utilization spiking to 99% on otherwise healthy hardware. Some deployments have traced unexpectedly poor cluster performance directly to XFS.
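To see whether a node is hitting the blocked-task symptom, the kernel log can be scanned for those messages. This is a rough sketch: it assumes the log lines follow the usual hung-task wording with a "name:pid" pair after the word "task", and it works on log text you supply (for example the output of dmesg or journalctl -k) rather than reading the log itself.

```python
import re

# Matches e.g. "INFO: task clickhouse-serv:4242 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return a list of (task_name, pid, seconds) for each hung-task message."""
    return [
        (m["name"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(log_text)
    ]
```

A non-empty result that names ClickHouse threads or XFS worker threads during slow periods matches the pattern described above. An empty result does not rule out other I/O problems.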

Kernel 4.0 is the minimum; older kernels are known to behave worse. There is no agreed tuning that resolves these issues at scale. If you are choosing today, choose ext4. If you already run XFS and it works, there is no need to migrate urgently, but consider ext4 for new nodes.

ZFS: Workable With Tuning

ZFS works with ClickHouse but requires extra RAM and tuning. The main pitfall is the ZFS adaptive replacement cache (ARC) competing with ClickHouse for memory. ARC defaults can consume far more memory than is safe to share with a memory-hungry analytical database.

Tune zfs_arc_max so that ARC plus the configured ClickHouse memory budget does not exceed available RAM. A common split is 80% to ClickHouse and 10% to ARC, leaving the rest for the OS:

# On a 64 GB host, cap ARC at roughly 10% of RAM (6 GiB here); run as root
echo $((6 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
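The arithmetic behind that split can be written out for any host size. The sketch below only computes numbers; the function names are hypothetical, and the "options zfs zfs_arc_max=..." line it formats assumes the usual OpenZFS-on-Linux convention of setting module parameters in a file under /etc/modprobe.d so the cap survives a reboot.

```python
GIB = 1024 ** 3


def memory_split(total_bytes, clickhouse_pct=80, arc_pct=10):
    """Split RAM between ClickHouse, the ZFS ARC, and the OS (integer bytes)."""
    if clickhouse_pct + arc_pct > 100:
        raise ValueError("ClickHouse and ARC shares exceed 100% of RAM")
    clickhouse = total_bytes * clickhouse_pct // 100
    arc = total_bytes * arc_pct // 100
    return {
        "clickhouse_bytes": clickhouse,
        "arc_max_bytes": arc,
        "os_bytes": total_bytes - clickhouse - arc,
    }


def modprobe_line(arc_max_bytes):
    """Format a module option line that caps the ARC at arc_max_bytes."""
    return f"options zfs zfs_arc_max={arc_max_bytes}"


if __name__ == "__main__":
    split = memory_split(64 * GIB)
    print(split)
    print(modprobe_line(split["arc_max_bytes"]))
```

For a 64 GiB host this gives an ARC cap of 6871947673 bytes, slightly above the rounded 6 GiB used in the echo command. Either value keeps ARC near the 10% target.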

Compression at the ZFS layer competes with ClickHouse's own compression. Disable filesystem compression on /var/lib/clickhouse since ClickHouse data is already compressed. Set recordsize=1M to match large sequential reads, and disable atime.
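Those three dataset properties can be audited from the output of zfs get. The sketch below assumes the output was produced with "zfs get -H -o property,value compression,recordsize,atime DATASET", which prints one tab-separated property and value per line; DATASET stands for whichever dataset backs /var/lib/clickhouse, and the function name is illustrative.

```python
# Target values taken from the recommendations in the text above.
WANTED = {"compression": "off", "recordsize": "1M", "atime": "off"}


def property_mismatches(zfs_get_output, wanted=WANTED):
    """Return {property: (actual, wanted)} for every property that differs."""
    actual = {}
    for line in zfs_get_output.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            actual[parts[0]] = parts[1]
    return {
        name: (actual.get(name), value)
        for name, value in wanted.items()
        if actual.get(name) != value
    }
```

An empty dictionary means the dataset matches all three settings. A property missing from the output is reported with an actual value of None.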

Two more constraints:

ZFS versions before 2.2 do not implement renameat2, which breaks the Atomic database engine on older ClickHouse versions and any deployment that relies on atomic rename.
Older ClickHouse releases combined with O_DIRECT on ZFS produced subtle issues. Current ClickHouse versions handle this better but the combination is not as well tested as ext4.
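The version constraint is easy to check programmatically. This is a small sketch that assumes the version string looks like "2.2.4-1" or "zfs-2.2.4-1", as typically found in /sys/module/zfs/version or printed by the zfs version command; the function name and the accepted formats are assumptions for illustration.

```python
import re


def zfs_at_least(version_text, minimum=(2, 2)):
    """Return True if the major.minor version in version_text is >= minimum."""
    match = re.match(r"\s*(?:zfs-)?(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognized ZFS version string: {version_text!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Typical use on a host with the ZFS module loaded:
#   with open("/sys/module/zfs/version") as f:
#       print(zfs_at_least(f.read()))
```

A False result means the host falls under the renameat2 limitation described above.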

BTRFS and ReiserFS: Limited Experience

Both filesystems have been reported as working but with limited deployment data. BTRFS in particular has snapshot and subvolume features that look attractive but interact poorly with ClickHouse's own hard-link-based freeze and backup mechanisms. There is not enough operational evidence to recommend either over ext4.

Network Filesystems: NFS and EFS

NFS works for ClickHouse storage but has two hard constraints. Throughput is bounded by network bandwidth, and file operations per second are low because of NFS locking semantics. ClickHouse does many small metadata operations during merges and inserts, and these latencies add up.
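A quick way to see the metadata cost is to time a loop of small create, rename, and delete operations on the candidate mount and compare it with a local disk. The sketch below is a crude microbenchmark, not a model of ClickHouse's real I/O: the file count, file size, and function name are arbitrary choices made for illustration.

```python
import os
import tempfile
import time


def metadata_ops_per_second(directory, files=200):
    """Time create + rename + unlink cycles under directory; return ops/sec."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        start = time.perf_counter()
        for i in range(files):
            tmp = os.path.join(scratch, f"tmp_{i}")
            final = os.path.join(scratch, f"part_{i}")
            with open(tmp, "wb") as f:
                f.write(b"x")
            os.rename(tmp, final)   # mimics publishing a finished file
            os.unlink(final)
        elapsed = time.perf_counter() - start
    # Three metadata operations per iteration; guard against a zero interval.
    return 3 * files / max(elapsed, 1e-9)
```

Run it once against a directory on the network mount and once against local storage. A large gap between the two numbers indicates how much merge and insert latency the network filesystem is likely to add.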

AWS EFS is NFS v4.1 under the hood and has the same limits. EFS is reasonable as cold storage, paired with local EBS or NVMe for hot data through ClickHouse's tiered storage policies.

A frequent mistake is assuming a shared network filesystem provides replication. ClickHouse does not support multiple replicas pointing at the same data on a network disk. Replication is at the ClickHouse layer through ReplicatedMergeTree, not the filesystem layer.

Distributed and Cluster Filesystems

Lustre, Ceph, MooseFS, and GlusterFS have been deployed under ClickHouse but with sparse public documentation. Lustre in particular requires fast networking and had data corruption reports on older ClickHouse versions related to O_DIRECT and async I/O. If you are running one of these for reasons outside ClickHouse, test thoroughly with current ClickHouse releases before committing.

Recommendation Summary

Filesystem	Recommendation
ext4	Default choice for all production deployments
XFS	Avoid for new deployments
ZFS	Workable with strict ARC tuning and ZFS 2.2 or newer
BTRFS	Not enough data to recommend
NFS / EFS	Cold tier only, not for hot data
Lustre / Ceph	Only if mandated, test thoroughly

Common Pitfalls

Enabling filesystem-level compression on top of ClickHouse compression, which spends extra CPU recompressing already-compressed data for little or no space gain.
Leaving ZFS ARC at default sizes on a host that also runs ClickHouse, leading to OOM kills.
Mounting /var/lib/clickhouse without noatime, generating spurious metadata writes.
Assuming a shared NFS mount enables multi-replica access. It does not.
Running XFS and trying to tune around stalls instead of switching to ext4.

Frequently Asked Questions

Q: Should I use ext4 or XFS for ClickHouse? A: ext4. XFS has documented stall and performance issues with ClickHouse workloads and no reliable tuning to fix them.

Q: Can I run ClickHouse on ZFS? A: Yes, with care. Cap zfs_arc_max, disable ZFS-level compression, use recordsize=1M, and prefer ZFS 2.2 or newer for renameat2 support.

Q: Will ClickHouse work on AWS EFS? A: Functionally yes, but EFS is too slow for hot data because of NFS locking overhead. Use it as a cold tier behind EBS.

Q: Does using a shared filesystem replace ClickHouse replication? A: No. ClickHouse expects each replica to have its own storage. Replication happens at the table engine level via ReplicatedMergeTree.

Q: What mount options should I use on ext4? A: defaults,noatime,nodiratime. The noatime flag is the important one because it eliminates metadata writes on every read.