The "DB::Exception: Cannot close file" error in ClickHouse indicates that a call to close a file descriptor failed at the OS level. Represented by the CANNOT_CLOSE_FILE error code, this is a relatively rare condition that typically points to serious underlying problems with the storage subsystem or filesystem. While closing a file might seem like a trivial operation, a failure here can signal that data was not fully flushed to disk.
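To confirm whether this error has actually been raised, ClickHouse keeps cumulative per-error counters in the `system.errors` table. A query along these lines shows the count and most recent occurrence (exact columns may vary by server version):

```sql
SELECT name, code, value, last_error_time, last_error_message
FROM system.errors
WHERE name = 'CANNOT_CLOSE_FILE';
```

A non-zero `value` means the error has occurred at least once since the server started.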
Impact
A file close failure can have the following effects:
- The data written to the file may not have been fully persisted, risking partial writes
- File descriptor leaks if the descriptor remains open after the failed close
- Ongoing merges or insert operations may be aborted
- Repeated occurrences can gradually exhaust the file descriptor pool, leading to cascading failures
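The descriptor-exhaustion risk above can be watched for with a small script. This is an illustrative sketch, not part of ClickHouse: it compares a process's open-descriptor count against its soft limit, defaulting to the current shell's PID (substitute `$(pidof clickhouse-server)` in practice):

```shell
#!/bin/sh
# Sketch: warn when a process nears its file descriptor limit (Linux /proc).
PID="${1:-$$}"
COUNT=$(ls "/proc/$PID/fd" | wc -l)
# Fourth field of the "Max open files" row is the soft limit.
LIMIT=$(awk '/Max open files/ {print $4}' "/proc/$PID/limits")
echo "fds in use: $COUNT / $LIMIT"
# Warn when usage crosses 80% of the soft limit.
if [ "$COUNT" -gt $((LIMIT * 80 / 100)) ]; then
    echo "WARNING: file descriptor usage above 80% of limit"
fi
```

Run periodically from cron or a monitoring agent, this catches a slow leak well before the pool is exhausted.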
Common Causes
- Underlying storage device errors (disk failure, NFS server disconnection, iSCSI timeout)
- Filesystem corruption that prevents proper metadata updates during close
- Kernel bugs or issues with specific filesystem drivers (especially with FUSE-based filesystems)
- Network filesystem (NFS, CIFS) losing connectivity during the close operation
- The file was already closed or the descriptor was invalidated by another thread or process
- Resource pressure causing the kernel to fail deferred write operations during close
Troubleshooting and Resolution Steps
Check kernel and system logs for storage errors:
```
dmesg | tail -50
journalctl -k --since "30 minutes ago" | grep -i "error\|fail"
```

Look for I/O errors, device timeouts, or filesystem warnings.
Verify the storage device is healthy:
```
smartctl -a /dev/sda
cat /sys/block/sda/device/state
```

For network storage, confirm the mount is still active:

```
mount | grep /var/lib/clickhouse
stat /var/lib/clickhouse
```

Check for file descriptor leaks:

```
ls /proc/$(pidof clickhouse-server)/fd | wc -l
cat /proc/$(pidof clickhouse-server)/limits | grep "open files"
```

A high count relative to the limit suggests a leak or excessive open files.
Review ClickHouse server logs:
```
grep "Cannot close" /var/log/clickhouse-server/clickhouse-server.err.log
```

Note the file path and correlate it with the affected table or operation.
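If the logged path points inside a data part directory, the `system.parts` table can map it back to its table. A sketch, where the part name `all_1_1_0` is a placeholder to be replaced with the directory name from your log:

```sql
SELECT database, table, name, path
FROM system.parts
WHERE path LIKE '%all_1_1_0%';
```

This tells you which table (and which merge or insert) was touching the file when the close failed.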
Check filesystem consistency: If you suspect corruption, schedule a filesystem check during downtime:
```
sudo umount /var/lib/clickhouse
sudo fsck -f /dev/sdX
sudo mount /var/lib/clickhouse
```

Restart ClickHouse after addressing the storage issue. The server will reopen files as needed during startup.
Best Practices
- Use local, enterprise-grade storage (SSDs with power-loss protection) rather than network filesystems for ClickHouse data when possible
- Monitor storage device health with SMART monitoring and automated alerts
- If using NFS or other network filesystems, ensure stable network connectivity and configure appropriate mount timeouts
- Keep the operating system kernel up to date to benefit from filesystem driver fixes
- Set up file descriptor monitoring to detect leaks early
- Maintain redundant replicas so that a storage failure on one node does not cause data loss
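For the NFS case, "appropriate mount timeouts" usually means a hard mount with explicit retry settings, so a brief outage blocks I/O instead of surfacing errors to ClickHouse. An illustrative `/etc/fstab` entry (server name and export path are placeholders):

```
nfs-server:/export/clickhouse /var/lib/clickhouse nfs hard,timeo=600,retrans=3,nofail 0 0
```

`timeo` is in tenths of a second and `retrans` controls how many retries occur before the client reports a major timeout; tune both to your network's failure characteristics.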
Frequently Asked Questions
Q: Is data lost when this error occurs?
A: It depends on the operation. If the close failure happens after data was successfully written and fsynced, no data is lost. However, if deferred writes were pending during the close, some data may not have reached disk. ClickHouse checksums will detect any inconsistency on the next read.
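Rather than waiting for the next read, you can proactively verify a suspect table's data with ClickHouse's `CHECK TABLE` statement (the table name here is a placeholder):

```sql
CHECK TABLE my_database.my_table;
```

For MergeTree-family tables this validates part checksums and reports whether the data is intact.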
Q: Can this error happen with cloud-managed disks (EBS, Persistent Disk)?
A: It is uncommon but possible, especially during cloud infrastructure incidents or if the disk becomes detached. Cloud provider status pages and instance logs can help confirm.
Q: Should I be concerned if this happens once?
A: A single occurrence could be a transient storage hiccup. However, you should still investigate the cause. Repeated occurrences indicate a persistent problem that needs attention before it escalates.
Q: Does this error affect all tables or just the one being operated on?
A: The error is specific to the file being closed at the time. Other tables are not directly affected unless the root cause (e.g., a failing disk) impacts all files on the same volume.
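To see which configured disks, and therefore which tables' volumes, might share the affected device, the `system.disks` table lists each disk and its mount path. A minimal query:

```sql
SELECT name, path, free_space, total_space
FROM system.disks;
```

Any table whose storage policy places parts on the failing disk's path is potentially exposed to the same root cause.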