ClickHouse DB::Exception: Failed to sync backup or restore

The "DB::Exception: Failed to sync backup or restore" error occurs when ClickHouse cannot coordinate a backup or restore operation across replicas in a cluster. The FAILED_TO_SYNC_BACKUP_OR_RESTORE error code indicates that the distributed synchronization mechanism -- which ensures all replicas participate in a cluster-wide backup or restore -- encountered a failure, timeout, or disagreement between nodes.

Impact

The backup or restore operation fails for the whole cluster. Some replicas may have completed part of their work while others never started or failed partway, so partial backup data can remain at the destination and may need manual cleanup before a retry. In a disaster recovery scenario, a failed sync can significantly delay the restoration process.
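To find the failed operation and its error message, the system.backups table can be read across the cluster in one query. This is a minimal sketch: 'my_cluster' is a placeholder cluster name, and it assumes the replicas are reachable enough to answer. The node that initiated the operation is the one most likely to hold the record.

```sql
-- Most recent backup/restore attempts recorded on any replica.
-- 'my_cluster' is a placeholder; substitute your own cluster name.
SELECT
    hostName() AS host,
    id,
    name,
    status,      -- e.g. BACKUP_CREATED, BACKUP_FAILED, RESTORED, RESTORE_FAILED
    error,       -- failure message, empty on success
    start_time,
    end_time
FROM clusterAllReplicas('my_cluster', system.backups)
ORDER BY start_time DESC
LIMIT 20;
```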

Common Causes

ZooKeeper or ClickHouse Keeper unavailability -- the coordination service is down, overloaded, or experiencing network partitions.
One or more replicas are offline -- a replica that is expected to participate in the backup or restore is unreachable.
Timeout exceeded -- the operation took longer than the configured sync timeout, causing ClickHouse to abort.
Network issues between replicas -- intermittent connectivity problems prevent replicas from exchanging sync signals.
Conflicting operations -- another backup, restore, or heavy mutation is running concurrently and blocking the sync.
Replica lag -- a replica is significantly behind in replication, and the sync cannot proceed until it catches up.

Troubleshooting and Resolution Steps

Check ZooKeeper/Keeper health:

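```
SELECT * FROM system.zookeeper WHERE path = '/';
```

A healthy ZooKeeper or Keeper node answers the ruok command with imok. Port 2181 is the ZooKeeper default; ClickHouse Keeper is often configured on 9181.

```
echo ruok | nc zookeeper-host 2181
```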
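Keeper trouble also shows up in ClickHouse's own counters. As a rough sketch, the query below reads the ZooKeeper-related events accumulated since the server started. Event names vary between versions, so the LIKE filter is deliberately broad.

```sql
-- ZooKeeper/Keeper client counters on the current node.
SELECT event, value, description
FROM system.events
WHERE event LIKE 'ZooKeeper%'
ORDER BY event;
```

Exception counters such as ZooKeeperHardwareExceptions that keep growing between two runs of this query suggest connection-level problems rather than a one-off failure.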

Verify all replicas are online:

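```
-- errors_count > 0 marks hosts that recently failed to respond.
-- Read-only state is reported per table in system.replicas, not here.
SELECT host_name, port, is_local, errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'my_cluster';
```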
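system.clusters reflects the configured topology and recent error counts, so a direct round trip to each replica is a useful complement. A small sketch, assuming the same 'my_cluster' name: with default settings the query is expected to fail with a connection error if any replica cannot be reached.

```sql
-- One row per replica that answered; an unreachable replica
-- normally makes the whole query fail instead of being skipped.
SELECT hostName() AS host, uptime() AS uptime_seconds
FROM clusterAllReplicas('my_cluster', system.one)
ORDER BY host;
```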

Check for replication lag:

```
SELECT database, table, replica_name, is_session_expired,
       absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 0 OR queue_size > 0;
```
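The lag check above only covers the node it runs on. A cluster-wide variant, sketched under the assumption that the cluster is named 'my_cluster', also surfaces read-only tables and expired Keeper sessions on every replica:

```sql
-- Replicated tables that are lagging, read-only, or have lost
-- their Keeper session, across all replicas of the cluster.
SELECT
    hostName() AS host,
    database,
    table,
    is_readonly,
    is_session_expired,
    absolute_delay,
    queue_size
FROM clusterAllReplicas('my_cluster', system.replicas)
WHERE is_readonly
   OR is_session_expired
   OR absolute_delay > 0
   OR queue_size > 0
ORDER BY absolute_delay DESC;
```

An empty result means no replicated table reported a problem at the moment of the query.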

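Give the sync more headroom if the operation is slow but not fundamentally broken. The settings below raise the number of retries, and the initial backoff between them, for Keeper requests made during backup or restore:

```
BACKUP DATABASE my_db ON CLUSTER 'my_cluster'
TO Disk('backups', 'cluster_backup')
SETTINGS backup_restore_keeper_max_retries = 20,
         backup_restore_keeper_retry_initial_backoff_ms = 500;
```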

Wait for ongoing operations to complete. Check for active mutations or merges that may be blocking:
```
SELECT * FROM system.mutations WHERE is_done = 0;
SELECT * FROM system.merges;
```
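The two queries above do not show other backups or restores that are still running, which is another of the conflicts listed under Common Causes. A hedged sketch that covers that case and narrows the mutation output to the columns most useful for diagnosis:

```sql
-- Backups or restores still in progress on this node.
SELECT id, name, status, start_time
FROM system.backups
WHERE status IN ('CREATING_BACKUP', 'RESTORING');

-- Unfinished mutations, oldest first, with the latest failure reason if any.
SELECT database, table, mutation_id, command, create_time,
       parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
```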
Run the backup on individual replicas instead of using ON CLUSTER if cluster-wide coordination continues to fail:
```
-- Run on each replica separately
BACKUP DATABASE my_db TO Disk('backups', 'replica1_backup');
```
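Reinitialize the replica's Keeper session state if a replicated table appears stuck. SYSTEM RESTART REPLICA re-reads the table's state from Keeper and requeues missing tasks; it does not delete any Keeper nodes: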
```
SYSTEM RESTART REPLICA my_db.my_table;
```

Best Practices

Monitor ZooKeeper/Keeper health continuously -- backup sync failures are often a symptom of Keeper issues.
Ensure all replicas are healthy and caught up before initiating cluster-wide backups.
Avoid running heavy operations (large mutations, schema changes) during backup windows.
Set appropriate timeouts for cluster backup operations based on your data size and network characteristics.
Have a fallback plan for per-replica backups if cluster-wide sync consistently fails.
Use ON CLUSTER backups only when you need coordinated point-in-time snapshots; otherwise, independent per-replica backups may be simpler and more reliable.

Frequently Asked Questions

Q: Do I need to use ON CLUSTER for backups in a replicated setup?
A: Not necessarily. You can back up individual replicas independently. ON CLUSTER is useful when you need a coordinated snapshot across all replicas, but it adds complexity and a synchronization requirement.

Q: What happens if one replica fails during a cluster backup?
A: The entire cluster backup fails with the FAILED_TO_SYNC_BACKUP_OR_RESTORE error. The backup is not considered complete unless all participating replicas succeed.

Q: How long does the sync timeout last by default?
A: The default timeout depends on your ClickHouse version and configuration. You can adjust it using backup_restore_keeper_max_retries and related settings. Check your server configuration for the current values.
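One way to see the values in effect for the current session, sketched with a broad name filter because the exact set of settings differs between versions:

```sql
-- Backup/restore coordination settings and whether they were changed
-- from the built-in defaults.
SELECT name, value, changed, description
FROM system.settings
WHERE name LIKE 'backup_restore%'
ORDER BY name;
```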

Q: Can network latency between data centers cause this error?
A: Yes. If your cluster spans data centers with high latency, the sync protocol may time out. Consider increasing timeouts or running backups per-datacenter rather than globally.