ClickHouse DB::Exception: Table is being restarted

The "DB::Exception: Table is being restarted" error occurs when you attempt to access a table that is in the middle of a SYSTEM RESTART REPLICA operation. The error code TABLE_IS_BEING_RESTARTED indicates a transient condition -- the table is temporarily unavailable while ClickHouse reinitializes its replication state. Once the restart completes, the table becomes accessible again.

Impact

All queries against the affected table will fail with this error for the duration of the restart. This includes SELECT, INSERT, and ALTER operations. The impact is typically brief (seconds to minutes), but it affects availability if the table serves real-time traffic.

Common Causes

Explicit SYSTEM RESTART REPLICA command -- an administrator or script intentionally restarted the replica to fix replication issues.
Automated recovery procedures -- monitoring systems that issue SYSTEM RESTART REPLICA when they detect replication lag or ZooKeeper session loss.
ClickHouse internal recovery -- after a ZooKeeper session timeout, ClickHouse may automatically restart replicas.
Concurrent access during maintenance -- queries arriving while a maintenance window operation is restarting replicas.

Troubleshooting and Resolution Steps

Wait and retry. This is a transient error. The table will become available once the restart finishes. A retry with short backoff (1-2 seconds) is usually sufficient:
```
-- Re-run the failed query after a 1-2 second pause (my_table is a placeholder)
SELECT count() FROM my_table;
```
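
On the application side, the wait-and-retry advice can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `run_query` stands for any zero-argument callable that executes your query with whatever client library you use, and the helper names, attempt count, and delay values are assumptions chosen for the example. It recognizes the error by matching the message text, so it does not depend on a particular driver's exception class.

```python
import random
import time

# Text fragments that identify the transient restart error in an exception message.
RESTART_MARKERS = ("TABLE_IS_BEING_RESTARTED", "Table is being restarted")


def is_restart_error(exc: Exception) -> bool:
    """Return True if the exception text looks like the restart error."""
    text = str(exc)
    return any(marker in text for marker in RESTART_MARKERS)


def retry_while_restarting(run_query, max_attempts=5, base_delay=1.0,
                           max_delay=8.0, sleep=time.sleep):
    """Call run_query(), retrying only on the restart error.

    Delays grow exponentially (1s, 2s, 4s, ...) up to max_delay.
    All parameter defaults are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            # Other errors are not treated as transient here: re-raise them.
            # The last attempt also re-raises the restart error itself.
            if not is_restart_error(exc) or attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use, call `retry_while_restarting(lambda: client_call(...))` where `client_call` is your own query function.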

Check if a SYSTEM RESTART REPLICA is in progress:

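SELECT query, elapsed
FROM system.processes
WHERE query LIKE '%RESTART REPLICA%'
  -- Exclude this monitoring query, which also contains the search text
  AND query NOT LIKE '%system.processes%';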

Monitor the replication queue to see when the restart completes:

SELECT
    database,
    table,
    is_currently_executing,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE table = 'my_table'
ORDER BY create_time DESC
LIMIT 10;

Check ZooKeeper connectivity. If the restart is taking unusually long, ZooKeeper may be slow or unreachable:
```
SELECT * FROM system.zookeeper WHERE path = '/';
```

Review the ClickHouse server logs for the restart reason:

grep -i "restart replica" /var/log/clickhouse-server/clickhouse-server.log | tail -20

If the restart seems stuck, check for ZooKeeper session issues and consider restarting the ClickHouse server as a last resort.

Best Practices

Schedule SYSTEM RESTART REPLICA operations during maintenance windows or low-traffic periods.
Implement retry logic with exponential backoff in applications that query replicated tables.
Use a load balancer that can route queries to healthy replicas while one is restarting.
Monitor replica health proactively so that manual restarts are rarely needed.
Set ZooKeeper session timeouts that balance avoiding false restarts against detecting real failures quickly.
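
If a load balancer is not available, the idea of routing around a restarting replica can be approximated in the client. The sketch below is an assumption-laden illustration: `replica_runners` is a hypothetical list of `(name, callable)` pairs, one per replica, where each callable runs the same query against that replica. A replica is skipped only when its error message matches the restart error; any other failure is raised immediately.

```python
def query_first_healthy(replica_runners):
    """Try each replica in order and return (replica_name, result).

    replica_runners: list of (name, zero-argument callable) pairs.
    Replicas that report the restart error are skipped.
    """
    markers = ("TABLE_IS_BEING_RESTARTED", "Table is being restarted")
    last_error = None
    for name, run_query in replica_runners:
        try:
            return name, run_query()
        except Exception as exc:
            # Only the transient restart error justifies moving on.
            if not any(marker in str(exc) for marker in markers):
                raise
            last_error = exc
    # Every replica was restarting (or the list was empty).
    raise RuntimeError("no replica available: all are restarting") from last_error
```

This keeps read traffic flowing during a single-replica restart; it can be combined with a retry loop for the case where every replica is restarting at once.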

Frequently Asked Questions

Q: How long does a replica restart typically take?
A: Usually a few seconds for healthy tables. Tables with large replication queues or slow ZooKeeper connections may take longer. If it exceeds a few minutes, investigate ZooKeeper health.

Q: Can I query other tables in the same database while one table is restarting?
A: Yes. The restart only affects the specific table being restarted. Other tables in the same database remain fully accessible.

Q: Should I automatically restart replicas when replication lag is detected?
A: Only as a last resort. Replication lag is usually caused by heavy load or resource contention, not a broken replica state. Restarting the replica will not help with resource issues and adds a brief outage. Investigate the root cause first.
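
As a rough illustration of "investigate first", an automation script could classify a replica's state instead of restarting it outright. The sketch below assumes you have already read `is_readonly`, `is_session_expired`, and `absolute_delay` for the table from `system.replicas`; the `ReplicaStatus` type, the action names, and the 300-second threshold are hypothetical choices made for this example. It deliberately never recommends a restart, leaving that decision to an operator.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Values assumed to come from one row of system.replicas."""
    is_readonly: bool
    is_session_expired: bool
    absolute_delay: int  # replication lag in seconds


def recommend_action(status: ReplicaStatus, lag_threshold_s: int = 300) -> str:
    """Suggest a next step for a replica; the threshold is an illustrative assumption."""
    # An expired session or read-only replica points at ZooKeeper, not at lag.
    if status.is_session_expired or status.is_readonly:
        return "check_zookeeper"
    # Lag under the threshold needs no action.
    if status.absolute_delay < lag_threshold_s:
        return "none"
    # Sustained lag is usually load or resource contention: look there first.
    return "investigate_load"
```

For example, `recommend_action(ReplicaStatus(False, False, 900))` suggests investigating load rather than issuing SYSTEM RESTART REPLICA.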

Q: Is there an alternative to SYSTEM RESTART REPLICA for fixing replication issues?
A: For many replication problems, SYSTEM SYNC REPLICA my_table is a less disruptive option. It waits for the replication queue to be processed without taking the table offline. Only use RESTART REPLICA when the replica's replication state is genuinely corrupted.