The "DB::Exception: Table is being restarted" error occurs when you attempt to access a table that is in the middle of a SYSTEM RESTART REPLICA operation. The error code TABLE_IS_BEING_RESTARTED indicates a transient condition -- the table is temporarily unavailable while ClickHouse reinitializes its replication state. Once the restart completes, the table becomes accessible again.
Impact
All queries against the affected table will fail with this error for the duration of the restart. This includes SELECT, INSERT, and ALTER operations. The impact is typically brief (seconds to minutes), but it affects availability if the table serves real-time traffic.
Common Causes
- Explicit SYSTEM RESTART REPLICA command -- an administrator or script intentionally restarted the replica to fix replication issues.
- Automated recovery procedures -- monitoring systems that issue SYSTEM RESTART REPLICA when they detect replication lag or ZooKeeper session loss.
- ClickHouse internal recovery -- after a ZooKeeper session timeout, ClickHouse may automatically restart replicas.
- Concurrent access during maintenance -- queries that arrive while a maintenance operation is restarting replicas.
Troubleshooting and Resolution Steps
Wait and retry. This is a transient error. The table will become available once the restart finishes. A retry with short backoff (1-2 seconds) is usually sufficient:
    -- Simply retry the query after a brief pause
Check if a SYSTEM RESTART REPLICA is in progress (the second condition keeps this query from matching itself):
    SELECT query, elapsed
    FROM system.processes
    WHERE query LIKE '%RESTART REPLICA%'
      AND query NOT LIKE '%system.processes%';
Monitor the replication queue to see when the restart completes:
    SELECT database, table, is_currently_executing, num_tries, last_exception
    FROM system.replication_queue
    WHERE table = 'my_table'
    ORDER BY create_time DESC
    LIMIT 10;
Check ZooKeeper connectivity. If the restart is taking unusually long, ZooKeeper may be slow or unreachable:
    SELECT * FROM system.zookeeper WHERE path = '/';
Review the ClickHouse server logs for the restart reason:
    grep -i "restart replica" /var/log/clickhouse-server/clickhouse-server.log | tail -20
If the restart seems stuck, check for ZooKeeper session issues and consider restarting the ClickHouse server as a last resort.
Best Practices
- Schedule SYSTEM RESTART REPLICA operations during maintenance windows or low-traffic periods.
- Implement retry logic with exponential backoff in applications that query replicated tables.
- Use a load balancer that can route queries to healthy replicas while one is restarting.
- Monitor replica health proactively so that manual restarts are rarely needed.
- Set ZooKeeper session timeouts that balance quick failure detection against spurious restarts.
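The retry-with-backoff practice above can be sketched as a small shell wrapper. This is a minimal illustration, not an official ClickHouse tool: the function name retry_on_restart and the MAX_ATTEMPTS and BASE_DELAY variables are hypothetical, and the sketch assumes the client reports the failure with either the error name TABLE_IS_BEING_RESTARTED or the message "Table is being restarted" in its output.

```shell
#!/usr/bin/env bash
# Sketch: retry a command with exponential backoff, but only while it fails
# with the transient "table is being restarted" error.
# retry_on_restart, MAX_ATTEMPTS and BASE_DELAY are hypothetical names.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"   # total tries before giving up
BASE_DELAY="${BASE_DELAY:-1}"       # first pause in whole seconds; doubles each retry

retry_on_restart() {
    local attempt=1 delay="$BASE_DELAY" output status
    while true; do
        # Capture stdout and stderr together so the error text can be inspected.
        status=0
        output="$("$@" 2>&1)" || status=$?
        if [ "$status" -eq 0 ]; then
            printf '%s\n' "$output"
            return 0
        fi
        # Any other error is returned immediately without retrying.
        if ! printf '%s' "$output" | grep -Eq 'TABLE_IS_BEING_RESTARTED|Table is being restarted'; then
            printf '%s\n' "$output" >&2
            return "$status"
        fi
        # Give up once the attempt budget is spent.
        if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
            printf '%s\n' "$output" >&2
            return "$status"
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

A call might look like `retry_on_restart clickhouse-client --query "SELECT count() FROM my_table"`. Restricting retries to this one error keeps genuine failures, such as a missing table or a syntax error, from being masked by the backoff loop.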
Frequently Asked Questions
Q: How long does a replica restart typically take?
A: Usually a few seconds for healthy tables. Tables with large replication queues or slow ZooKeeper connections may take longer. If it exceeds a few minutes, investigate ZooKeeper health.
Q: Can I query other tables in the same database while one table is restarting?
A: Yes. The restart only affects the specific table being restarted. Other tables in the same database remain fully accessible.
Q: Should I automatically restart replicas when replication lag is detected?
A: Only as a last resort. Replication lag is usually caused by heavy load or resource contention, not a broken replica state. Restarting the replica will not help with resource issues and adds a brief outage. Investigate the root cause first.
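As a rough illustration of investigating first, the sketch below reads a few columns from system.replicas and prints a hint instead of restarting anything. The function name replica_advice, the LAG_THRESHOLD variable, and the decision rules are assumptions made for this example, not ClickHouse recommendations. In particular, treating is_readonly or is_session_expired as a sign of a problem beyond load is a simplification, and the query ignores the database name.

```shell
#!/usr/bin/env bash
# Sketch: summarise replica state before deciding whether a restart is needed.
# replica_advice, CLICKHOUSE_CLIENT and LAG_THRESHOLD are hypothetical names.

CLICKHOUSE_CLIENT="${CLICKHOUSE_CLIENT:-clickhouse-client}"
LAG_THRESHOLD="${LAG_THRESHOLD:-300}"   # seconds of delay treated as "lagging" (assumed value)

replica_advice() {
    local table="$1" row
    # The table name is interpolated directly: pass trusted values only.
    row="$("$CLICKHOUSE_CLIENT" --query "
        SELECT is_readonly, is_session_expired, queue_size, absolute_delay
        FROM system.replicas
        WHERE table = '$table'
        LIMIT 1
        FORMAT TabSeparated")" || return 1
    if [ -z "$row" ]; then
        echo "unknown: no row in system.replicas for $table"
        return 1
    fi
    # All four columns are numeric, so default whitespace splitting is safe here.
    set -- $row
    local is_readonly="$1" is_session_expired="$2" queue_size="$3" absolute_delay="$4"
    if [ "$is_readonly" -eq 1 ] || [ "$is_session_expired" -eq 1 ]; then
        echo "degraded: is_readonly=$is_readonly is_session_expired=$is_session_expired; check ZooKeeper first"
    elif [ "$absolute_delay" -gt "$LAG_THRESHOLD" ]; then
        echo "lagging: queue_size=$queue_size delay=${absolute_delay}s; investigate load before restarting"
    else
        echo "healthy: queue_size=$queue_size delay=${absolute_delay}s"
    fi
}
```

Running `replica_advice my_table` prints one line starting with degraded, lagging, healthy, or unknown. A "lagging" result points toward load or resource contention rather than a restart.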
Q: Is there an alternative to SYSTEM RESTART REPLICA for fixing replication issues?
A: For many replication problems, SYSTEM SYNC REPLICA my_table is a less disruptive option. It waits for the replication queue to be processed without taking the table offline. Only use RESTART REPLICA when the replica's replication state is genuinely corrupted.