The "DB::Exception: Server overloaded" error in ClickHouse indicates that the server is rejecting a new query because it has detected CPU overload. The SERVER_OVERLOADED error code is raised based on the ratio of CPU wait time to CPU busy time: when that ratio crosses the threshold defined by the query-level settings min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw, the query is rejected with a probability that scales linearly between the two thresholds. This is a backpressure mechanism that protects the server from being overwhelmed when CPU contention is high.
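The linear scaling described above can be illustrated with a small sketch. It is not ClickHouse's implementation: the function name `rejection_probability` is hypothetical, the threshold values in the usage note are made up for illustration, and the behaviour exactly at the two boundaries is an assumption.

```python
def rejection_probability(ratio: float, min_ratio: float, max_ratio: float) -> float:
    """Illustrative model of the chance that a new query is rejected.

    ratio     -- observed CPU wait time divided by CPU busy time
    min_ratio -- stands in for min_os_cpu_wait_time_ratio_to_throw
    max_ratio -- stands in for max_os_cpu_wait_time_ratio_to_throw
    """
    # Assumption: at or below the lower threshold nothing is rejected.
    if ratio <= min_ratio:
        return 0.0
    # Assumption: at or above the upper threshold everything is rejected.
    # Checking this before dividing also avoids a zero division when
    # min_ratio == max_ratio.
    if ratio >= max_ratio:
        return 1.0
    # Linear interpolation between the two thresholds.
    return (ratio - min_ratio) / (max_ratio - min_ratio)
```

With illustrative thresholds of 2 and 6, a ratio of 4 sits halfway between them, so `rejection_probability(4.0, 2.0, 6.0)` evaluates to 0.5 in this model: roughly half of the new queries would be turned away.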
Impact
New queries are rejected with an immediate error, while queries already in progress continue to execute. Because rejection is probabilistic between the two thresholds, any user submitting a new query may receive this error until the CPU wait-to-busy ratio falls back below min_os_cpu_wait_time_ratio_to_throw; once the ratio reaches max_os_cpu_wait_time_ratio_to_throw, every new query is rejected. Applications without retry logic will experience failures.
Common Causes
- High CPU contention causing the CPU wait-to-busy time ratio to exceed min_os_cpu_wait_time_ratio_to_throw
- A sudden spike in query traffic (e.g., dashboard refresh storm) saturating available CPU
- Heavy background merges or mutations competing for CPU alongside query traffic
- Insufficient CPU resources for the workload
- A runaway query consuming disproportionate CPU and crowding out other queries
- Thresholds set too aggressively low for the workload, causing the server to reject queries under normal load
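To judge whether one of these causes is pushing the server over its thresholds, it can help to estimate the wait-to-busy ratio yourself from two readings of the cumulative counters OSCPUWaitMicroseconds and OSCPUVirtualTimeMicroseconds. The sketch below is only an approximation under stated assumptions: `CpuSample` and `wait_to_busy_ratio` are hypothetical names, the sampling interval is whatever you choose, and treating a busy time below `busy_threshold_us` (standing in for os_cpu_busy_time_threshold) as "no overload" is an assumption about how that setting is applied. The server's own bookkeeping may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    """One reading of the two cumulative counters (hypothetical helper type)."""
    wait_us: int  # cumulative OSCPUWaitMicroseconds
    busy_us: int  # cumulative OSCPUVirtualTimeMicroseconds


def wait_to_busy_ratio(prev: CpuSample, curr: CpuSample, busy_threshold_us: int) -> float:
    """Estimate the CPU wait-to-busy ratio over the interval between two samples."""
    wait_delta = curr.wait_us - prev.wait_us
    busy_delta = curr.busy_us - prev.busy_us
    # Assumption: if the CPU did too little useful work in the interval,
    # the interval is not treated as overloaded at all.
    if busy_delta <= 0 or busy_delta < busy_threshold_us:
        return 0.0
    return wait_delta / busy_delta
```

For example, 6 seconds of accumulated wait against 2 seconds of busy time in the same interval gives a ratio of 3.0, which you can then compare with the configured thresholds.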
Troubleshooting and Resolution Steps
Check the current number of running queries:
    SELECT count() FROM system.processes;
Review which queries are consuming the most resources:
    SELECT query_id, user, elapsed, memory_usage, read_rows, query
    FROM system.processes
    ORDER BY memory_usage DESC
    LIMIT 10;
Check server memory usage:
    SELECT metric, formatReadableSize(value) AS value
    FROM system.asynchronous_metrics
    WHERE metric IN ('OSMemoryTotal', 'OSMemoryFreeWithoutCached');
    -- Current memory tracked by the server process:
    SELECT formatReadableSize(value) AS memory_tracked
    FROM system.metrics
    WHERE metric = 'MemoryTracking';
Kill runaway queries that are consuming excessive resources:
    KILL QUERY WHERE query_id = 'problematic_query_id';
Check the overload-throw thresholds. These are query-level settings, so query system.settings:
    SELECT name, value
    FROM system.settings
    WHERE name IN ('min_os_cpu_wait_time_ratio_to_throw', 'max_os_cpu_wait_time_ratio_to_throw');
The related os_cpu_busy_time_threshold is a server-level setting, found in system.server_settings:
    SELECT name, value
    FROM system.server_settings
    WHERE name = 'os_cpu_busy_time_threshold';
Temporarily reduce background merge activity to free resources:
    SYSTEM STOP MERGES;
    -- After the overload is resolved:
    SYSTEM START MERGES;
If the issue is persistent, add capacity or optimize queries:
    -- Check which queries are most frequent and expensive
    SELECT
        normalized_query_hash,
        count() AS cnt,
        avg(query_duration_ms) AS avg_duration,
        avg(memory_usage) AS avg_memory
    FROM system.query_log
    WHERE event_date = today()
      AND type = 'QueryFinish' -- count each completed query once; QueryStart rows would double the count and drag the averages down
    GROUP BY normalized_query_hash
    ORDER BY cnt * avg_memory DESC
    LIMIT 20;
Best Practices
- Tune min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw to match your tolerance for CPU contention; raising them makes the server less likely to reject queries.
- Implement query queuing and retry logic with exponential backoff in client applications, since SERVER_OVERLOADED is a transient, probabilistic rejection.
- Limit concurrency separately with max_concurrent_queries and max_concurrent_queries_for_user to keep CPU contention from building up in the first place.
- Monitor CPU wait time (OSCPUWaitMicroseconds) and busy time (OSCPUVirtualTimeMicroseconds) and set up alerts before the overload threshold is reached.
- Optimize expensive queries to reduce per-query CPU consumption.
- Spread background merges and mutations across off-peak hours when possible.
- Consider horizontal scaling (adding more replicas or shards) if overload is a recurring issue.
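The retry advice in this list could look roughly like the following on the client side. This is a minimal sketch, not tied to any particular ClickHouse driver: `ServerOverloadedError` is a hypothetical stand-in for whatever exception your client raises for SERVER_OVERLOADED, `run_with_backoff` and its parameters are illustrative names, and the delay values are arbitrary defaults you would tune.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerOverloadedError(Exception):
    """Hypothetical stand-in for a driver exception carrying SERVER_OVERLOADED."""


def run_with_backoff(
    run_query: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call run_query, retrying only on overload rejections."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except ServerOverloadedError:
            # Give up after the last allowed attempt and surface the error.
            if attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(ceiling * rng())
    raise ValueError("max_attempts must be at least 1")
```

Retrying only on the overload error, rather than on every failure, keeps genuine query errors visible. The jitter matters because a burst of clients retrying at the same instant would recreate the spike that triggered the rejections.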
Frequently Asked Questions
Q: Does SERVER_OVERLOADED mean the server is crashing?
A: No. It means the server is proactively rejecting new queries to prevent a crash. The server is functioning correctly by protecting itself. Existing queries continue to run, and the server will accept new queries once the load decreases.
Q: How do I stop the server from throwing this error?
A: The throw behavior is governed by min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw. Raising these thresholds makes rejections less likely, and setting min equal to max collapses the probabilistic range into a single cutoff, but in either case the underlying CPU contention remains. The better fix is to reduce CPU pressure by optimizing queries, limiting concurrency, or adding CPU capacity.
Q: Should I simply raise the thresholds when I see this error?
A: Not without understanding the root cause. Raising the thresholds without addressing CPU contention only defers the problem and can let the server degrade further before backpressure kicks in. First identify why CPU is saturated, then optimize queries, add capacity, or adjust limits appropriately.