ClickHouse DB::Exception: Server overloaded

The "DB::Exception: Server overloaded" error in ClickHouse indicates that the server is rejecting a new query because it has detected CPU overload. The SERVER_OVERLOADED error code is raised based on the ratio of CPU wait time to CPU busy time. Two query-level settings bound the behavior: once the ratio exceeds min_os_cpu_wait_time_ratio_to_throw, a new query is rejected with a probability that scales linearly with the ratio, up to max_os_cpu_wait_time_ratio_to_throw. This is a backpressure mechanism that protects the server from being overwhelmed when CPU contention is high.
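
The linear scaling described above can be sketched in a few lines of Python. This is only an illustrative model, not the server's actual implementation: the function name is hypothetical, and it assumes the probability is 0 at the min threshold and 1 at the max threshold, with a straight line in between.

```python
def rejection_probability(wait_us: float, busy_us: float,
                          min_ratio: float, max_ratio: float) -> float:
    """Illustrative model of the probability that a new query is rejected.

    wait_us / busy_us are CPU wait and busy times over the same interval.
    min_ratio / max_ratio stand in for min_os_cpu_wait_time_ratio_to_throw
    and max_os_cpu_wait_time_ratio_to_throw.
    """
    if busy_us <= 0:
        # Assumption: with no busy time there is no meaningful ratio.
        return 0.0
    ratio = wait_us / busy_us
    if ratio <= min_ratio:
        return 0.0  # at or below the lower threshold: never reject
    if ratio >= max_ratio:
        return 1.0  # at or above the upper threshold: always reject
    # Linear interpolation between the two thresholds.
    return (ratio - min_ratio) / (max_ratio - min_ratio)
```

For example, with hypothetical thresholds of 2 and 6, a wait-to-busy ratio of 4 sits halfway between them, so under this model about half of new queries would be rejected.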

Impact

New queries may be rejected with an immediate error; queries already in progress continue to execute. Because rejection is probabilistic, some new queries can still be accepted while others fail, and the share of rejected queries grows as CPU contention worsens. Any user submitting new queries can receive this error until the server's load decreases. Applications without retry logic will experience failures.

Common Causes

  1. High CPU contention causing the CPU wait-to-busy time ratio to exceed min_os_cpu_wait_time_ratio_to_throw
  2. A sudden spike in query traffic (e.g., dashboard refresh storm) saturating available CPU
  3. Heavy background merges or mutations competing for CPU alongside query traffic
  4. Insufficient CPU resources for the workload
  5. A runaway query consuming disproportionate CPU and crowding out other queries
  6. Thresholds set too aggressively low for the workload, causing the server to reject queries under normal load

Troubleshooting and Resolution Steps

  1. Check the current number of running queries:

    SELECT count() FROM system.processes;
    
  2. Review which running queries are consuming the most CPU:

    SELECT query_id, user, elapsed, memory_usage, read_rows,
           ProfileEvents['OSCPUVirtualTimeMicroseconds'] AS cpu_us,  -- CPU time used so far
           query
    FROM system.processes
    ORDER BY cpu_us DESC
    LIMIT 10;
    
  3. Check server memory usage:

    SELECT metric, formatReadableSize(value) AS value
    FROM system.asynchronous_metrics
    WHERE metric IN ('OSMemoryTotal', 'OSMemoryFreeWithoutCached');
    
    -- Current memory tracked by the server process:
    SELECT formatReadableSize(value) AS memory_tracked
    FROM system.metrics
    WHERE metric = 'MemoryTracking';
    
  4. Kill runaway queries that are consuming excessive resources:

    KILL QUERY WHERE query_id = 'problematic_query_id';
    
  5. Check the overload-throw thresholds. These are query-level settings, so query system.settings:

    SELECT name, value
    FROM system.settings
    WHERE name IN ('min_os_cpu_wait_time_ratio_to_throw',
                   'max_os_cpu_wait_time_ratio_to_throw');
    

    The related os_cpu_busy_time_threshold is a server-level setting, found in system.server_settings:

    SELECT name, value
    FROM system.server_settings
    WHERE name = 'os_cpu_busy_time_threshold';
    
  6. Temporarily reduce background merge activity to free resources:

    SYSTEM STOP MERGES;
    -- After the overload is resolved (do not leave merges stopped for long,
    -- because unmerged parts accumulate):
    SYSTEM START MERGES;
    
  7. If the issue is persistent, add capacity or optimize queries:

    -- Check which queries are most frequent and expensive
    SELECT normalized_query_hash, count() AS cnt,
           avg(query_duration_ms) AS avg_duration,
           avg(memory_usage) AS avg_memory
    FROM system.query_log
    WHERE event_date = today()
      AND type = 'QueryFinish'  -- skip QueryStart rows so each query is counted once
    GROUP BY normalized_query_hash
    ORDER BY cnt * avg_memory DESC
    LIMIT 20;
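
To watch how close the server is to the thresholds checked in step 5, the wait-to-busy ratio can be estimated from two samples of the cumulative counters OSCPUWaitMicroseconds and OSCPUVirtualTimeMicroseconds (for example, read from system.events a few seconds apart). The Python sketch below is an assumption-laden illustration: the function names and the 0.8 margin are hypothetical, and the server's own sampling window may differ.

```python
from typing import Mapping, Optional

WAIT = "OSCPUWaitMicroseconds"
BUSY = "OSCPUVirtualTimeMicroseconds"


def wait_to_busy_ratio(prev: Mapping[str, int],
                       curr: Mapping[str, int]) -> Optional[float]:
    """Ratio of CPU wait to CPU busy time between two counter samples.

    Returns None when the interval is unusable, e.g. no busy time was
    recorded or the counters went backwards after a server restart.
    """
    wait = curr[WAIT] - prev[WAIT]
    busy = curr[BUSY] - prev[BUSY]
    if busy <= 0 or wait < 0:
        return None
    return wait / busy


def should_alert(ratio: Optional[float], min_ratio_to_throw: float,
                 margin: float = 0.8) -> bool:
    """Alert once the ratio reaches a fraction (margin) of the min threshold."""
    return ratio is not None and ratio >= margin * min_ratio_to_throw
```

Alerting at a fraction of min_os_cpu_wait_time_ratio_to_throw gives some warning before queries start being rejected; the right margin depends on how quickly load ramps up in your workload.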
    

Best Practices

  • Tune min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw to match your tolerance for CPU contention; raising them makes the server less likely to reject queries.
  • Implement query queuing and retry logic with exponential backoff in client applications, since SERVER_OVERLOADED is a transient, probabilistic rejection.
  • Limit concurrency separately with max_concurrent_queries and max_concurrent_queries_for_user to keep CPU contention from building up in the first place.
  • Monitor CPU wait time (OSCPUWaitMicroseconds) and busy time (OSCPUVirtualTimeMicroseconds) and set up alerts before the overload threshold is reached.
  • Optimize expensive queries to reduce per-query CPU consumption.
  • Schedule mutations and large batch loads, which drive background merge activity, during off-peak hours when possible.
  • Consider horizontal scaling (adding more replicas or shards) if overload is a recurring issue.
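
The retry recommendation above might look like the following Python sketch. It is a minimal illustration under stated assumptions: ServerOverloaded is a hypothetical exception that your client code would raise when the server returns SERVER_OVERLOADED, run_query is any callable that executes the query, and the delay values are placeholders to tune.

```python
import random
import time


class ServerOverloaded(Exception):
    """Hypothetical exception: raise it when the server returns SERVER_OVERLOADED."""


def run_with_backoff(run_query, max_attempts=5, base_delay=0.2, max_delay=10.0,
                     sleep=time.sleep, rng=random.random):
    """Call run_query(), retrying overload rejections with exponential backoff.

    The delay cap doubles on each attempt (bounded by max_delay) and is
    multiplied by a random factor in [0, 1) so that many clients do not
    retry at the same moment. Other exceptions propagate immediately.
    """
    for attempt in range(max_attempts):
        try:
            return run_query()
        except ServerOverloaded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
```

The sleep and rng parameters exist only to make the function easy to test; in production code the defaults are used. Randomizing the delay matters here because a dashboard refresh storm that retries in lockstep would recreate the same CPU spike.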

Frequently Asked Questions

Q: Does SERVER_OVERLOADED mean the server is crashing?
A: No. It means the server is proactively rejecting new queries to prevent a crash. The server is functioning correctly by protecting itself. Existing queries continue to run, and the server will accept new queries once the load decreases.

Q: How do I stop the server from throwing this error?
A: The throw behavior is governed by min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw. Raising these thresholds makes rejections less likely (setting min equal to max removes the probabilistic range and leaves a single cutoff), but the underlying CPU contention remains. The better fix is to reduce CPU pressure by optimizing queries, limiting concurrency, or adding CPU capacity.

Q: Should I simply raise the thresholds when I see this error?
A: Not without understanding the root cause. Raising the thresholds without addressing CPU contention only defers the problem and can let the server degrade further before backpressure kicks in. First identify why CPU is saturated, then optimize queries, add capacity, or adjust limits appropriately.
