NEW

Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

ClickHouse High CPU Usage: Troubleshooting Guide

High CPU usage in ClickHouse is usually not a fault. ClickHouse is designed to use every available core to drive a single analytical query as fast as possible, so seeing 100% CPU on a heavy SELECT over a few billion rows is the expected behavior. The real questions are: which queries are causing it, is the work actually necessary, and can you cap individual users or queries to protect concurrent workloads. This guide walks through the controls in order: identify the offenders, reshape the queries when possible, then apply max_threads, os_thread_priority, and quotas to bound the remaining damage.

First Principle: 100% CPU Is Often Correct

For an analytical engine, sustained high CPU during query execution is normal. ClickHouse parallelizes scans, filters, aggregations, and sorts across all available cores by default. A 32-core box running a single big aggregate at 3200% CPU (in top units) is doing exactly what it was designed to do.

The problem is not the CPU number; it is when:

  • The high CPU comes from background work (merges, mutations, fetches) blocking queries.
  • A single user or query starves everyone else.
  • Queries are inefficient and process far more data than they need to.
  • Many concurrent queries collide and degrade tail latency.

Fix the cause, not the symptom. Throttling CPU always means slower queries. Useful when isolating tenants, harmful when it masks a bad schema.

Step 1: Identify the Offending Queries

Look at currently running queries and recent finished ones:

SELECT
    query_id,
    user,
    elapsed,
    read_rows,
    read_bytes,
    memory_usage,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 200) AS query
FROM system.processes
ORDER BY elapsed DESC;

For history (last hour, sorted by CPU time):

SELECT
    user,
    query_duration_ms,
    ProfileEvents['UserTimeMicroseconds'] / 1e6 AS user_cpu_seconds,
    ProfileEvents['SystemTimeMicroseconds'] / 1e6 AS sys_cpu_seconds,
    read_rows,
    formatReadableSize(read_bytes) AS read,
    substring(query, 1, 200) AS query
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
ORDER BY user_cpu_seconds DESC
LIMIT 20;

Aggregate by query pattern to find recurring offenders:

SELECT
    normalizeQuery(query) AS pattern,
    count() AS runs,
    avg(query_duration_ms) AS avg_ms,
    sum(ProfileEvents['UserTimeMicroseconds']) / 1e6 AS total_cpu_seconds
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
GROUP BY pattern
ORDER BY total_cpu_seconds DESC
LIMIT 20;

Also check background CPU consumers:

SELECT * FROM system.merges;
SELECT * FROM system.mutations WHERE NOT is_done;
SELECT type, count() FROM system.replication_queue GROUP BY type;

Step 2: Reshape the Workload

Long-term, the cheapest CPU is the CPU you do not spend. Before reaching for limits, ask:

  • Does the ORDER BY of the table match the most common WHERE clauses? If not, you are scanning far more than needed.
  • Are there obvious projections, materialized views, or precomputed aggregates that could replace expensive runtime aggregation?
  • Are queries written by an application using SELECT * when they need three columns? In a columnar store this is dramatic waste.
  • Are joins broadcasting large right-hand sides? Convert to dictionaries or IN (subquery) where possible.
  • Are FINAL queries needed at all, or can the application read with a deduplication-aware pattern?

Auto-generated SQL from BI tools is a frequent culprit. Capture a sample, review what it actually does, then fix the data model or the BI semantic layer.

Step 3: Cap Threads per Query

max_threads controls how many CPU cores a single request will use. Default matches the number of hardware threads available to ClickHouse. Lower it to leave headroom for other queries:

SET max_threads = 4;
SELECT max(number) FROM numbers(100000000);

On a 16-core box, max_threads = 4 caps that query at roughly 25% CPU. The query takes longer, but other workloads keep cores.

For specific users via the users.xml profile:

<profiles>
  <reporting>
    <max_threads>4</max_threads>
    <max_memory_usage>10000000000</max_memory_usage>
  </reporting>
</profiles>

Step 4: Lower Process Priority

os_thread_priority sets the niceness of query threads (Linux nice value, 0 to 19, higher means less priority):

SET os_thread_priority = 19;
SELECT max(number) FROM numbers(100000000);

Via command line:

clickhouse-client --os_thread_priority=19 -q 'SELECT max(number) FROM numbers(100000000)'

Via HTTP:

curl 'http://localhost:8123/?os_thread_priority=19' --data-binary @-

This does not cap CPU. It tells the kernel scheduler to prefer other processes when there is contention. Useful for background reporting jobs that should yield to interactive queries.

Step 5: Enforce Quotas

Quotas restrict per-user resource consumption over a rolling window:

<quotas>
  <analyst>
    <interval>
      <duration>3600</duration>
      <queries>1000</queries>
      <read_rows>100000000000</read_rows>
      <execution_time>3600</execution_time>
    </interval>
  </analyst>
</quotas>

Attach the quota to a user in users.xml. When the user exceeds it, ClickHouse rejects further queries until the window rolls. This is the right tool for multi-tenant environments where one tenant must not consume the whole box.

Common Pitfalls

  • Setting max_threads = 1 globally. This serializes all work and tanks throughput. Use it only on specific reporting profiles.
  • Confusing os_thread_priority with a CPU cap. It only matters under contention. On an idle box the query still uses all available cores.
  • Throttling without identifying the query. You hide the symptom and the bad query keeps running, just slower. Always inspect system.query_log first.
  • Ignoring background CPU. Merges, mutations, and TTL moves consume CPU too. Check system.merges and system.mutations before blaming user queries.
  • Misreading top. Per-thread CPU sums to thousands of percent on multi-core boxes. Use htop with thread grouping disabled, or read system.metrics directly.

Frequently Asked Questions

Q: Is 100% CPU usage in ClickHouse a problem? A: Not by itself. ClickHouse parallelizes analytical queries across all cores and is built to use them. It becomes a problem when it causes other queries to starve, when it persists during idle periods (suggesting runaway background work), or when it stems from inefficient queries.

Q: How do I find which query is consuming the most CPU? A: Query system.query_log filtered by event_time and ordered by ProfileEvents['UserTimeMicroseconds'] descending. For currently running queries, system.processes shows elapsed time and resource usage.

Q: What does max_threads actually do? A: It caps the number of threads ClickHouse uses for a single query's parallel execution. Lower values mean less CPU per query and longer query duration. Default matches the number of hardware threads available to ClickHouse.

Q: Does os_thread_priority reduce CPU usage? A: Only indirectly. It changes the Linux nice value so the kernel scheduler deprioritizes those threads when something else wants the CPU. On an otherwise idle server, the query still runs at full speed.

Q: How do I prevent one user from monopolizing the cluster? A: Use quotas (read_rows, execution_time, queries) in a user profile, combined with max_threads and max_concurrent_queries_for_user. This is the multi-tenant isolation toolkit.

Q: Will reducing CPU usage make queries faster? A: No. Every method that reduces CPU usage makes the affected query slower. The goal is to slow down low-priority queries so high-priority ones get the resources they need.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.