ClickHouse ships with a rich set of built-in observability primitives that most other databases don't come close to matching. The tricky part isn't instrumenting it - it's knowing which of the hundreds of available metrics and system tables are actually diagnostic. A server can look healthy by conventional OS metrics (CPU, memory, disk) while quietly accumulating replication lag, drowning in merge backlog, or executing one slow query that blocks others. This article covers the monitoring approach that actually catches those problems in production.
System Tables as the Primary Monitoring Interface
ClickHouse exposes its internals through system tables rather than proprietary status commands. Four tables cover the majority of operational monitoring needs: system.metrics, system.events, system.asynchronous_metrics, and system.query_log. They serve different purposes and update at different frequencies.
system.metrics holds the current instantaneous value for a set of named gauges - things like the number of queries currently running, active HTTP connections, and the merge thread pool occupancy. system.events is a monotonically increasing counter table, tracking lifetime totals for events like the number of SELECT queries executed, bytes read from disk, and ZooKeeper transactions issued. Neither table stores history by default; they reflect the current process state. For historical trends, ClickHouse writes periodic snapshots to system.metric_log and system.asynchronous_metric_log. Both have been enabled by default since version 21.8; to disable them, use remove="1" on the respective config elements.
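A quick way to get a feel for both tables is to pull a handful of well-known names - current gauges from system.metrics and lifetime counters from system.events (the names below are standard, though the exact set varies slightly by version):
SELECT metric, value
FROM system.metrics
WHERE metric IN ('Query', 'TCPConnection', 'HTTPConnection');
SELECT event, value
FROM system.events
WHERE event IN ('Query', 'SelectQuery', 'InsertQuery');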
system.asynchronous_metrics is updated periodically in the background (the interval is controlled by the asynchronous_metrics_update_period_s server setting) and contains computed metrics that are too expensive to calculate inline - actual RSS memory usage, filesystem-level disk usage, and replica lag aggregates like ReplicasMaxAbsoluteDelay. The query below gives you a quick health snapshot from this table:
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN (
'ReplicasMaxAbsoluteDelay',
'ReplicasMaxQueueSize',
'UncompressedCacheCells',
'MarkCacheFiles',
'jemalloc.resident'
)
ORDER BY metric;
system.query_log is where you spend most of your time diagnosing performance issues. It writes rows keyed by query event - QueryStart, QueryFinish, and the exception types - which is why the queries below filter on type = 'QueryFinish'. Each row carries the query text, duration in milliseconds, peak memory usage, rows read, bytes read from disk, and the user who ran it. The table is written to asynchronously and is not guaranteed to be complete for queries that are still running - use system.processes for live queries. A query that surfaces your slowest recent queries:
SELECT
query_id,
user,
query_duration_ms,
memory_usage,
read_bytes,
read_rows,
result_rows,
substring(query, 1, 120) AS query_snippet
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 20;
For percentile tracking across a time window - which matters more than averages in high-variance workloads - use quantiles():
SELECT
toStartOfMinute(event_time) AS minute,
quantiles(0.5, 0.95, 0.99)(query_duration_ms) AS duration_quantiles
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
Tracking Parts and Merge Health
The system.parts table is the authoritative source for MergeTree storage health. Each row represents one data part on disk. The metric that matters most is how many active parts exist per partition: ClickHouse merges parts in the background, and query performance degrades as the part count grows because every SELECT must open and read each part that isn't pruned by the partition key or primary index.
ClickHouse applies two thresholds per partition: parts_to_delay_insert (default 150 in older releases) starts artificially slowing inserts to give merges time to catch up, and parts_to_throw_insert (default 300 in older releases) rejects inserts outright with a "Too many parts" exception. Recent versions ship considerably higher defaults, and both settings can be overridden per table, so check the values in effect on your server. Either way, a healthy table should stay well below the delay threshold. This query surfaces the worst offenders:
SELECT
database,
table,
partition,
count() AS part_count,
sum(bytes_on_disk) AS total_bytes
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING part_count > 50
ORDER BY part_count DESC
LIMIT 20;
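Because the thresholds differ across versions and can be overridden per table, it's worth confirming the values actually in effect on your server from system.merge_tree_settings (per-table overrides, if any, show up in SHOW CREATE TABLE):
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');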
A high part count in a specific partition tells you that the merge thread pool is behind on that table - either because inserts are arriving faster than merges can process them, or because the background merge pool is starved by other work. The system.merges table shows what's currently merging and how fast it's progressing:
SELECT
database,
table,
elapsed,
progress,
total_size_bytes_compressed,
rows_read,
rows_written
FROM system.merges
ORDER BY elapsed DESC;
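If merges are running but not keeping up, check whether the background pool itself is saturated. On recent releases the relevant gauges live in system.metrics; older versions exposed a single BackgroundPoolTask metric instead, so adjust the names if needed:
SELECT metric, value
FROM system.metrics
WHERE metric IN ('BackgroundMergesAndMutationsPoolTask', 'BackgroundMergesAndMutationsPoolSize');
A task count persistently equal to the pool size means merges are queuing behind other background work.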
Prometheus Integration
ClickHouse has a built-in HTTP endpoint that serves Prometheus-format metrics. Enable it with a block in config.xml (or a file in config.d/):
<prometheus>
<endpoint>/metrics</endpoint>
<port>9363</port>
<metrics>true</metrics>
<events>true</events>
<asynchronous_metrics>true</asynchronous_metrics>
<status_info>true</status_info>
</prometheus>
Once in place, curl -s http://localhost:9363/metrics returns several hundred metrics in Prometheus text format. The metrics flag exports system.metrics values, events exports system.events counters, and asynchronous_metrics covers the background-computed values. All three are worth enabling - they cover different signal types and the overhead is negligible.
The Prometheus scrape config is standard:
scrape_configs:
  - job_name: clickhouse
    static_configs:
      - targets: ['clickhouse-host:9363']
    scrape_interval: 15s
The naming convention ClickHouse uses for Prometheus metrics follows the pattern ClickHouseMetrics_* for current gauges, ClickHouseProfileEvents_* for event counters, and ClickHouseAsyncMetrics_* for asynchronous metrics. The ones worth alerting on directly from Prometheus include ClickHouseMetrics_Query (current concurrent query count), ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay (replication lag in seconds), and ClickHouseMetrics_BackgroundMergesAndMutationsPoolTask (background merge/mutation pool saturation - the older ClickHouseMetrics_BackgroundPoolTask was split into separate pool metrics in version 21.11 and no longer exists).
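On the Prometheus side, the ProfileEvents counters are monotonic and are normally wrapped in rate(), while the gauges can be used directly. Two illustrative PromQL expressions built on the names above:
# Overall query throughput, derived from the lifetime counter
rate(ClickHouseProfileEvents_Query[5m])
# Replication lag in seconds, usable directly in graphs and alert expressions
ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay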
One limitation of the built-in endpoint: it doesn't expose per-table or per-query breakdown. For that level of detail you still need to query system tables directly, either from Grafana using the ClickHouse datasource plugin, or from your own monitoring scripts.
Key Metrics and Alerting Thresholds
Replication lag is the most operationally dangerous metric for replicated clusters. ClickHouse stores it in system.replicas:
SELECT
database,
table,
is_leader,
absolute_delay,
queue_size,
inserts_in_queue,
merges_in_queue
FROM system.replicas
ORDER BY absolute_delay DESC;
absolute_delay measures how many seconds of lag this replica has accumulated relative to the most up-to-date state of the table. Lag under 5 seconds is normal during heavy insert bursts. Sustained lag above 30 seconds warrants investigation. Above 300 seconds, the replica is practically offline for consistent reads. inserts_in_queue growing while merges_in_queue stays low points to a network or ZooKeeper issue; the inverse suggests local disk or CPU saturation.
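When the queue is backed up, system.replication_queue shows the individual entries and, crucially, why ClickHouse keeps postponing them:
SELECT
database,
table,
type,
create_time,
num_tries,
last_exception,
postpone_reason
FROM system.replication_queue
ORDER BY create_time ASC
LIMIT 20;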
ZooKeeper (or ClickHouse Keeper) health directly affects replication. A blocked ZooKeeper session stalls all DDL operations and can halt replication entirely. The ZooKeeperSessionExpired metric from system.metrics (exposed in Prometheus as ClickHouseMetrics_ZooKeeperSessionExpired) tracks expired sessions - any non-zero value there should page someone. ZooKeeper latency consistently above 50ms is a sign the ensemble is overloaded, often because mutation frequency is too high or because insert_quorum is enabled without a correspondingly sized ZooKeeper cluster.
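A rough per-transaction wait time can be derived on the Prometheus side - this sketch assumes your version exports the ZooKeeperWaitMicroseconds and ZooKeeperTransactions profile events:
# Approximate average Keeper/ZooKeeper wait per transaction, in microseconds
rate(ClickHouseProfileEvents_ZooKeeperWaitMicroseconds[5m])
  / rate(ClickHouseProfileEvents_ZooKeeperTransactions[5m])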
Memory is tracked at the query level through system.query_log.memory_usage and at the server level through ClickHouseAsyncMetrics_jemalloc_resident. A query exceeding several gigabytes of RAM is not unusual for large aggregations, but memory usage climbing near the max_server_memory_usage limit (which defaults to 90% of available RAM) will cause queries to start failing with memory limit exceeded errors:
SELECT
user,
max(memory_usage) AS peak_memory,
count() AS query_count,
substring(query, 1, 100) AS query_snippet
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 6 HOUR
GROUP BY user, query_snippet
ORDER BY peak_memory DESC
LIMIT 10;
For disk, watch ClickHouseAsyncMetrics_FilesystemMainPathAvailableBytes and ClickHouseAsyncMetrics_FilesystemLogsPathAvailableBytes separately - the data path and log path can be on different filesystems, and ClickHouse will stop accepting writes if either runs out of space.
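The same information is visible from inside ClickHouse via system.disks, which is convenient when data, logs, or tiered storage volumes sit on different filesystems:
SELECT
name,
path,
formatReadableSize(free_space) AS free,
formatReadableSize(total_space) AS total,
round(100 * free_space / total_space, 1) AS free_pct
FROM system.disks;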
Grafana Dashboards
The official ClickHouse datasource plugin for Grafana (grafana-clickhouse-datasource) ships with three built-in dashboards: Cluster Analysis, Data Analysis, and Query Analysis. Cluster Analysis covers merges, mutations, and replication state. Data Analysis shows partition and part breakdown per table. Query Analysis exposes latency distributions, error rates, and resource consumption per query type. These dashboards query system tables directly over whichever protocol the datasource is configured for (HTTP on port 8123 or the native binary protocol on port 9000), which gives them per-table granularity but requires the Grafana server to have network access to ClickHouse on that port.
For Prometheus-based Grafana setups, dashboard ID 23285 ("ClickHouse and Keeper Comprehensive Dashboard") on Grafana Labs is the most complete community option as of early 2026. It covers server resource metrics, replication health, Keeper session state, and merge queue depth, all sourced from the Prometheus endpoint. The Altinity ClickHouse Operator dashboard (ID 12163) is relevant if you're running ClickHouse on Kubernetes via the Altinity operator.
A practical limitation of the official built-in dashboards is that they're oriented toward support analysis rather than production alerting - they're dense with raw metrics but don't ship with alert rules. Building alert rules on top of the Prometheus metrics is straightforward; the harder part is calibrating thresholds against your actual workload rather than generic defaults. Reasonable starting points: ReplicasMaxAbsoluteDelay above 30 seconds, part count per partition exceeding 100, and concurrent query count (ClickHouseMetrics_Query) sitting at or near your max_concurrent_queries ceiling for more than a few seconds.
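A minimal Prometheus rules file along those lines might look like the sketch below. The thresholds and durations are illustrative and should be tuned to your workload; the part-count alert is deliberately absent because, as noted earlier, per-partition part counts aren't exposed by the built-in endpoint:
groups:
  - name: clickhouse
    rules:
      - alert: ClickHouseReplicationLag
        expr: ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay > 30
        for: 5m
      - alert: ClickHouseConcurrentQueriesHigh
        # Tune to sit just below your max_concurrent_queries setting (default 100)
        expr: ClickHouseMetrics_Query > 80
        for: 2m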
The system table approach and the Prometheus approach are complementary rather than alternatives. Prometheus is better for time-series alerting and cross-host aggregation. System tables are better for root-cause analysis after an alert fires - they give you the actual query text, per-table breakdown, and merge-level detail that Prometheus metrics can't carry. A mature ClickHouse monitoring setup uses both.