ClickHouse System Tables: Monitoring and Introspection

ClickHouse system tables are special read-only tables in the system database that provide introspection and monitoring capabilities for your ClickHouse cluster. These tables contain real-time information about server configuration, running queries, table metadata, replication status, metrics, and historical query logs. System tables are essential tools for database administrators and developers to monitor performance, troubleshoot issues, and understand cluster behavior.

Key System Tables

Query and Performance Monitoring

system.query_log - Contains information about executed queries including execution time, memory usage, and read/written rows. This table is only populated if the log_queries setting is enabled.

system.processes - Shows currently running queries and their progress. Useful for identifying long-running or problematic queries in real-time.

system.query_thread_log - Provides detailed information about threads used during query execution, including CPU time and memory allocations.

system.query_views_log - Records information about materialized and live views accessed during query execution.

Metrics and Statistics

system.metrics - Contains current metric values like number of connections, running queries, and memory usage. These are point-in-time snapshots.

system.events - Cumulative counters for various events like queries executed, inserted rows, and bytes read. Resets on server restart.

system.asynchronous_metrics - Metrics that are calculated periodically in the background, such as CPU usage, memory consumption, and disk utilization.

system.metric_log - Historical log of metric values from system.metrics and system.asynchronous_metrics, useful for trend analysis.

Table and Database Information

system.tables - Metadata about all tables including engine type, creation date, and storage statistics.

system.columns - Information about columns in all tables including data types, compression codecs, and default values.

system.databases - Lists all databases with their engines and metadata.

system.parts - Detailed information about data parts in MergeTree family tables, including part size, row count, and modification times.

system.parts_columns - Statistics about columns within each data part, useful for understanding compression ratios and storage efficiency.

Replication and Distribution

system.replicas - Status information for replicated tables including replication lag, queue size, and whether replicas are active.

system.replication_queue - Shows pending replication operations waiting to be executed on replica tables.

system.clusters - Configuration of defined clusters for distributed query execution.

system.distributed_ddl_queue - Queue of DDL operations to be executed across distributed clusters.

system.mutations - Information about ongoing and completed mutations (ALTER TABLE operations) on MergeTree tables.

Merges and Background Operations

system.merges - Currently running merge operations, including progress and estimated completion time.

system.merge_tree_settings - Default settings for MergeTree family table engines.

system.background_schedule_pool_size - Information about background task scheduling.

Server Configuration

system.settings - All available server settings and their current values.

system.server_settings - Server-level configuration settings from config files.

system.build_options - Information about how ClickHouse was built, including compiler version and enabled features.

system.macros - User-defined macros used in DDL queries, particularly for distributed setups.

Best Practices

Enable query logging selectively: The query_log table can grow rapidly. Use log_queries_min_type and retention policies to manage its size while capturing important information.
Regular monitoring: Query system.metrics and system.events regularly to establish baseline performance metrics and identify anomalies.
Monitor replication lag: For replicated setups, regularly check system.replicas to ensure replication is healthy and lag is minimal.
Track resource usage: Use system.asynchronous_metrics to monitor CPU, memory, and disk usage trends over time.
Identify slow queries: Query system.query_log to find queries with high execution time or resource consumption for optimization opportunities.
Monitor background operations: Check system.merges and system.mutations to understand background operation load and potential performance impacts.
Archive historical data: Export important system table data periodically for long-term trend analysis and compliance requirements.
Use retention policies: Configure appropriate TTL for query_log, metric_log, and other growing system tables to prevent unbounded growth.

Common Issues or Misuses

Query log overflow: Not setting appropriate retention policies on query_log can lead to excessive disk usage and performance degradation.
Overquerying system tables: Excessive queries to system tables, especially large ones like query_log and parts, can impact cluster performance.
Misinterpreting metrics: Not understanding the difference between cumulative counters (system.events) and current values (system.metrics) can lead to incorrect conclusions.
Ignoring replication warnings: Failing to monitor system.replicas can result in unnoticed replication failures or growing replication lag.
Not monitoring parts: Neglecting system.parts can lead to undetected issues like too many small parts or failed merges.
Insufficient retention: Setting overly aggressive retention policies may remove historical data needed for troubleshooting or compliance.
Security oversights: Granting unrestricted access to system tables can expose sensitive information about queries, users, and cluster configuration.

Additional Relevant Information

Query Log Configuration

To enable query logging, configure the following in your ClickHouse configuration:

<query_log>
    <database>system</database>
    <table>query_log</table>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>

Useful Monitoring Queries

Find slow queries:

SELECT query, query_duration_ms, memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 10000
ORDER BY query_duration_ms DESC
LIMIT 10;

Check replication lag:

SELECT database, table, absolute_delay, queue_size, is_readonly
FROM system.replicas
WHERE absolute_delay > 10
ORDER BY absolute_delay DESC;

Monitor current query load:

SELECT user, count() AS query_count, sum(read_rows) AS total_rows_read
FROM system.processes
GROUP BY user;

Identify tables with too many parts:

SELECT database, table, count() AS parts_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING parts_count > 100
ORDER BY parts_count DESC;

System Table Engines

Most system tables use special engines:

SystemLog: Used for *_log tables, supports partitioning and TTL
SystemNumbers: Generates number sequences
SystemOne: Returns a single row, useful for testing

System tables are generally read-only, with a few exceptions like system.distributed_ddl_queue which supports deletion of completed entries.

Frequently Asked Questions

Q: How do I enable query logging in ClickHouse?
A: Query logging is typically enabled by default, but you need to ensure the log_queries setting is set to 1. You can verify this with SELECT value FROM system.settings WHERE name = 'log_queries'. To enable it for your session, use SET log_queries = 1. For server-wide configuration, add the appropriate settings to your config.xml file.

Q: Why is my system.query_log table growing so large?
A: The query_log table records every query executed when logging is enabled. To manage its size, configure a TTL (Time To Live) in your server configuration to automatically delete old entries. For example: <ttl>event_date + INTERVAL 30 DAY</ttl> will keep logs for 30 days.

Q: How can I see what queries are currently running?
A: Query the system.processes table to see all currently executing queries: SELECT query_id, user, query, elapsed FROM system.processes. This shows real-time information about active queries, their duration, and which user initiated them.

Q: What's the difference between system.metrics and system.events?
A: system.metrics contains current, instantaneous values (like number of active connections right now), while system.events contains cumulative counters since server start (like total number of queries executed). Use metrics for current state and events for analyzing trends over time.

Q: How do I monitor replication health in ClickHouse?
A: Use the system.replicas table to monitor replication status. Key columns to watch include absolute_delay (replication lag in seconds), queue_size (pending operations), is_readonly (whether the replica is read-only), and total_replicas vs active_replicas. Set up alerts when absolute_delay exceeds acceptable thresholds.

Q: Can I delete data from system tables?
A: Most system tables are read-only and don't support deletions. However, you can configure TTL policies for *_log tables to automatically remove old data. For some special cases like system.distributed_ddl_queue, you can delete specific entries. Generally, manage system table size through configuration rather than manual deletion.

Q: How do I find which tables are using the most disk space?
A: Query the system.parts table and aggregate by table: SELECT database, table, sum(bytes_on_disk) AS size FROM system.parts WHERE active GROUP BY database, table ORDER BY size DESC. This shows the total size of active parts for each table.

Q: What should I monitor to ensure cluster health?
A: Key system tables to monitor include: system.replicas for replication health, system.merges for background merge operations, system.mutations for pending alterations, system.asynchronous_metrics for resource usage, and system.parts for part count and fragmentation. Set up regular checks on these tables and alert on anomalies.