ClickHouse Settings to Adjust from Defaults in Production

Q: Which setting catches the most production issues with the least friction?

Adding TTL to query_log and friends. It silently prevents the disk-fills-up incident that affects almost every long-running cluster.

Q: How do I pick `max_bytes_before_external_group_by`?

Roughly half of max_memory_usage . The threshold should fire before the hard memory cap, otherwise queries fail instead of spilling.

Q: Are these settings safe to apply via Kubernetes ConfigMaps?

Yes, with one caveat: avoid overrides for log tables, as the documentation calls out possible configuration clashes during upgrades.

ClickHouse defaults aim for safety on small installations rather than throughput on production clusters. Several settings should be adjusted on any non-trivial deployment. This guide lists the ones that matter most, grouped by area, with the values commonly used in production.

System Logging

Log tables grow forever by default. Add TTL to bound their size:

<query_log>
    <database>system</database>
    <table>query_log</table>
    <ttl>event_date + INTERVAL 90 DAY DELETE</ttl>
</query_log>

Recommended retentions:

Table	TTL
`query_log`	30 to 90 days
`query_thread_log`	7 to 14 days, or disable
`trace_log`	14 days
`metric_log`	14 days, or disable if external monitoring exists
`asynchronous_metric_log`	14 days, or remove via `<asynchronous_metric_log remove="1"/>`
`part_log`	90 days, useful for merge analysis

Keep query_log enabled. It has very useful information for debugging.

query_thread_log can be disabled per user without restart:

<profiles>
    <default>
        <log_query_threads>0</log_query_threads>
    </default>
</profiles>

Query Logging Defaults

Enable query logging globally for traceability:

<profiles>
    <default>
        <log_queries>1</log_queries>
    </default>
</profiles>

External Aggregation and Sort

By default, large GROUP BY and ORDER BY queries fail when they exceed max_memory_usage. Enable spill-to-disk so they complete instead:

<profiles>
    <default>
        <max_bytes_before_external_group_by>2000000000</max_bytes_before_external_group_by>
        <max_bytes_before_external_sort>2000000000</max_bytes_before_external_sort>
    </default>
</profiles>

2 GB is a starting point. Set to roughly half of max_memory_usage so the spill kicks in before the hard limit.

Prevent Accidental Full Scans

force_index_by_date and force_primary_key make ClickHouse reject queries that would scan an entire table because they cannot use the partition or primary key:

<profiles>
    <default>
        <force_index_by_date>1</force_index_by_date>
        <force_primary_key>1</force_primary_key>
        <max_execution_time>300</max_execution_time>
    </default>
</profiles>

Combined with max_execution_time, these settings catch the queries that take a server down.

Concurrency Limits

The defaults are conservative for modern hardware. For workloads with many async inserts and parallel selects, raise them:

<max_concurrent_queries>500</max_concurrent_queries>
<max_concurrent_insert_queries>400</max_concurrent_insert_queries>
<max_concurrent_select_queries>100</max_concurrent_select_queries>

These live at the top level of config.xml, not under <profiles>. They are server-wide caps.

FINAL Query Optimization

If you use SELECT ... FINAL against ReplacingMergeTree or similar engines, enabling this setting allows the final merge to run per partition in parallel:

<profiles>
    <default>
        <do_not_merge_across_partitions_select_final>1</do_not_merge_across_partitions_select_final>
    </default>
</profiles>

It is safe only when the table's logical merge happens entirely within a partition, which is the normal case for time-partitioned data.

Memory Limits

Configure memory caps at three levels:

<profiles>
    <default>
        <max_memory_usage>10000000000</max_memory_usage>
        <max_memory_usage_for_user>20000000000</max_memory_usage_for_user>
    </default>
</profiles>

<max_server_memory_usage_to_ram_ratio>0.9</max_server_memory_usage_to_ram_ratio>

A typical breakdown: per-query 25% of RAM, per-user 50%, server 90%. Adjust to your workload.

RBAC

Enable SQL-driven user management instead of editing users.xml files:

<users>
    <default>
        <access_management>1</access_management>
    </default>
</users>

Now you can use CREATE USER, CREATE ROLE, and GRANT statements at runtime.

Durability

If inserts cannot be re-driven, raise the durability bar with quorum writes:

<profiles>
    <default>
        <insert_quorum>2</insert_quorum>
        <insert_quorum_timeout>60000</insert_quorum_timeout>
        <select_sequential_consistency>1</select_sequential_consistency>
    </default>
</profiles>

With insert_quorum = 2, an insert is only acknowledged once two replicas have it. Use 0 (default) on workloads that can tolerate replay.

Common Pitfalls

Setting max_memory_usage too high so a single query can OOM the box.
Enabling force_primary_key on a cluster where many existing dashboards use full scans. Roll out per-user first.
Forgetting that max_concurrent_queries is server-wide. Per-user limits go under <profiles>.
Setting insert_quorum without insert_quorum_timeout. A failing replica stalls every insert.
Adding <engine> blocks to log tables in Kubernetes. Stick to <ttl> to avoid clashes with chart upgrades.

Frequently Asked Questions

Q: Which setting catches the most production issues with the least friction? A: Adding TTL to query_log and friends. It silently prevents the disk-fills-up incident that affects almost every long-running cluster.

Q: Should I enable force_primary_key everywhere? A: Eventually yes, but roll it out to one profile at a time. Existing dashboards may rely on key-less scans.

Q: How do I pick max_bytes_before_external_group_by? A: Roughly half of max_memory_usage. The threshold should fire before the hard memory cap, otherwise queries fail instead of spilling.

Q: Do I need to restart for these changes? A: User profile settings are picked up on the next connection. Server-level settings (max_concurrent_queries, log table TTL via <ttl>) usually need a config reload, which is a SIGHUP, not a restart.

Q: Are these settings safe to apply via Kubernetes ConfigMaps? A: Yes, with one caveat: avoid <engine> overrides for log tables, as the documentation calls out possible configuration clashes during upgrades.