ClickHouse Version Upgrade Performance Regressions

After upgrading ClickHouse, queries that previously ran in seconds may take several times longer, memory usage may spike, or specific query patterns may behave differently. These regressions are not always caused by a single change: each ClickHouse release ships new setting defaults, an evolving query analyzer, revised join algorithms, and occasional bugs that interact with real workloads in unexpected ways.

What This Error Means

ClickHouse releases frequently — roughly twice a month for stable builds — and each release may change the default value of one or more settings. When you upgrade, any setting that you have not explicitly overridden in your configuration silently adopts the new default. A query that worked well under the old default may run a different execution plan, choose a different algorithm, or consume more memory under the new one. Unlike a hard error, a performance regression produces no error message; the query simply takes longer or uses more resources.

In addition to setting changes, each release may include query planner improvements that benefit most workloads while regressing edge cases, and occasionally a code-level bug that directly slows down query processing (for example, the 24.11 regression where job stack trace collection added significant overhead to queries scanning tens of billions of rows).

Common Causes

  1. Changed setting defaults. ClickHouse tracks setting default changes in system.settings_changes. High-impact changes in recent versions include join_algorithm (added parallel_hash to the default list in 24.11/24.12), max_parallel_replicas (raised from 1 to 1000 around 25.2), optimize_functions_to_subcolumns (enabled by default around 24.9), and output_format_json_quote_64bit_integers (disabled by default in 25.8).

  2. New query analyzer enabled by default. enable_analyzer (previously called allow_experimental_analyzer) became enabled by default in ClickHouse 24.3. The new analyzer validates queries more strictly, changes JOIN USING identifier resolution, and can generate deeper query plans for complex CTE + UNION queries or multiIf() + ARRAY JOIN combinations.

  3. Parallel join algorithm using more memory. The join_algorithm default changed from 'direct,hash' to 'direct,parallel_hash,hash' in 24.11. The parallel_hash algorithm uses more memory and can be slower for workloads where the old hash algorithm was already efficient. Note that this change was initially missing from SettingsHistory, meaning the compatibility setting did not protect against it until a follow-up fix (PR #75870).

  4. Aggressive parallel replicas. With max_parallel_replicas defaulting to 1000 since ~25.2 (PR #74504), ClickHouse now uses parallel replicas aggressively for queries. This adds coordination overhead for small datasets or single-replica clusters where parallelism provides no benefit.

  5. Code-level bugs. Some regressions are not setting changes at all. The 24.11 release introduced a regression where job stack trace collection overhead caused a query processing 77 billion rows to go from ~70 seconds to ~159 seconds, with peak memory rising from ~875 MiB to ~1.57 GiB. This was fixed in PR #80978.

  6. Heterogeneous cluster during rolling upgrade. Distributed queries executed against a mixed-version cluster slow down temporarily. The new query analyzer cannot be used across mixed-version clusters: if some nodes have enable_analyzer=1 and others have enable_analyzer=0, distributed queries involving those nodes will fail or perform poorly.

How to Fix

1. Identify which queries regressed

Compare average query duration before and after the upgrade using the upgrade timestamp as a boundary:

SELECT
    normalizedQueryHash(query) AS qhash,
    substr(query, 1, 80) AS query_preview,
    countIf(event_time < '<upgrade_timestamp>') AS runs_before,
    countIf(event_time >= '<upgrade_timestamp>') AS runs_after,
    avgIf(query_duration_ms, event_time < '<upgrade_timestamp>') AS avg_ms_before,
    avgIf(query_duration_ms, event_time >= '<upgrade_timestamp>') AS avg_ms_after,
    round(avg_ms_after / nullIf(avg_ms_before, 0), 2) AS slowdown_ratio
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Select'
GROUP BY qhash, query_preview
HAVING runs_before >= 5 AND runs_after >= 5 AND slowdown_ratio > 1.5
ORDER BY slowdown_ratio DESC
LIMIT 20;

2. Check which setting defaults changed between your versions

-- All default value changes from 24.3 onward
SELECT version, changes
FROM system.settings_changes
WHERE version >= '24.3'
ORDER BY version;

Then check whether those settings are overridden in your current session:

SELECT name, value, default, description
FROM system.settings
WHERE changed = 1
ORDER BY name;

3. Use the compatibility setting to restore old defaults

The compatibility setting restores default values for settings that were not explicitly overridden to match a specific ClickHouse version. Apply it per query or per session:

-- Per query
SELECT count(), avg(price)
FROM sales
SETTINGS compatibility = '23.8';

-- Per session
SET compatibility = '23.8';

Important limitation: the compatibility setting only protects against changes tracked in SettingsHistory. Some changes — including the join_algorithm change in 24.11 — were initially missing from SettingsHistory. Always verify behavior after upgrade even with compatibility set. On ClickHouse Cloud, the compatibility setting is pinned to the ClickHouse version at service creation time and does not change when the service is upgraded.

4. Disable the new query analyzer for affected queries

If regressions appear in complex CTE + UNION queries or multiIf() + ARRAY JOIN patterns, disable enable_analyzer per query as a targeted workaround:

SELECT *
FROM (
    WITH cte AS (SELECT id, sum(value) AS total FROM events GROUP BY id)
    SELECT id, total FROM cte
    UNION ALL
    SELECT id, 0 FROM fallback
)
SETTINGS enable_analyzer = 0;

Note that allow_experimental_analyzer is an alias for enable_analyzer and both names work.

5. Revert the join algorithm to pre-24.11 behavior

If queries with large joins regressed after upgrading past 24.11, revert to the previous default:

-- Per query
SELECT a.id, b.value
FROM table_a AS a
INNER JOIN table_b AS b ON a.id = b.id
SETTINGS join_algorithm = 'hash';

-- Session-wide
SET join_algorithm = 'direct,hash';

6. Compare query plans across versions

Use EXPLAIN PLAN with JSON output to compare execution plans structurally:

EXPLAIN PLAN
    indexes = 1,
    actions = 1,
    json = 1
SELECT count() FROM my_table WHERE date >= '2024-01-01';

Run this on both the old and new version and diff the output. A changed plan is the first signal that the optimizer is taking a different path.

7. Compare ProfileEvents for a specific query

For a query whose query_id is known, compare execution metrics between two runs:

WITH
    '<old_query_id>' AS qid_old,
    '<new_query_id>' AS qid_new
SELECT
    PE.Names AS metric,
    anyIf(PE.Values, query_id = qid_old) AS v_old,
    anyIf(PE.Values, query_id = qid_new) AS v_new,
    round((v_new - v_old) / (v_old + v_new + 1), 3) AS relative_change
FROM system.query_log
ARRAY JOIN ProfileEvents AS PE
WHERE query_id IN (qid_old, qid_new)
  AND type = 'QueryFinish'
GROUP BY metric
HAVING v_old != v_new
ORDER BY abs(relative_change) DESC
LIMIT 30;

Root-Cause Analysis

Use these system tables to trace the source of a regression:

-- 1. Confirm the version on each node
SELECT version();

-- 2. Find all settings that differ from defaults in the current session
SELECT name, value, default, description
FROM system.settings
WHERE changed = 1
ORDER BY name;

-- 3. Find changed MergeTree-level settings
SELECT name, value, default, changed, description
FROM system.merge_tree_settings
WHERE changed = 1;

-- 4. Check which settings were active for a specific slow query
SELECT Settings
FROM system.query_log
WHERE query_id = '<your_query_id>'
  AND type = 'QueryFinish';

-- 5. Detect currently stuck or unexpectedly slow running queries
SELECT
    query_id,
    formatReadableTimeDelta(elapsed) AS running_for,
    read_rows,
    formatReadableSize(memory_usage) AS memory,
    substr(query, 1, 100) AS query_preview
FROM system.processes
ORDER BY elapsed DESC;

Preventive Measures

  • Read the "Backward Incompatible Changes" section of the release notes for every version in your upgrade path, not just the target version.
  • Run production-representative queries against a staging cluster on the new version before upgrading. Target at least 24-48 hours of coverage including peak-load patterns.
  • Upgrade through intermediate LTS versions rather than jumping across multiple years. LTS releases are published in March and August and are supported for at least 12 months. Recent LTS versions: 23.8, 24.3, 24.8, 25.3, 25.8.
  • Avoid upgrading across the one-year compatibility window in a single step. If your nodes differ by more than one year of releases, plan a maintenance window and upgrade all nodes together, or stage through intermediate versions.
  • Before upgrading to 25.11 or later, set merge_tree.serialization_info_version = 'basic' and merge_tree.string_serialization_version = 'single_stream' in your MergeTree config section if rollback capability must be preserved. Once the new String serialization format is written to disk, downgrade to versions before 25.10 is blocked.
  • Place all configuration customizations in /etc/clickhouse-server/config.d/ so they are not overwritten during package upgrades.
  • Monitor query latency, memory usage, and merge rates closely for 48 hours post-upgrade.

Resolve Upgrade Regressions Automatically with Pulse

Tracking which setting defaults changed between two ClickHouse versions, identifying which queries were affected, and cross-referencing known regressions against release notes is time-consuming to do manually after every upgrade. Pulse analyzes your ClickHouse cluster's query history, surfaces queries that slowed down after a version change, and maps them to the specific setting default or known regression that is most likely responsible. Connect your cluster to Pulse to get upgrade impact reports before and after each release.

Frequently Asked Questions

Q: Can the compatibility setting fully protect me from performance changes after an upgrade?
A: No. The compatibility setting restores default values only for settings tracked in SettingsHistory. Some changes — including the join_algorithm default change in 24.11/24.12 — were initially not tracked there, meaning compatibility did not protect against them until a follow-up fix. Always benchmark after upgrading even when compatibility is set.

Q: Can I roll back to the previous version if things go wrong?
A: Rollback is only possible if (a) the previous version is within one year of the new version and (b) no new features or data formats from the upgraded version have been written to disk. Starting with ClickHouse 25.11, the new String serialization format and the advanced JSON shared data format (25.12) permanently block downgrade once they are activated. Set string_serialization_version = 'single_stream' before upgrading to 25.11 if you need to preserve rollback capability.

Q: The new query analyzer is the default since 24.3 — should I disable it?
A: For most workloads, the new analyzer (enable_analyzer=1) is better. Disable it only for specific queries where you observe a regression, using SETTINGS enable_analyzer = 0 on those queries. The new analyzer has known limitations: mutations (ALTER UPDATE/DELETE) still use the old analyzer, and Annoy and Hypothesis indexes are not yet supported.

Q: allow_experimental_analyzer and enable_analyzer — are these the same setting?
A: Yes. allow_experimental_analyzer was the original name when the feature was experimental. It was renamed to enable_analyzer around version 24.3. The old name continues to work as an alias, so both SET allow_experimental_analyzer = 0 and SET enable_analyzer = 0 are equivalent.

Q: How do I upgrade a replicated cluster without downtime?
A: Upgrade one replica per shard at a time, keeping at least one replica per shard on the previous version until the new one is verified healthy. ClickHouse maintains a one-year compatibility window between cluster nodes. Distributed queries may slow temporarily during the mixed-version state, and ReplicatedMergeTree background operations may log retryable errors until all nodes are on the same version.

Q: My output_format_json_quote_64bit_integers behavior changed after upgrading to 25.8 — is this expected?
A: Yes. In 25.8, the default changed from 1 (64-bit integers quoted as strings in JSON output) to 0 (output as JSON numbers). If downstream consumers parse ClickHouse JSON output and expect quoted integers, explicitly set output_format_json_quote_64bit_integers = 1 to restore the previous behavior.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.