ClickHouse 'Possible Deadlock Avoided' Error: Client Retry Logic and Recovery Strategies

ClickHouse raises DB::Exception: ... locking attempt on '<table>' has timed out! Possible deadlock avoided. Client should retry (error code 473, DEADLOCK_AVOIDED) when a query cannot acquire a table lock within the configured timeout. The message is deliberate: rather than risk hanging forever in a real deadlock, ClickHouse aborts the lock attempt and tells the client to try again.

This is one of the few ClickHouse errors where the recommended fix is literally in the message — retry the query. But a steady stream of these errors signals real lock contention that retries alone will not solve. This guide explains what the lock is protecting, how to retry correctly, how to tune the timeout, and how to remove the contention at its source.

What the Error Actually Means

ClickHouse uses a read/write lock (RWLock) per table to coordinate operations that change table structure or metadata against operations that read it. A query that needs the lock waits up to lock_acquire_timeout (default 120 seconds, hence the familiar 120000ms in the message). If it still cannot get the lock, ClickHouse throws DEADLOCK_AVOIDED instead of waiting indefinitely.

Typical message forms:

DB::Exception: WRITE locking attempt on 'mydb.mytable' has timed out!
(120000ms) Possible deadlock avoided. Client should retry. (DEADLOCK_AVOIDED)

DB::Exception: READ locking attempt on 'mydb.mytable' has timed out!
(120000ms) Possible deadlock avoided. Client should retry. (DEADLOCK_AVOIDED)

WRITE locking means a structure-changing operation (such as an ALTER or TRUNCATE) could not get an exclusive lock; READ locking means a query that needs a shared lock was blocked, usually because a writer is queued ahead of it (the lock is phase-fair).

This error is not about row-level data consistency — ClickHouse does not use row locks for normal INSERT/SELECT. It is purely about table-level metadata/structure coordination. If you are chasing concurrent UPDATE/DELETE behavior instead, see ClickHouse Mutations and UPDATE/DELETE Performance.

Common Triggers

The error almost always involves a structure-changing statement contending with long-running readers (or vice versa):

  • ALTER TABLE operationsADD/DROP/MODIFY COLUMN, ATTACH/DROP/MOVE PARTITION, and TTL changes need a write lock. See Using ALTER TABLE in ClickHouse.
  • TRUNCATE TABLE and DROP TABLE on busy tables.
  • Long-running SELECTs that hold a read lock for the duration of the query, blocking a queued writer.
  • Queries against system.tables, system.columns, system.parts, etc. that touch many tables while a schema change runs — a classic historical cause.
  • Background merges/mutations competing with foreground DDL (governed by a separate timeout, see below).

A single slow SELECT holding a read lock can cause a queued ALTER to time out, and every new SELECT that arrives after the ALTER is queued can also time out — so one stuck operation can briefly cascade into many DEADLOCK_AVOIDED errors.

First Response: Retry Correctly

Because the operation aborted before doing anything, it is safe to retry. The error is transient by design. Implement retries with exponential backoff and jitter rather than tight loops, which only add to the contention:

import time, random
import clickhouse_connect

DEADLOCK_AVOIDED = 473

def run_with_retry(client, query, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.command(query)
        except Exception as e:
            # clickhouse-connect surfaces the server error code
            if getattr(e, "code", None) != DEADLOCK_AVOIDED or attempt == max_attempts - 1:
                raise
            backoff = min(2 ** attempt, 30) + random.uniform(0, 1)
            time.sleep(backoff)
    # unreachable

Guidelines for retry logic:

  • Only retry on code 473. Do not blanket-retry every exception — many other errors are not safe or useful to repeat.
  • Use backoff with jitter. Immediate retries against a held lock just re-time-out 120 seconds later.
  • Cap attempts. If a DDL statement fails 5 times, the contention is structural and needs a root-cause fix, not more retries.
  • Idempotency is mostly free here, because the statement aborts before applying. The exception is multi-step client logic — make sure the surrounding workflow can re-run safely.

For background on related transient-error handling, see Handling the ClickHouse ABORTED Error.

Diagnosing the Contention

Before tuning timeouts, find out who holds the lock. Inspect currently running queries:

SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120) AS query
FROM system.processes
ORDER BY elapsed DESC;

The longest-running query is usually the lock holder. To see what is queued or recently failed with this error, check the query log:

SELECT
    event_time,
    query_duration_ms,
    exception,
    substring(query, 1, 150) AS query
FROM system.query_log
WHERE exception_code = 473
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 50;

If a single long SELECT is the culprit, you can release the lock by cancelling it:

KILL QUERY WHERE query_id = '<offending-query-id>';

Tuning Lock Timeouts

Two settings govern how long lock acquisition waits before raising DEADLOCK_AVOIDED:

Setting Scope Default What it controls
lock_acquire_timeout Session/query setting 120 (seconds) How long a query waits to acquire each table lock before failing.
lock_acquire_timeout_for_background_operations MergeTree table setting 120 (seconds) How long background operations (merges, mutations) wait for table locks.

You can raise lock_acquire_timeout for a specific statement that legitimately needs to wait through a long reader:

ALTER TABLE events DROP PARTITION '202401'
SETTINGS lock_acquire_timeout = 300;

Or set it at the session level for a maintenance job:

SET lock_acquire_timeout = 300;

For the table-level background setting:

ALTER TABLE events
    MODIFY SETTING lock_acquire_timeout_for_background_operations = 300;

Caution: Increasing the timeout does not fix contention — it just lets the operation wait longer before giving up, and it makes a genuine deadlock take longer to surface. Treat timeout tuning as a stopgap while you address the underlying cause. Conversely, a very short timeout combined with aggressive retries can keep contention churning; prefer fixing the workload.

Root-Cause Fixes

Most persistent DEADLOCK_AVOIDED problems come down to mixing heavy DDL with heavy reads on the same table at the same time. Durable fixes:

  1. Schedule DDL during quiet windows. Run ALTER, TRUNCATE, and partition operations when long analytical queries are not active.
  2. Shorten the lock-holding queries. A SELECT that runs for minutes holds its read lock for minutes. Optimize it (better primary key, narrower scans, projections) so it releases the lock sooner.
  3. Avoid broad system.* scans during schema changes. Monitoring that queries system.tables/system.columns across every table can collide with a concurrent ALTER; scope those queries or stagger them.
  4. Upgrade ClickHouse. Lock handling has improved substantially over time — many table-level locks for ALTERs on MergeTree tables were removed or downgraded to shared locks in later releases, making these timeouts far rarer. Older releases are much more prone to this error than current 24.x/25.x.
  5. Confirm you are on the Atomic database engine. Atomic (the default for new databases) enables lock-free DROP and reduces metadata-lock contention. Check with SELECT name, engine FROM system.databases;.

If the operation that times out is specifically a stuck ALTER ... MODIFY COLUMN, the lock may be a symptom of a long-running mutation rather than a concurrent reader — see Troubleshooting a Stuck ALTER MODIFY COLUMN.

Best Practices

  • Always wrap DDL and ingestion in retry-with-backoff logic. Even on a healthy cluster, brief contention happens; transient DEADLOCK_AVOIDED should be invisible to your application.
  • Retry only error 473, with capped attempts and jittered exponential backoff.
  • Monitor system.query_log for exception_code = 473. A rising count is an early warning of growing contention, not a cosmetic issue.
  • Keep lock-holding queries short and run structural changes off-peak.
  • Tune lock_acquire_timeout deliberately, as a temporary measure, not as the primary fix.
  • Stay current on ClickHouse versions to benefit from reduced locking.

Common Issues

  • Retries that never succeed. If every retry fails, a long-running query or stuck mutation is permanently holding the lock. Find it in system.processes / system.mutations and KILL it rather than retrying forever.
  • Blanket retrying all exceptions. Retrying non-473 errors hides real failures (syntax errors, type mismatches, out-of-memory) and wastes resources. Filter on the error code.
  • Raising the timeout and calling it fixed. A higher timeout converts fast failures into slow ones; the contention is still there. Pair any timeout change with a root-cause investigation.
  • Background merges timing out. If you see lock timeouts in system.text_log/server logs from merges or mutations, tune lock_acquire_timeout_for_background_operations and reduce concurrent foreground DDL.

How Pulse Helps

Pulse continuously monitors ClickHouse clusters and surfaces lock contention before it turns into a wall of DEADLOCK_AVOIDED errors. It correlates DEADLOCK_AVOIDED occurrences in the query log with the specific long-running queries, DDL statements, and background mutations holding the locks, so you can see why the timeouts happen rather than just that they happened. Pulse also flags clusters running older versions with known aggressive locking behavior and recommends safe timeout and scheduling adjustments — turning a recurring, hard-to-reproduce production incident into an actionable, root-caused finding.

Frequently Asked Questions

Q: Is it safe to automatically retry on "Possible deadlock avoided"?

Yes. The operation aborts before making any changes, so retrying a single statement is safe. Use capped, jittered exponential backoff and retry only on error code 473. Make sure any multi-step workflow around the statement is also safe to re-run.

Q: What is the default lock timeout, and where does "120000ms" come from?

The default lock_acquire_timeout is 120 seconds, which is why the message reports 120000ms. After waiting that long for a table lock, ClickHouse raises DEADLOCK_AVOIDED (error code 473) instead of blocking indefinitely.

Q: Should I just increase lock_acquire_timeout to make the error go away?

Only as a stopgap. A higher timeout lets operations wait longer through legitimate contention, but it does not remove the contention and makes a genuine deadlock take longer to surface. Fix the underlying cause — shorten lock-holding queries and schedule DDL off-peak.

Q: Does this error mean my data is corrupted or my write was lost?

No. It is a table-level locking timeout, not a data error. The aborted operation made no changes. Nothing was written or partially applied, which is exactly why retrying is safe.

Q: Why does a read-only SELECT get this error?

ClickHouse's read/write lock is phase-fair: if a writer (for example an ALTER) is queued, new readers wait behind it. A queued SELECT can therefore time out with READ locking attempt ... Possible deadlock avoided even though it is not itself modifying anything.

Q: How do I find what is holding the lock?

Query system.processes ordered by elapsed to find the longest-running query (usually the lock holder), and check system.query_log for exception_code = 473 to see what is timing out. If a single query is the cause, release the lock with KILL QUERY WHERE query_id = '...'.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.