ClickHouse on Kubernetes is non-trivial in ways that stateless workloads are not. ClickHouse is IO-bound, merge-heavy, and relies on stable node identity for replication coordination. Kubernetes was designed around disposable pods and network-attached storage - assumptions that conflict with what ClickHouse wants from its infrastructure. You can make it work well, but the defaults will fight you.
There are two operators to consider:
- Altinity clickhouse-operator — the older, battle-tested option with wide production adoption. It uses ClickHouseInstallation (CHI) and ClickHouseKeeperInstallation (CHK) CRDs and powers Altinity's commercial cloud platform.
- ClickHouse Kubernetes Operator — the newer operator maintained by ClickHouse Inc, released under Apache 2.0. It uses ClickHouseCluster and KeeperCluster CRDs, defaults to DatabaseReplicated (which eliminates ON CLUSTER clauses), and takes a thin-layer design that delegates complex logic to ClickHouse's C++ internals.
Both are viable. The Altinity operator has a longer track record and more documented production deployments; the ClickHouse Inc operator is the strategic upstream choice if you want to stay close to the project. This article covers both.
The Altinity Operator: What It Does and How to Install It
The operator watches for ClickHouseInstallation (CHI) and ClickHouseKeeperInstallation (CHK) custom resources, then reconciles the cluster state - creating StatefulSets, headless Services, ConfigMaps, and PersistentVolumeClaims. When you update a CHI spec, the operator performs rolling restarts across shards and replicas in the correct order. Without the operator, you would manage all of this by hand across multiple StatefulSets, which becomes error-prone quickly.
Installation is a single kubectl apply:
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
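Before defining any clusters, confirm the operator is healthy. A quick check, assuming the bundle's default namespace:

# The install bundle deploys the operator into kube-system by default;
# adjust the namespace if you installed it elsewhere.
kubectl -n kube-system get pods | grep clickhouse-operator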
Once the operator pod is running, you define your cluster with a ClickHouseInstallation resource. Here is a real example for a two-shard, two-replica cluster with persistent storage and pod anti-affinity:
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: ch-prod
  namespace: clickhouse
spec:
  configuration:
    clusters:
      - name: prod
        layout:
          shardsCount: 2
          replicasCount: 2
    zookeeper:
      nodes:
        - host: clickhouse-keeper-0.clickhouse-keeper.clickhouse.svc.cluster.local
          port: 9181
        - host: clickhouse-keeper-1.clickhouse-keeper.clickhouse.svc.cluster.local
          port: 9181
        - host: clickhouse-keeper-2.clickhouse-keeper.clickhouse.svc.cluster.local
          port: 9181
  defaults:
    templates:
      podTemplate: clickhouse-pod-template
      dataVolumeClaimTemplate: data-volume-template
      logVolumeClaimTemplate: log-volume-template
  templates:
    podTemplates:
      - name: clickhouse-pod-template
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: clickhouse.altinity.com/chi
                        operator: In
                        values: ["ch-prod"]
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:24.8
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                limits:
                  cpu: "8"
                  memory: "32Gi"
    volumeClaimTemplates:
      - name: data-volume-template
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-nvme
          resources:
            requests:
              storage: 500Gi
      - name: log-volume-template
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: standard
          resources:
            requests:
              storage: 10Gi
The operator translates this into four StatefulSets (one per replica per shard), each with its own PVC, headless Service, and ClickHouse configuration XML. The zookeeper block is how the operator knows which Keeper ensemble to point ClickHouse at - without this, replicated tables cannot commit inserts.
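As a quick sanity check that replication is wired up, you can create a replicated table against the cluster the operator just defined. A minimal sketch - the table name is hypothetical, and it assumes the {shard} and {replica} macros the operator renders into each pod's configuration:

-- Hypothetical table to verify replication on the 'prod' cluster from the CHI above.
-- {shard} and {replica} are macros generated per pod by the operator.
CREATE TABLE default.events ON CLUSTER 'prod'
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/events', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);

Inserts into one replica of a shard should appear on its peer within seconds; system.replicas shows per-table replication health.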
The ClickHouse Inc Operator
ClickHouse Inc's operator is installed via kubectl, Helm, or the Operator Lifecycle Manager (OLM). It requires cert-manager for webhook TLS, so install that first:
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
# Install the ClickHouse operator via Helm
helm repo add clickhouse-operator https://charts.clickhouse.com
helm install clickhouse-operator clickhouse-operator/clickhouse-operator --namespace clickhouse-operator --create-namespace
The operator manages two CRDs: ClickHouseCluster for the database nodes and KeeperCluster for the coordination layer. A minimal replicated cluster looks like:
apiVersion: clickhouse.com/v1alpha1
kind: KeeperCluster
metadata:
  name: keeper
  namespace: clickhouse
spec:
  replicas: 3
---
apiVersion: clickhouse.com/v1alpha1
kind: ClickHouseCluster
metadata:
  name: chi
  namespace: clickhouse
spec:
  replicas: 2
  shards: 2
  keeper:
    name: keeper
A notable design difference from the Altinity operator: ClickHouseCluster defaults to DatabaseReplicated as the database engine. This means tables created inside a DatabaseReplicated database replicate automatically without requiring ON CLUSTER in every DDL statement - a significant ergonomic improvement for multi-shard setups. The trade-off is that DatabaseReplicated is newer and has edge cases that ReplicatedMergeTree with explicit ON CLUSTER does not.
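To make the difference concrete, here is a sketch of DDL under a Replicated database - no ON CLUSTER clause anywhere. Database and table names are illustrative, and the CREATE DATABASE step is only needed if the operator has not already created a Replicated database for you:

-- With the Replicated database engine, DDL propagates to all replicas automatically.
CREATE DATABASE analytics ENGINE = Replicated('/clickhouse/databases/analytics', '{shard}', '{replica}');

CREATE TABLE analytics.events
(
    event_time DateTime,
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree
ORDER BY (user_id, event_time);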
The ClickHouse Inc operator creates one StatefulSet per replica, enabling independent version and configuration management across replicas - useful for staged rolling upgrades. TLS for inter-node and client communication is natively supported through cert-manager integration.
Which operator to choose: If you are starting fresh and comfortable with newer tooling, the ClickHouse Inc operator is the natural upstream choice and will see the most active development. If you are operating an existing cluster, migrating between operators is non-trivial and typically not worth it unless you have specific reasons. Both operators support ClickHouse Keeper, Prometheus metrics export, and persistent volume claim templates.
ClickHouse Keeper on Kubernetes
ClickHouse Keeper is ClickHouse's built-in replacement for ZooKeeper, implementing the same client protocol over a Raft-based consensus layer. For new deployments, Keeper is the right choice - it removes the ZooKeeper operational dependency and has lower latency for the small, frequent writes that ClickHouse replication generates.
Keeper must run as a separate StatefulSet. Running Keeper co-located inside ClickHouse pods creates a circular dependency: ClickHouse cannot initialize replicated tables until Keeper reaches quorum, but Keeper nodes embedded in ClickHouse pods may not start in a deterministic order. Keep them completely separate. The minimum production ensemble is three nodes. With only two, a single node failure or a network partition loses the Raft majority, which halts all replication commits across the entire ClickHouse cluster.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-keeper
  namespace: clickhouse
spec:
  serviceName: clickhouse-keeper
  replicas: 3
  selector:
    matchLabels:
      app: clickhouse-keeper
  template:
    metadata:
      labels:
        app: clickhouse-keeper
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: clickhouse-keeper
              topologyKey: kubernetes.io/hostname
      containers:
        - name: clickhouse-keeper
          image: clickhouse/clickhouse-keeper:24.8
          ports:
            - containerPort: 9181
            - containerPort: 9234
          volumeMounts:
            - name: keeper-data
              mountPath: /var/lib/clickhouse
  volumeClaimTemplates:
    - metadata:
        name: keeper-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 20Gi
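The serviceName above must correspond to a headless Service, which is what gives each Keeper pod the stable DNS name referenced in the CHI zookeeper block. A minimal sketch:

# Headless Service so each pod resolves as
# clickhouse-keeper-<ordinal>.clickhouse-keeper.clickhouse.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: clickhouse-keeper
  namespace: clickhouse
spec:
  clusterIP: None
  selector:
    app: clickhouse-keeper
  ports:
    - name: client
      port: 9181
    - name: raft
      port: 9234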
The Keeper server_id for each pod comes from an init container or a config rendered using the pod ordinal. Without the correct server_id assigned per ordinal, Raft peer discovery fails on pod restart. The Altinity operator's ClickHouseKeeperInstallation CRD handles this automatically if you use it instead of a bare StatefulSet.
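With a bare StatefulSet, one workable pattern is an init container that derives the ID from the pod name's ordinal and writes a small config fragment into a shared emptyDir (named keeper-config-d here) that the Keeper container mounts at a path its configuration includes. A sketch - paths and image are illustrative:

# Sketch: derive server_id from the StatefulSet ordinal (clickhouse-keeper-0 -> 1, etc.)
initContainers:
  - name: set-server-id
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        ordinal="$(hostname)"; ordinal="${ordinal##*-}"
        cat > /etc/clickhouse-keeper/keeper_config.d/server-id.xml <<EOF
        <clickhouse>
          <keeper_server>
            <server_id>$((ordinal + 1))</server_id>
          </keeper_server>
        </clickhouse>
        EOF
    volumeMounts:
      - name: keeper-config-d
        mountPath: /etc/clickhouse-keeper/keeper_config.d

The rest of the Keeper configuration - listening ports, log paths, and a raft_configuration block listing all three servers with matching IDs - still has to be supplied via a ConfigMap; generating all of that is exactly what the CHK CRD automates.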
Keeper data is lightweight - it stores replication log metadata, not actual data parts - so 20Gi per node is typically sufficient. What matters more is low-latency storage: Keeper's Raft log needs fast fdatasyncs, and high-latency network storage here translates directly into replication latency across the whole cluster.
Storage Configuration
ClickHouse's performance is dominated by IO. The merge process - which compacts data parts in the background - is sequential, high-throughput IO, and the mark files, primary index, and skip indexes all live on the same filesystem as the column data. This is not a workload where you want network-attached storage with unpredictable tail latency.
For hot data, use local SSDs. On GKE, local NVMe storage requires the local static provisioner, which creates a StorageClass (conventionally named local-nvme or similar) — there is no local-ssd StorageClass built into GKE. On AWS, gp3 EBS volumes are the recommended alternative if local NVMe is unavailable (they support up to 1,000 MiB/s throughput and are more cost-effective than io2 for ClickHouse workloads). The key requirement is predictable sub-millisecond latency and stable throughput for concurrent reads and writes during merges. NFS and CIFS are explicitly unsupported by ClickHouse for data storage; they break atomic renames that the MergeTree storage engine relies on.
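As a concrete illustration, a gp3-backed StorageClass on EKS with provisioned throughput might look like the following. The class name, IOPS, and throughput figures are illustrative, and it assumes the AWS EBS CSI driver is installed:

# Illustrative gp3 StorageClass for ClickHouse data volumes (EBS CSI driver).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: clickhouse-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "10000"
  throughput: "700"   # MiB/s; gp3 tops out at 1,000
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

WaitForFirstConsumer delays volume creation until the pod is scheduled, which keeps each volume in the same availability zone as the node that will use it.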
For cold or archival data, ClickHouse's tiered storage (configured through a storage_configuration block of disks and storage policies in config.xml) can offload older parts to S3 or GCS while keeping recent data on local disk. This works well in Kubernetes when you use ClickHouse's native S3 disk type with per-pod IAM credentials via workload identity.
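A sketch of that configuration - the disk name, endpoint, and policy name are placeholders, and use_environment_credentials lets the S3 disk pick up the pod's workload-identity credentials instead of static keys:

<!-- config.d/storage.xml: S3-backed cold disk plus a hot/cold storage policy -->
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_cold>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/my-bucket/clickhouse/</endpoint>
                <use_environment_credentials>true</use_environment_credentials>
            </s3_cold>
        </disks>
        <policies>
            <hot_cold>
                <volumes>
                    <hot>
                        <disk>default</disk>
                    </hot>
                    <cold>
                        <disk>s3_cold</disk>
                    </cold>
                </volumes>
                <move_factor>0.1</move_factor>
            </hot_cold>
        </policies>
    </storage_configuration>
</clickhouse>

Tables opt in with SETTINGS storage_policy = 'hot_cold', and a TTL ... TO VOLUME 'cold' clause moves parts off local disk as they age.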
Sizing PVCs correctly upfront matters. ClickHouse replication does not share storage - each replica holds its own full copy of all data. A cluster with two replicas doubles your storage cost compared to a single node. Plan for peak storage plus headroom for the parts that accumulate before merges complete; parts_to_delay_insert kicks in well before disk is full.
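Both disk headroom and part accumulation are visible from inside ClickHouse; an illustrative pair of queries:

-- Free space per configured disk
SELECT name, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;

-- Active part counts per partition; compare against parts_to_delay_insert
SELECT database, table, partition_id, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 10;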
Production Pitfalls
The anti-affinity rule in the CHI spec above is not optional. Without requiredDuringSchedulingIgnoredDuringExecution with topologyKey: kubernetes.io/hostname, Kubernetes may co-locate two replicas of the same shard on the same node. When that node drains or crashes, you lose both replicas simultaneously, turning a hardware failure into a table-level outage.
Resource limits require care specific to ClickHouse. ClickHouse's memory allocator (jemalloc, tracked internally) does not perfectly align with cgroup memory limits visible to the kernel. If you set a container memory limit, configure max_server_memory_usage in config.xml to 75–80% of that limit (to leave headroom for OS page cache and background merge memory), and set max_memory_usage in users.xml to limit individual query memory. Without these settings, the OOM killer terminates the process at unpredictable moments during large merges or complex aggregations. Since ClickHouse 22.2, cgroup CPU limits are read for max_threads auto-detection, but cgroup v2 support has been incremental — verify that ClickHouse detects the correct core count for your environment and set max_threads explicitly in users.xml if it does not.
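For the 32Gi limit in the pod template above, a sketch of those two settings (byte values illustrative):

<!-- config.d/memory.xml: server-wide cap, ~75% of the 32Gi container limit -->
<clickhouse>
    <max_server_memory_usage>25769803776</max_server_memory_usage> <!-- 24 GiB -->
</clickhouse>

<!-- users.d/memory.xml: per-query ceiling in the default profile -->
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>10737418240</max_memory_usage> <!-- 10 GiB -->
        </default>
    </profiles>
</clickhouse>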
Node drain during a rolling Kubernetes upgrade is where replication correctness gets tested. If a node is drained while a replica is behind on replication (lagging parts), the replica goes offline before it catches up, and the remaining replica carries the full write load. ClickHouse will not lose data - writes to ReplicatedMergeTree require acknowledgment from the Keeper log, not from all replicas - but queries that use quorum consistency or replica-aware load balancing may return stale results until the lagging replica catches up post-reschedule. A PodDisruptionBudget with maxUnavailable: 1 per shard is the minimum safeguard; pair it with a pre-drain check that verifies replica lag is within an acceptable threshold before the node drains.
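A sketch of a per-shard PDB - the label keys follow the Altinity operator's pod-labeling convention, so confirm them against kubectl get pods --show-labels before relying on this:

# At most one replica of shard 0 may be voluntarily disrupted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ch-prod-shard-0
  namespace: clickhouse
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      clickhouse.altinity.com/chi: ch-prod
      clickhouse.altinity.com/cluster: prod
      clickhouse.altinity.com/shard: "0"

For the pre-drain lag check, something along the lines of SELECT max(absolute_delay) FROM system.replicas, run against the replicas on the node being drained, is a reasonable starting point.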
One more operational reality: ClickHouse schema changes (especially mutations like ALTER TABLE ... UPDATE or ALTER TABLE ... DELETE) run as background operations coordinated through Keeper. On Kubernetes, pods restart during upgrades, and a mutation that starts before a restart may pause and resume slowly afterward, depending on background thread availability. Monitor system.mutations for stuck or long-running mutations after any rolling restart.
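An illustrative query for spotting them:

-- Mutations still in flight, oldest first; latest_fail_reason usually
-- explains why one is stuck.
SELECT database, table, mutation_id, command, create_time,
       parts_to_do, latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time ASC;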