How to Restart Kafka: Safe Rolling Restart Procedure

Restarting Kafka the right way matters more than most operators realize. A clean restart of a single broker takes seconds and is invisible to clients. A bad restart - too fast, too many at once, or without checking ISR - can cause unavailable partitions, consumer rebalancing storms, and in the worst case, data loss. This guide walks through the production-safe procedure.

Quick Answer: Restart a Single Broker

# Graceful shutdown
sudo systemctl stop kafka

# Wait for it to fully stop (the JVM flushes logs on shutdown)
# Then start it back up
sudo systemctl start kafka

If you installed Kafka from the Apache tarball without systemd:

/opt/kafka/bin/kafka-server-stop.sh
# wait
/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties

For a single broker in a dev environment, that's all there is to it. For a production cluster, never restart more than one broker at a time, and follow the rolling restart procedure below.

Why Restarts Need Care

A Kafka broker that's stopped takes its leader replicas with it. Until the controller elects new leaders, those partitions can't accept writes. Until follower replicas catch up after the restart, they're under-replicated. During this window:

Producers with acks=all may stall if min.insync.replicas can't be satisfied.
Consumer groups may rebalance, pausing processing.
A second broker failure during the window can cause data loss.

The whole point of the rolling restart procedure is to keep that window short and predictable, and to have only one broker in it at a time.
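The arithmetic behind the first bullet can be sketched in a few lines of shell. `writes_allowed` is a hypothetical helper, not a Kafka tool, and it assumes that every replica on a running broker is in sync:

```shell
# Hypothetical helper: decides whether acks=all producers can still write
# to a partition, given its replication factor, min.insync.replicas, and
# how many of its replica brokers are currently down.
# Assumes every replica on a running broker is in sync.
writes_allowed() {
  rf="$1"; min_isr="$2"; down="$3"
  in_sync=$((rf - down))
  if [ "$in_sync" -ge "$min_isr" ]; then
    echo "ok: $in_sync in-sync replicas >= min.insync.replicas=$min_isr"
  else
    echo "stalled: $in_sync in-sync replicas < min.insync.replicas=$min_isr"
    return 1
  fi
}

writes_allowed 3 2 1           # one broker restarting: writes continue
writes_allowed 3 2 2 || true   # two brokers down: acks=all writes stall
```

With replication factor 3 and min.insync.replicas=2 there is room for exactly one broker to be out, which is why the procedure never overlaps two restarts.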

Graceful Shutdown: What's Actually Happening

When you stop a broker, the JVM's shutdown hook catches SIGTERM and runs the controlled shutdown sequence:

Migrates leadership for partitions where this broker is the leader to other in-sync replicas.
Flushes all log segments to disk.
Closes file descriptors and the network listener.
Exits.

The relevant settings:

Setting	Purpose	Recommended
`controlled.shutdown.enable`	Enables the graceful path	`true` (default)
`controlled.shutdown.max.retries`	Retries if leader migration fails	`3` (default)
`controlled.shutdown.retry.backoff.ms`	Wait between retries	`5000` (default)

If controlled.shutdown.enable=false, the broker dies without migrating leadership. Don't do that in production.
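One way to guard against that before a restart is to scan the broker's properties file. The function below is a sketch; `check_controlled_shutdown` is a hypothetical name, and it only looks at the file it is given, not at any dynamic overrides:

```shell
# Hypothetical pre-restart check: fail if a broker properties file
# explicitly disables controlled shutdown. A missing key is fine because
# the default is true. If the key appears more than once, the last one wins.
check_controlled_shutdown() {
  props="$1"
  [ -r "$props" ] || { echo "cannot read $props" >&2; return 2; }
  value="$(grep -E '^[[:space:]]*controlled\.shutdown\.enable[[:space:]]*=' "$props" \
            | tail -n 1 | cut -d= -f2 | tr -d '[:space:]' || true)"
  if [ "$value" = "false" ]; then
    echo "controlled.shutdown.enable=false in $props" >&2
    return 1
  fi
  echo "controlled shutdown enabled (${value:-default})"
}
```

A typical call would be `check_controlled_shutdown /opt/kafka/config/server.properties`, run on the broker host before stopping the service.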

Production-Safe Rolling Restart

This is the procedure to follow when restarting a multi-broker cluster:

1. Verify the cluster is healthy before starting

Under-replicated partitions, offline partitions, and active reassignments must all be zero:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --list

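If either of the first two commands prints anything, or the third lists an active reassignment (with none in flight it prints only a "No partition reassignments found." message), fix the underlying issue first. Don't start a rolling restart on a sick cluster.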
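The three checks can be wrapped into a single pass/fail gate. This is a sketch: `preflight_check`, `KAFKA_TOPICS` and `KAFKA_REASSIGN` are hypothetical names, and the match on the "No partition reassignments found." message is an assumption about the tool's output that is worth confirming on your Kafka version:

```shell
# Tool names are overridable so the sketch can be pointed at a full path.
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}"
KAFKA_REASSIGN="${KAFKA_REASSIGN:-kafka-reassign-partitions.sh}"

# Hypothetical helper: returns 0 only if all three checks come back clean,
# 1 if the cluster is unhealthy, 2 if a check could not be run at all.
preflight_check() {
  bootstrap="$1"
  urp="$("$KAFKA_TOPICS" --bootstrap-server "$bootstrap" --describe --under-replicated-partitions)" || return 2
  offline="$("$KAFKA_TOPICS" --bootstrap-server "$bootstrap" --describe --unavailable-partitions)" || return 2
  reassign="$("$KAFKA_REASSIGN" --bootstrap-server "$bootstrap" --list)" || return 2

  status=0
  [ -z "$urp" ] || { echo "under-replicated partitions found" >&2; status=1; }
  [ -z "$offline" ] || { echo "offline partitions found" >&2; status=1; }
  # Assumption: with nothing in flight, --list prints a single
  # "No partition reassignments found." line; anything else counts as active.
  case "$reassign" in
    ""|"No partition reassignments found."*) ;;
    *) echo "active reassignments found" >&2; status=1 ;;
  esac
  return "$status"
}
```

Calling `preflight_check localhost:9092 || exit 1` at the top of a restart script stops the run before any broker is touched.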

2. Pick a broker, restart it

# On the chosen broker
sudo systemctl stop kafka

# Wait for the process to fully exit
while pgrep -f kafka.Kafka > /dev/null; do sleep 1; done

# Start back up
sudo systemctl start kafka

3. Wait for the broker to fully rejoin the ISR

After the broker is back, its replicas need to catch up before you move on:

# Check that under-replicated partitions are back to 0
kafka-topics.sh --bootstrap-server any-other-broker:9092 \
  --describe --under-replicated-partitions

The time this takes depends on how long the broker was down and how much data was written in the meantime. On a busy cluster with a long downtime, this can take many minutes. Do not skip this check.
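Rather than re-running the check by hand, a script can poll it with a timeout. The sketch below uses hypothetical names (`wait_for_isr`, `KAFKA_TOPICS`) and arbitrary default timings; tune both to your cluster:

```shell
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}"

# Hypothetical helper: poll until no under-replicated partitions are
# reported, or give up after timeout_s seconds.
# Arguments: <bootstrap> [timeout_s] [interval_s]
wait_for_isr() {
  bootstrap="$1"; timeout_s="${2:-600}"; interval_s="${3:-5}"
  waited=0
  while :; do
    # A failed query (for example, the broker is unreachable) counts as
    # "not ready yet" rather than as an empty, healthy result.
    if urp="$("$KAFKA_TOPICS" --bootstrap-server "$bootstrap" --describe --under-replicated-partitions)" \
       && [ -z "$urp" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s waiting for ISR to recover" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

As in the manual check, point the bootstrap argument at a broker other than the one just restarted, for example `wait_for_isr any-other-broker:9092 900`.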

4. (Optional) Rebalance preferred leaders

Leadership doesn't move back immediately when a broker rejoins; the partitions stay on whatever broker took over until a preferred leader election runs. To restore the original leader distribution right away:

kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions

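This is safe and triggers a fast leader election back to the preferred broker. Run it after each rolling restart, or rely on auto.leader.rebalance.enable=true (the default), which has the controller check for imbalance every leader.imbalance.check.interval.seconds (300 by default) and move leadership back on its own.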

5. Repeat for the next broker

Only after ISR is fully recovered. One broker at a time. Always.
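Put together, the loop looks roughly like the sketch below. Everything named here is hypothetical: `cluster_healthy` and `restart_broker` are hooks to replace with whatever fits your environment, the ssh form assumes password-less ssh and sudo on each host, and `BOOTSTRAP`, `MAX_POLLS` and `POLL_S` are illustrative knobs:

```shell
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Hook: healthy means the query succeeded AND reported nothing.
cluster_healthy() {
  out="$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --under-replicated-partitions)" \
    && [ -z "$out" ]
}

# Hook: stop and start the broker service on one host.
restart_broker() {
  ssh "$1" 'sudo systemctl stop kafka && sudo systemctl start kafka'
}

# Restart the given hosts strictly one at a time; stop at the first problem.
rolling_restart() {
  max_polls="${MAX_POLLS:-120}"; poll_s="${POLL_S:-5}"
  for host in "$@"; do
    if ! cluster_healthy; then
      echo "cluster not healthy before $host; aborting" >&2
      return 1
    fi
    echo "restarting $host"
    restart_broker "$host" || { echo "restart failed on $host" >&2; return 1; }
    polls=0
    until cluster_healthy; do
      polls=$((polls + 1))
      if [ "$polls" -ge "$max_polls" ]; then
        echo "ISR did not recover after $host; aborting" >&2
        return 1
      fi
      sleep "$poll_s"
    done
    echo "$host back in sync"
  done
}
```

The function aborts rather than skipping ahead, so a broker that fails to catch up leaves the rest of the cluster untouched. `BOOTSTRAP` should name a broker that stays up for the whole run, or a list of several.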

Restarting the Controller

In KRaft mode, the controllers are separate nodes (or co-located with brokers in combined mode). Restart non-active controllers first, then the active one last. You can identify the active controller with:

kafka-metadata-quorum.sh --bootstrap-controller localhost:9093 describe --status

The output lists the LeaderId. Restart the others first so the active controller has somewhere to fail over to.
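The ordering rule can be scripted. The sketch below assumes the status output contains a line of the form `LeaderId: <id>`; both function names are hypothetical, and the controller IDs are passed in by hand rather than parsed from the voter list:

```shell
# Extract the active controller's node ID from
# `kafka-metadata-quorum.sh ... describe --status` output read on stdin.
active_controller_id() {
  awk '$1 == "LeaderId:" { print $2; exit }'
}

# Print controller IDs in restart order: every non-active controller first,
# the active one last. Arguments: <leader-id> <controller-id>...
controller_restart_order() {
  leader="$1"; shift
  for id in "$@"; do
    [ "$id" = "$leader" ] || echo "$id"
  done
  echo "$leader"
}

# Example wiring (controller IDs 3000-3002 are placeholders):
#   leader="$(kafka-metadata-quorum.sh --bootstrap-controller localhost:9093 \
#               describe --status | active_controller_id)"
#   controller_restart_order "$leader" 3000 3001 3002
```

Re-check the leader after each controller restart; the order only holds as long as leadership has not moved in the meantime.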

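In legacy ZooKeeper mode, if the ZooKeeper ensemble itself needs a restart, do the ZooKeeper followers first and the ZooKeeper leader last. Then proceed with the Kafka brokers, leaving the broker that currently acts as controller until last.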

Restarting for a Config Change

Many Kafka settings are dynamic and don't require a restart at all:

# log.retention.ms can be changed at runtime; log.retention.hours is read-only
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 1 \
  --alter --add-config log.retention.ms=604800000

Check the Kafka documentation for which configs are read-only vs cluster-wide vs per-broker dynamic. Restart only when you actually have to.
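Retention is a common case where the familiar setting (log.retention.hours) is read-only while its millisecond form (log.retention.ms) can be changed at runtime. A small wrapper can do the conversion and then show the result; `set_retention_hours` and `KAFKA_CONFIGS` are hypothetical names in this sketch:

```shell
KAFKA_CONFIGS="${KAFKA_CONFIGS:-kafka-configs.sh}"

# Hypothetical helper: set log retention on one broker at runtime, then
# print that broker's dynamic configs so the change can be confirmed.
# Arguments: <bootstrap> <broker-id> <hours>
set_retention_hours() {
  bootstrap="$1"; broker_id="$2"; hours="$3"
  retention_ms=$((hours * 60 * 60 * 1000))
  "$KAFKA_CONFIGS" --bootstrap-server "$bootstrap" \
    --entity-type brokers --entity-name "$broker_id" \
    --alter --add-config "log.retention.ms=$retention_ms" || return 1
  "$KAFKA_CONFIGS" --bootstrap-server "$bootstrap" \
    --entity-type brokers --entity-name "$broker_id" --describe
}
```

For example, `set_retention_hours localhost:9092 1 168` sets seven days (604800000 ms) on broker 1. To apply a value as the default for all brokers instead, kafka-configs.sh accepts `--entity-default` in place of `--entity-name`.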

Common Mistakes

Restarting multiple brokers in parallel. With replication factor 3 and min.insync.replicas=2, losing two brokers at once leaves some partitions with a single in-sync replica, so acks=all writes to them block. Restart one at a time, always.
Skipping the ISR check. If you restart the next broker before the previous one is fully back in sync, you can end up below min.insync.replicas and stall writes.
kill -9 on a broker. Skips controlled shutdown, leaves leadership on the dead broker until the controller times out, and may leave log segments unflushed. Use SIGTERM (the default for systemctl stop).
Restarting during a partition reassignment. Reassignments are expensive replications. Restarting a broker that is part of one stalls the data movement until the broker is back and stacks catch-up traffic on top of it. Wait for the reassignment to finish.
Forgetting preferred leader election. After several rolling restarts, leadership drifts to whichever brokers took over each time. You end up with hotspot brokers handling more partitions than others.
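To see whether leadership has drifted, count leaders per broker from the describe output. This sketch assumes each partition line carries a `Leader: <id>` pair; `leaders_per_broker` is a hypothetical name:

```shell
# Count partition leaders per broker from `kafka-topics.sh --describe`
# output on stdin. Prints one "<broker-id> <leader-count>" line per broker.
leaders_per_broker() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "Leader:") { count[$(i + 1)]++; break }
  }
  END { for (b in count) print b, count[b] }' | sort -n
}

# Example wiring:
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe | leaders_per_broker
```

A broker with a count far above its peers is a candidate hotspot; a preferred leader election should bring the counts back in line.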

Monitoring a Rolling Restart

Watch these metrics during and after each broker restart:

Under-replicated partitions (target: 0 in steady state, briefly non-zero during restart)
Offline partitions (target: always 0)
ISR shrinks/expands (transient spikes during restart are expected)
Active controller count (must be exactly 1 across the cluster)
Produce/consume request rates (should recover within seconds)
Consumer group lag for critical consumers
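Consumer lag is the one item on this list that is easy to read from the stock CLI. The sketch below totals the LAG column of the consumer-groups describe output; `total_lag` is a hypothetical name, and the column is located by its header so the function does not depend on a fixed position:

```shell
# Sum the LAG column of `kafka-consumer-groups.sh --describe --group <g>`
# output on stdin. Rows whose lag is not a number (for example "-") are
# skipped. Prints 0 if no LAG header is found.
total_lag() {
  awk '
    !col { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    $col ~ /^[0-9]+$/ { sum += $col }
    END { printf "%.0f\n", sum }'
}

# Example wiring (the group name is a placeholder):
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --describe --group my-critical-group | total_lag
```

Sampling this before the restart and again after the broker is back in sync gives a quick check that the group has caught up.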

Pulse tracks all of these in real time and will flag any anomaly during a rolling restart - including stuck reassignments, ISR shrinks that don't recover, and leadership imbalance.

Frequently Asked Questions

Q: How long does a Kafka broker restart take?
A: A clean restart on a healthy broker is typically 30 seconds to a few minutes - JVM startup plus log segment recovery on the partitions this broker hosts. Brokers that were down for a long time take longer because their replicas have to catch up to the leaders. Brokers with millions of log segments can spend additional minutes on segment recovery at startup.

Q: Can I restart Kafka without downtime?
A: Yes, if every topic has replication factor >= 2, min.insync.replicas is lower than the replication factor, and you follow the rolling restart procedure. Producers and consumers experience a brief blip during leader election (sub-second on a healthy cluster) but no downtime.

Q: What's the difference between kafka-server-stop.sh and kill?
A: kafka-server-stop.sh sends SIGTERM to the broker JVM, which triggers controlled shutdown (leadership migration, log flushing). kill -9 sends SIGKILL, which the JVM cannot trap, skipping all of that. Always use the script or systemctl stop.

Q: How do I restart Kafka on Docker / Kubernetes?
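A: On Docker: docker restart <container> - the container's init system handles SIGTERM properly if you used the official images. Docker sends SIGKILL once the grace period expires (10 seconds by default), so give controlled shutdown more room with a longer one, for example docker restart -t 120 <container>. On Kubernetes with Strimzi: run kubectl annotate pod <broker-pod> strimzi.io/manual-rolling-update=true and let the operator orchestrate the roll; other operators, such as Confluent's, have their own restart mechanisms. For raw StatefulSets, delete one pod at a time and wait for it to rejoin the ISR before deleting the next - but operators are strongly recommended for production.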

Q: How do I restart Kafka on AWS MSK?
A: You can't directly. MSK only exposes broker reboots through its API, and they happen one broker at a time automatically. Use aws kafka reboot-broker --cluster-arn <arn> --broker-ids <id>. For config changes, MSK applies them via a managed rolling restart when you update the cluster configuration.

Q: Should I restart Kafka after every config change?
A: No. Many broker configs are dynamic. Use kafka-configs.sh --alter to update them at runtime. Only read-only configs (for example log.dirs or the broker ID) and anything passed as JVM arguments require a restart.

Q: My broker won't shut down cleanly. What now?
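A: Check server.log for Controlled shutdown failed messages - usually it means the controller is unreachable or partitions can't be migrated. Wait a few minutes for retries to complete. If the broker keeps giving up too early, raise controlled.shutdown.max.retries or controlled.shutdown.retry.backoff.ms (there is no controlled.shutdown.timeout.ms broker setting). Once the retries are exhausted, the broker proceeds with a non-controlled shutdown on its own. Keep kill -9 as a last resort for a JVM that is genuinely hung, and expect log recovery on the next start.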