After years of operating Kafka at scale, the patterns that separate well-run topics from constant-incident topics come down to a handful of decisions made at design time. This guide collects the production best practices that matter most: how to name topics, how many partitions to create, how to set retention and replication, how to handle schemas, and which operational defaults keep things boring.
This isn't a Kafka 101 - it assumes you know what topics, partitions, and consumers are. If you want the foundations first, start with what is a Kafka topic.
Naming
A consistent naming convention prevents half of all "what is this topic for?" Slack threads.
Good defaults:
- Lowercase, hyphen-separated: user-events, orders-paid, clickstream-web.
- Include the data domain: payments.invoices.created or payments-invoices-created.
- Include schema version when relevant: user-events-v2. Don't include the version in v1.
- Avoid mixing . and _ in the same name (Kafka collapses them in metrics names, causing collisions).
Templates that work:
<domain>.<entity>.<event> # payments.invoice.created
<domain>-<entity>-<event> # payments-invoice-created
<environment>.<domain>.<entity> # prod.payments.invoice
Pick one. Document it. Lint topic names in CI if you can.
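A CI lint for topic names can be a few lines. The sketch below assumes the dot-separated <domain>.<entity>.<event> template; the function name lint_topic_name and the exact regular expression are illustrative choices, not a standard tool.

```python
import re

# Assumed convention: <domain>.<entity>.<event>, lowercase, dot-separated.
# Hyphens and digits are allowed inside a segment (so "created-v2" passes);
# underscores are rejected so a name can never mix "." and "_".
TOPIC_NAME = re.compile(r"[a-z][a-z0-9-]*(\.[a-z][a-z0-9-]*){2}")
FORBIDDEN_PREFIXES = ("temp-", "test-")
MAX_LENGTH = 249  # Kafka's maximum topic name length


def lint_topic_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name passes."""
    problems = []
    if not TOPIC_NAME.fullmatch(name):
        problems.append("does not match <domain>.<entity>.<event>")
    if name.startswith(FORBIDDEN_PREFIXES):
        problems.append("temp-/test- prefixes are not allowed in production")
    if len(name) > MAX_LENGTH:
        problems.append("longer than 249 characters")
    return problems


if __name__ == "__main__":
    for candidate in ["payments.invoice.created", "Payments_Invoice", "temp-debug"]:
        print(candidate, lint_topic_name(candidate) or "ok")
```

Run it against the topic definitions in your IaC repository and fail the build on any non-empty result.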
Avoid:
- temp-*, test-* in production clusters (someone will forget to delete them).
- One topic per customer or tenant (use keys, not topics, for that dimension - see how many topics can Kafka support).
- Reusing topic names for different schemas. Always create a new versioned topic when the schema changes incompatibly.
Partitions
Partition count is the single most important and most regret-prone topic decision.
Pick partition count based on:
- Peak consumer parallelism. A consumer group can have at most as many concurrent workers as the topic has partitions. Pick enough for your peak throughput needs - typically 1-2x your highest expected consumer count.
- Target throughput. Plan ~25 MB/s per partition for sustained production workloads. See Kafka throughput per partition.
- Growth headroom. You can add partitions later, but doing so changes key-to-partition assignment and breaks per-key ordering across the boundary. Add some margin up front (50-100%) so you don't have to.
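The three inputs above combine into a simple calculation. This is a rough sketch rather than a sizing formula from Kafka itself: the function name suggest_partitions is hypothetical, and the 25 MB/s and 50% defaults are the planning figures from this guide.

```python
import math


def suggest_partitions(peak_mb_per_s: float, peak_consumers: int,
                       mb_per_s_per_partition: float = 25.0,
                       headroom: float = 0.5) -> int:
    """Take the larger of the throughput and parallelism needs, then add headroom."""
    for_throughput = math.ceil(peak_mb_per_s / mb_per_s_per_partition)
    base = max(for_throughput, peak_consumers, 1)
    return math.ceil(base * (1 + headroom))


if __name__ == "__main__":
    # 200 MB/s peak, 6 consumers: throughput needs 8, plus 50% headroom.
    print(suggest_partitions(200, 6))
    # 10 MB/s peak, 4 consumers, 100% headroom: parallelism dominates.
    print(suggest_partitions(10, 4, headroom=1.0))
```

Treat the result as a starting point and round to a count that is consistent with other topics in the same domain.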
Don't:
- Set partitions blindly to 1 ("we'll scale later"). Adding partitions later is disruptive.
- Set partitions to 100+ "just in case." Each partition costs memory, file descriptors, and replication overhead. Over-partitioning hurts everyone.
- Vary partition counts wildly across topics in the same domain. It complicates routing and monitoring.
Good defaults for most workloads:
- Small / low-throughput topics: 3-6 partitions.
- Standard production topics: 12-24 partitions.
- High-throughput topics: 30-60+ partitions, sized to peak throughput.
Replication
Always RF=3 in production. Always. There is no good reason to run RF=1 (you lose data on a single broker failure) or RF=2 (a broker failure leaves you with no redundancy).
Pair it with:
replication.factor=3
min.insync.replicas=2
acks=all (on the producer)
This combination is what gives you durability. acks=all without min.insync.replicas=2 lets a single in-sync replica acknowledge writes, which is exactly the case where data loss happens. Set them together or not at all.
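To see why the settings only work together, it helps to spell out what each combination tolerates. The sketch below is a simplified model of a single partition with an acks=all producer; the function name write_tolerance and the dictionary keys are made up for illustration.

```python
def write_tolerance(replication_factor: int, min_insync_replicas: int) -> dict:
    """Summarize what an acks=all producer can tolerate for one partition."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return {
        # Replicas that can be offline while acks=all writes are still accepted.
        "replicas_down_with_writes_ok": replication_factor - min_insync_replicas,
        # Smallest number of copies that can hold an acknowledged record.
        "min_copies_of_acked_write": min_insync_replicas,
    }


if __name__ == "__main__":
    for rf, isr in [(3, 2), (3, 1), (2, 2)]:
        print(f"RF={rf} min.insync.replicas={isr}:", write_tolerance(rf, isr))
```

With RF=3 and min.insync.replicas=1, an acknowledged write may exist on a single replica, which is the data-loss case described above.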
For development clusters, RF=1 is fine. For production, RF=3 across three different racks or availability zones is the baseline. RF=4+ adds replication cost without meaningful additional durability for most use cases.
Retention
retention.ms and retention.bytes bound how much data the topic keeps. The defaults (7 days, unlimited bytes) are reasonable starting points but rarely the right answer.
Decide by workload:
| Workload | retention.ms | retention.bytes |
|---|---|---|
| Real-time events with short replay window | 24-72 hours | per-broker budget |
| Standard event streams | 7-14 days | per-broker budget |
| Audit / compliance logs | 30+ days (or compliance requirement) | uncapped, plan disk |
| State / configuration via compaction | unlimited | unlimited (use cleanup.policy=compact) |
| Stream processing checkpoints / changelogs | unlimited (compacted) | unlimited |
Always set retention.bytes as a safety net even if you also set retention.ms. A traffic spike can fill the disk before time-based retention kicks in. retention.bytes applies per partition, so a topic's worst-case footprint is retention.bytes * partitions * replication factor, and the sum across all topics must fit within a comfortable fraction (60-70%) of the cluster's total broker disk.
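The disk check is plain arithmetic. The sketch below assumes replicas are spread evenly across brokers and ignores the active segment and index files, so read the result as a lower bound; the function names and the example cluster (3 brokers, 2 TiB each) are hypothetical.

```python
GIB = 1024 ** 3


def topic_footprint_bytes(retention_bytes: int, partitions: int, replication_factor: int) -> int:
    """Worst-case bytes a topic retains across the whole cluster."""
    return retention_bytes * partitions * replication_factor


def disk_fraction(topics, broker_count: int, disk_bytes_per_broker: int) -> float:
    """Fraction of total cluster disk used if every topic hits its retention.bytes cap.

    topics is a list of (retention_bytes, partitions, replication_factor) tuples.
    """
    total = sum(topic_footprint_bytes(*topic) for topic in topics)
    return total / (broker_count * disk_bytes_per_broker)


if __name__ == "__main__":
    topics = [(50 * GIB, 12, 3)]  # one topic: 50 GiB per partition, 12 partitions, RF=3
    fraction = disk_fraction(topics, broker_count=3, disk_bytes_per_broker=2048 * GIB)
    print(f"{fraction:.0%} of cluster disk; within budget: {fraction <= 0.6}")
```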
Cleanup Policy
Two options, sometimes combined:
- delete (default): drops old messages by time or size. Use for events that age out.
- compact: keeps the latest message per key forever. Use for state (configuration, materialized views, CDC).
- compact,delete: keep the latest per key, but also age them out after a maximum. Useful for slowly-changing state with bounded retention.
Picking the wrong one is a common mistake:
- Putting event streams on compact causes events with repeated keys to disappear. Bad if events are append-only by intent.
- Putting state topics on delete causes state to evaporate when you need it most. Bad for changelogs, KTables, materializations.
Rule of thumb: if "the latest one wins" is meaningful, compact. If every record is an independent fact, delete.
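The difference is easiest to see on a tiny log. The sketch below is a toy model only: real brokers delete whole segments, compact in the background, and handle tombstones, none of which is modeled here. Records are hypothetical (timestamp_ms, key, value) tuples in offset order.

```python
def apply_delete(records, now_ms: int, retention_ms: int):
    """cleanup.policy=delete: keep every record no older than retention_ms."""
    return [r for r in records if now_ms - r[0] <= retention_ms]


def apply_compact(records):
    """cleanup.policy=compact: keep only the most recent record for each key."""
    latest = {}
    for offset, record in enumerate(records):
        _, key, _ = record
        latest[key] = (offset, record)  # later offsets overwrite earlier ones
    # Return survivors in their original offset order.
    return [record for _, record in sorted(latest.values())]


if __name__ == "__main__":
    log = [(1000, "user-1", "a"), (2000, "user-2", "b"), (3000, "user-1", "c")]
    print("compact:", apply_compact(log))
    print("delete: ", apply_delete(log, now_ms=4000, retention_ms=1500))
```

Under compaction the first user-1 record is gone even though it is an independent event, which is why append-only streams belong on delete.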
Compression
Set compression.type per topic rather than per producer if you can help it. The value producer (keep whatever codec the producer used) is a reasonable default. Otherwise:
- lz4: fastest, modest compression ratio. Good default for high-throughput, latency-sensitive workloads.
- zstd: better compression, slightly more CPU. Best balance for storage-bound workloads.
- snappy: older common choice. Use lz4 or zstd instead in new topics.
- gzip: highest compression, very high CPU. Avoid unless you specifically need it.
Enabling compression typically doubles effective throughput on text-heavy data and halves storage cost. There's almost never a reason not to use it.
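You can get a rough feel for the ratio on your own payloads before picking a codec. The sketch below uses Python's built-in zlib (the DEFLATE algorithm behind gzip) only because it ships with the standard library; lz4 and zstd need third-party packages and will give different numbers. The sample events are invented.

```python
import json
import zlib


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Uncompressed size divided by compressed size for one batch."""
    return len(payload) / len(zlib.compress(payload, level))


def sample_batch(count: int = 1000) -> bytes:
    """Build a batch of repetitive JSON events, similar in shape to clickstream data."""
    events = (
        json.dumps({"event": "page_view", "user_id": i, "path": f"/products/{i % 50}"})
        for i in range(count)
    )
    return "\n".join(events).encode("utf-8")


if __name__ == "__main__":
    print(f"ratio: {compression_ratio(sample_batch()):.1f}x")
```

Kafka compresses per batch, so measure with batches of realistic size rather than single messages.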
Schema Management
Topics shouldn't carry mixed schemas. Producers writing different shapes to the same topic force every consumer to handle every variation. Two patterns work:
1. Schema Registry + structured formats
Use Avro, Protobuf, or JSON Schema with Confluent Schema Registry (or an equivalent). The registry enforces compatibility (backward, forward, or full) and prevents incompatible producer changes.
2. Topic per event type
Even with structured formats, separate topics for fundamentally different events. user-events for everything user-related is too broad; user-created, user-updated, user-deleted is better.
Either way, version explicitly:
- Compatible schema change: same topic, new schema version.
- Incompatible schema change: new topic (e.g., orders-v2), dual-write for a deprecation window, cut consumers over.
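The versioning rule can be expressed as a small decision function. The compatibility check below is a deliberately simplified stand-in for what a schema registry does in backward mode: it allows removed fields and added fields with defaults, and treats any type change as incompatible (real registries also permit some type promotions). The field-description format and all function names are hypothetical.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Can a reader on the new schema read records written with the old one?

    Each dict maps field name -> {"type": str, "has_default": bool}.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if not spec["has_default"]:
                return False  # added field with no default: old records cannot fill it
        elif old_fields[name]["type"] != spec["type"]:
            return False  # type changes are treated as incompatible in this sketch
    return True  # fields removed in the new schema are simply ignored by the reader


def topic_name(base: str, version: int) -> str:
    """Version 1 carries no suffix; later versions get -v<N>."""
    return base if version == 1 else f"{base}-v{version}"


def target_topic(base: str, current_version: int, old_fields: dict, new_fields: dict) -> str:
    """Same topic for compatible changes, next versioned topic otherwise."""
    compatible = is_backward_compatible(old_fields, new_fields)
    return topic_name(base, current_version if compatible else current_version + 1)


if __name__ == "__main__":
    v1 = {"id": {"type": "string", "has_default": False}}
    added_optional = {**v1, "note": {"type": "string", "has_default": True}}
    retyped = {"id": {"type": "long", "has_default": False}}
    print(target_topic("orders", 1, v1, added_optional))
    print(target_topic("orders", 1, v1, retyped))
```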
Topic Configuration Defaults
A reasonable production starter set:
num.partitions=12
default.replication.factor=3
min.insync.replicas=2
log.retention.hours=168 # 7 days
log.retention.bytes=-1 # set per-topic instead
log.segment.bytes=1073741824 # 1 GB
compression.type=producer
cleanup.policy=delete
unclean.leader.election.enable=false
auto.create.topics.enable=false
Per-topic, override what makes sense:
kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders \
--alter --add-config \
retention.ms=1209600000,retention.bytes=53687091200,compression.type=zstd
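The raw millisecond and byte values are easy to mistype, so it can help to generate the --add-config argument from human-readable inputs. A minimal sketch with hypothetical helper names; it only builds the string and does not call Kafka.

```python
def days_to_ms(days: float) -> int:
    """Convert days to the milliseconds expected by retention.ms."""
    return int(days * 24 * 60 * 60 * 1000)


def gib_to_bytes(gib: float) -> int:
    """Convert GiB to the bytes expected by retention.bytes (per partition)."""
    return int(gib * 1024 ** 3)


def add_config_arg(retention_days: float, retention_gib: float, compression: str) -> str:
    """Build the comma-separated value for kafka-configs.sh --add-config."""
    return ",".join([
        f"retention.ms={days_to_ms(retention_days)}",
        f"retention.bytes={gib_to_bytes(retention_gib)}",
        f"compression.type={compression}",
    ])


if __name__ == "__main__":
    # 14 days and 50 GiB per partition reproduce the values used above.
    print(add_config_arg(14, 50, "zstd"))
```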
Operational Practices
Disable auto-creation. Set auto.create.topics.enable=false on brokers. Topics should be created explicitly via CLI or IaC (Terraform, Strimzi, ACL-aware tools). Auto-creation turns typos and stale code into permanent topics with default (wrong) settings.
Use IaC for topics. Define topics declaratively in Terraform, Pulumi, or a Kafka-native operator. Manual kafka-topics.sh --create is fine for one-offs but doesn't scale, and there's no audit trail.
Plan deletion carefully. Topic deletion is irreversible. Have a process for deprecation: mark deprecated, monitor producers/consumers for a few weeks, then delete. With delete.topic.enable=true, kafka-topics.sh --delete removes the topic and its data permanently; there is no undo.
Set per-topic ACLs. Production clusters should require ACLs. Producers only have write permission on their topics; consumers only have read and describe. Avoid the "everyone has access to everything" cluster.
Common Mistakes
- One topic per tenant. Doesn't scale. Use a key for tenant identity, ACLs/processing for isolation. See how many topics can Kafka support.
- replication.factor=1 in production. A single broker failure loses data permanently.
- Forgetting min.insync.replicas. acks=all alone doesn't guarantee durability.
- Mixing event types in one topic. Forces every consumer to know every event shape. Separate topics or enforce schemas.
- Setting retention only by time. A traffic spike fills disk before time retention kicks in. Always set retention.bytes as well.
- Enabling unclean.leader.election.enable. Trades durability for availability. Almost never the right call in production.
- Skipping monitoring on under-replicated partitions. This is the leading indicator of trouble. Alert on any non-zero value sustained for more than a minute.
- Letting topic count grow without governance. "We'll clean it up later" never happens. Inventory and prune quarterly.
Monitoring Kafka Topics
Per-topic metrics worth tracking:
- Under-replicated partitions - any non-zero value sustained is a problem.
- Bytes in / bytes out - capacity planning and anomaly detection.
- Consumer lag per group, per topic - growing lag indicates consumers can't keep up. See consumer lag.
- Log size on disk - validates retention is working.
- Leader election rate - frequent leader changes indicate instability.
- Producer error rate per topic - schema violations, ACL failures, broker rejections.
- Topic creation/deletion rate - sudden spikes usually mean something is wrong upstream.
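Two of these checks are simple enough to sketch directly: "non-zero and sustained" for under-replicated partitions, and per-partition consumer lag. Both functions below are illustrative and work on plain Python data; how you scrape the samples and offsets (JMX, an exporter, the admin API) is left out, and the function names are made up.

```python
def sustained_nonzero(samples, window_s: int = 60) -> bool:
    """True if the metric has been non-zero for at least window_s up to the latest sample.

    samples is a list of (timestamp_s, value) pairs in time order.
    """
    if not samples or samples[-1][1] == 0:
        return False
    start = samples[-1][0]
    for ts, value in reversed(samples):
        if value == 0:
            break  # the current non-zero streak began after this sample
        start = ts
    return samples[-1][0] - start >= window_s


def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: log end offset minus the group's committed offset.

    Partitions with no committed offset are counted from 0 in this sketch.
    """
    return {p: end - committed_offsets.get(p, 0) for p, end in end_offsets.items()}


if __name__ == "__main__":
    urp = [(0, 0), (15, 2), (30, 2), (45, 1), (60, 1), (75, 3)]
    print("alert on URP:", sustained_nonzero(urp))
    print("lag:", consumer_lag({0: 100, 1: 50}, {0: 90}))
```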
Pulse provides AI-powered monitoring for Kafka with per-topic visibility into under-replicated partitions, consumer lag, leader imbalance, schema violations, and capacity trends - across all your clusters, with automated root cause analysis. Start a free trial to see what your topics look like to a system that's been tuned for this.
Frequently Asked Questions
Q: What's a good default number of partitions for a Kafka topic?
A: 12 is a reasonable default for most use cases. It supports up to 12 parallel consumers, gives room to grow, and isn't overkill for low-volume topics. For high-throughput topics, size up to peak throughput (assume ~25 MB/s per partition).
Q: What's the best Kafka topic naming convention?
A: Lowercase, hyphen- or dot-separated, with domain and entity in the name: payments.invoice.created, users-profile-updated. Pick one separator (don't mix . and _ because of metrics name collisions). Avoid environment prefixes if you use separate clusters per environment.
Q: Should I create one topic per microservice or per event type?
A: Per event type, almost always. Multiple services emit and consume the same event type, and topics with mixed schemas force every consumer to handle every variant. One service can own multiple topics; one topic should have one schema.
Q: What replication factor should I use?
A: 3 in production, across three different racks or availability zones. RF=1 is dev-only. RF=2 is half a step that leaves you no redundancy after one failure. RF=4+ rarely adds practical durability over RF=3.
Q: How long should I retain Kafka topic data?
A: It depends. Event streams: 7-14 days is typical. Compacted state topics: indefinitely. Audit logs: per compliance requirements. Always set retention.bytes as a safety net even if you also set retention.ms.
Q: Should I enable auto-create-topics in production?
A: No. Auto-create lets typos and stale code create permanent topics with default settings that may not match your durability or retention requirements. Disable it and create topics explicitly via IaC or runbook.
Q: How do I change partition count safely?
A: You can increase partitions on an existing topic (kafka-topics.sh --alter --partitions <new-count>) but you cannot decrease them. Increasing changes key-to-partition assignment, breaking per-key ordering for messages produced before vs after the change. Plan partition counts up front; if you must change, plan a migration.
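The ordering break follows from how keys map to partitions: hash the key, then take it modulo the partition count. The sketch below uses CRC32 as a stand-in hash purely for illustration (Kafka's default partitioner uses murmur2), so the specific partition numbers will not match a real cluster, but the effect of changing the modulus is the same.

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition: hash modulo partition count (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def moved_keys(keys, old_count: int, new_count: int):
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys if partition_for(k, old_count) != partition_for(k, new_count)]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    moved = moved_keys(keys, old_count=12, new_count=18)
    print(f"{len(moved)} of {len(keys)} keys land on a different partition")
```

Every moved key has older records on one partition and newer records on another, so a consumer can no longer rely on seeing them in order.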
Q: When should I use a compacted topic instead of a regular one?
A: When "the latest value per key" is the meaningful state, and earlier values can be discarded. Examples: user profiles, latest configuration, materialized views, Kafka Streams state. For append-only event streams where every record matters, use cleanup.policy=delete.