Kafka Fundamentals: Topics, Partitions, and Consumer Groups


Kafka comes up in most backend engineering roles eventually, usually when someone says “we need to decouple these services” or “we need to handle events at scale.” Most developers know Kafka is a message queue of some kind, that it’s fast, and that it’s used for real-time data. What’s less clear is why it works differently from something like RabbitMQ, and why those differences matter.

The answer is in how Kafka stores and delivers messages - and once you see the model, the rest of it makes sense.

What Makes Kafka Different

A traditional message queue like RabbitMQ is a broker that routes messages from producers to consumers and deletes them once they’re acknowledged. The queue is a temporary buffer: even when messages are written to disk for safety, they are removed once a consumer acknowledges them, so they can’t be read again.

Kafka is a distributed commit log. Messages are written to disk and retained for a configurable period (days, weeks, or indefinitely). Consumers read from the log by position - each consumer decides which offset to read from next, rather than the broker deciding what to hand out. Multiple consumers can read the same message independently.

This changes what you can do with it:

  • Multiple systems can consume the same event stream independently (analytics, billing, notifications - all reading the same order events without coordination)
  • Consumers can replay history - restart from the beginning of the log or from any past position
  • A consumer that falls behind doesn’t lose messages - it just has more to catch up on when it resumes
  • Message delivery guarantees are configurable per producer, not a global broker property
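
To make the log model concrete, here is a minimal in-memory sketch. ToyLog and the positions dictionary are made up for illustration and are not part of any Kafka client; the point is only that reading does not remove anything and each reader tracks its own position.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """An append-only list standing in for a single Kafka partition."""
    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return records starting at `offset`; nothing is deleted on read."""
        return self.records[offset:offset + max_records]


log = ToyLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# Each reader keeps its own position; the log holds no per-reader state.
positions = {"billing": 0, "analytics": 0}

billing_batch = log.read(positions["billing"])
positions["billing"] += len(billing_batch)

analytics_batch = log.read(positions["analytics"], max_records=1)
positions["analytics"] += len(analytics_batch)

# Replay: rewinding a reader's position re-delivers the same records.
positions["billing"] = 0
replayed = log.read(positions["billing"])
```

A slow reader (analytics here) simply has a smaller position; nothing it hasn’t read yet is lost while the record is still retained.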

Topics and Partitions

A topic is a named stream of messages. You produce order events to an “orders” topic, user events to a “users” topic. Topics are how you separate concerns.

A partition is where the actual storage happens. Each topic is divided into one or more partitions, each of which is an ordered, immutable sequence of messages stored on disk. Messages within a partition have sequential offsets: message 0, message 1, message 2, and so on. Each message is identified by its topic, partition, and offset.

Topic: "orders"
  Partition 0: [msg0] [msg1] [msg2] [msg3] ...
  Partition 1: [msg0] [msg1] [msg2] ...
  Partition 2: [msg0] [msg1] [msg2] [msg3] [msg4] ...

Partitions are the unit of parallelism. Within a single consumer group, a topic with one partition can only be read by one consumer at a time. A topic with twelve partitions can be read by up to twelve consumers in that group in parallel.

Partition assignment for producers: when you produce a message, the producer client decides which partition it goes to. Without a key, messages are spread across partitions (strict round-robin in older clients; newer clients fill a batch for one partition before moving to the next). If you provide a message key, Kafka hashes the key and, for a fixed partition count, always routes messages with the same key to the same partition. This matters when you need ordering - Kafka guarantees ordering within a partition, not across partitions.

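import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Messages with the same key go to the same partition
# If you need all events for order_id=123 to be in order, use the order_id as the key
producer.produce(
    topic="orders",
    key="order-123",
    value=json.dumps({"order_id": "123", "status": "shipped"})
)
producer.flush()  # Block until queued messages are delivered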
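
The key-to-partition mapping can be sketched without a broker. This is an illustration only: partition_for is a hypothetical helper, and crc32 stands in for the murmur2 hash that Kafka’s default partitioner uses, so the partition numbers will not match a real cluster.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    crc32 is a stand-in here; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Same key and same partition count -> same partition, every time.
first = partition_for("order-123", 6)
second = partition_for("order-123", 6)

# Changing the partition count can change where keys land, which breaks
# per-key ordering across the resize.
keys = [f"order-{i}" for i in range(100)]
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
```

The second half is the practical caveat: the mapping depends on the partition count, so per-key ordering only holds while that count stays fixed.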

Consumer Groups

A consumer group is a set of consumers that cooperate to consume a topic. Kafka assigns partitions to consumers in the group so that each partition is read by exactly one consumer at a time. This is the parallelism mechanism: add consumers to the group to increase throughput.

Topic "orders" with 6 partitions
Consumer Group "billing-service" with 3 consumers:

  Consumer 1: reads Partitions 0, 1
  Consumer 2: reads Partitions 2, 3
  Consumer 3: reads Partitions 4, 5

If a consumer in the group fails, Kafka reassigns its partitions to the remaining consumers (a rebalance). When you add a consumer, partitions are redistributed.

The number of consumers in a group that actively read is bounded by the number of partitions. If you have 6 partitions and 8 consumers in a group, 2 consumers will sit idle. If you want more parallelism, increase the partition count, keeping in mind that adding partitions changes which partition a given key maps to.
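
A rough sketch of the assignment logic, reproducing the layout shown above. The assign function is hypothetical and much simpler than Kafka’s real assignors (range, round-robin, sticky), which differ in how they pick and move partitions.

```python
def assign(num_partitions, consumers):
    """Give each consumer a contiguous block of partitions.

    Illustrative only; Kafka's real assignors differ in detail.
    """
    base, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


three = assign(6, ["c1", "c2", "c3"])
after_failure = assign(6, ["c1", "c3"])          # c2 crashed, group rebalances
eight = assign(6, [f"c{i}" for i in range(1, 9)])
idle = [c for c, parts in eight.items() if not parts]
```

With three consumers each gets two partitions; after c2 fails the survivors take three each; with eight consumers, two end up with nothing to read.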

Multiple independent consumer groups can read the same topic without affecting each other. Each group maintains its own offsets - the “billing-service” group and the “analytics-service” group both get every order event, independently.

Offsets and Delivery Guarantees

Kafka tracks each consumer group’s position in each partition as an offset - which message the group has read up to. Offsets are committed back to Kafka (stored in a special internal topic called __consumer_offsets).

This is where delivery semantics live:

At most once: commit the offset before processing the message. If processing fails, the message is lost but never re-processed. Rarely the right choice.

At least once: commit the offset after processing successfully. If the consumer crashes between processing and committing, the message will be re-processed. Messages can be processed more than once. This is the common default.

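Exactly once: Kafka supports exactly-once semantics through idempotent producers and transactional APIs, but it requires both producer and consumer to participate (the consumer has to read with isolation.level=read_committed). The guarantee covers reads and writes inside Kafka, not side effects in external systems. Complex to implement correctly.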

For most use cases, at-least-once with idempotent message processing is the practical answer: design your consumers to handle duplicates (idempotency keys, deduplication by message ID), and use at-least-once delivery.
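
Here is a small simulation of that approach: commit after processing, and make the processing step safe to repeat. Everything in it (the consume function, the record shape, the in-memory processed_ids set) is invented for illustration; in a real service the set of seen IDs would need to live in durable storage, ideally updated together with the side effect.

```python
def consume(records, committed, processed_ids, balances, crash_before_commit_at=None):
    """Process records from `committed` onward, committing after each one.

    Returns the new committed offset. If `crash_before_commit_at` is set, the
    consumer "crashes" after processing that offset but before committing it.
    """
    for offset in range(committed, len(records)):
        msg = records[offset]
        if msg["id"] not in processed_ids:       # dedupe by message ID
            balances[msg["account"]] = balances.get(msg["account"], 0) + msg["amount"]
            processed_ids.add(msg["id"])
        if offset == crash_before_commit_at:
            return committed                     # this offset was never committed
        committed = offset + 1                   # commit the next offset to read
    return committed


records = [
    {"id": "m1", "account": "a", "amount": 10},
    {"id": "m2", "account": "a", "amount": 5},
]
processed_ids, balances = set(), {}

# First run crashes after processing offset 1 but before committing it.
committed = consume(records, 0, processed_ids, balances, crash_before_commit_at=1)
committed_after_crash = committed

# Restart: offset 1 is delivered again, but the ID check skips the side effect.
committed = consume(records, committed, processed_ids, balances)
```

Message m2 is delivered twice, yet the balance is only updated once, which is the whole point of pairing at-least-once delivery with idempotent handling.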

Retention and the Log

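Messages in Kafka are retained based on time or size. The broker sets the defaults shown below, and each topic can override them with retention.ms and retention.bytes:

# Keep messages for 7 days
log.retention.hours=168
# Or until a partition reaches 100 GiB, whichever limit is hit first
log.retention.bytes=107374182400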

After retention expires, old messages are deleted. Log compaction is an alternative cleanup mode in which Kafka keeps at least the latest message for each key, which is useful for maintaining current state.
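
The effect of compaction can be sketched in a few lines. This is a simplification: the compact function and the (offset, key, value) tuples are illustrative, and real compaction runs in the background on closed log segments rather than on the whole log at once. What it does show correctly is that surviving records keep their original offsets.

```python
def compact(records):
    """Keep only the most recent record per key, preserving offset order.

    `records` is a list of (offset, key, value) tuples.
    """
    latest_offset = {}
    for offset, key, _ in records:
        latest_offset[key] = offset
    return [r for r in records if latest_offset[r[1]] == r[0]]


log = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example.com"),
    (2, "user-1", "alice@new.example"),
]
compacted = compact(log)
```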

The retention period is what makes Kafka useful as a replay mechanism. A new service consuming order events can start from the beginning of the log and process historical events. A bug in a consumer? Fix it, reset the offset to before the bug was introduced, and reprocess.

Producers and Durability

When a producer writes a message, it can wait for different levels of acknowledgment:

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",  # Wait for all in-sync replicas to acknowledge
})

  • acks=0: no acknowledgment, fire and forget. Fastest, no durability guarantee.
  • acks=1: wait for the partition leader to write. The message is lost if the leader fails before replication.
  • acks=all (or -1): wait for all in-sync replicas to acknowledge. Durability at the cost of latency.

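For most production use cases, acks=all with replication factor 3 is the starting point, usually paired with min.insync.replicas=2 so that “all in-sync replicas” can never shrink to the leader alone.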
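
One subtlety is that “all in-sync replicas” is a moving target: replicas that fall behind drop out of the in-sync set. The topic setting min.insync.replicas puts a floor under it. The sketch below models only that one rule; write_accepted is a hypothetical function, not a Kafka API.

```python
def write_accepted(acks, in_sync_replicas, min_insync_replicas):
    """Return True if the leader would accept a produce request.

    Models one rule only: with acks=all, the write is rejected when fewer
    than min.insync.replicas replicas are in sync. acks=0 and acks=1 do not
    apply that check.
    """
    if in_sync_replicas < 1:
        return False   # no leader available
    if acks == "all":
        return in_sync_replicas >= min_insync_replicas
    return True


# Replication factor 3, min.insync.replicas=2
healthy = write_accepted("all", in_sync_replicas=3, min_insync_replicas=2)
one_down = write_accepted("all", in_sync_replicas=2, min_insync_replicas=2)
two_down = write_accepted("all", in_sync_replicas=1, min_insync_replicas=2)
leader_only = write_accepted(1, in_sync_replicas=1, min_insync_replicas=2)
```

With these settings the topic keeps accepting acks=all writes with one broker down and refuses them with two down, trading availability for the guarantee that every acknowledged write exists on at least two replicas.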

Operational Intuition: The Things That Bite You

The concepts above are the foundation. What’s less documented is how Kafka behaves in practice when things go wrong - which is where operational understanding matters.

Consumer lag. Lag is the number of messages a consumer is behind the latest offset. A lag of zero means the consumer is keeping up. Growing lag means it can’t. Kafka itself is fine - it’s just storing messages. The problem is the consumer.

Monitor lag per consumer group per partition. A single partition with runaway lag while others are healthy usually means a message is taking too long to process (or throwing an exception and being retried repeatedly). A group where all partitions are lagging means the consumer needs more instances or the processing logic needs to be faster.

If lag grows unboundedly and the retention period expires before the consumer catches up, you lose events. Size your retention period against your worst-case recovery time.
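
The arithmetic behind lag monitoring is simple enough to sketch. The function names and the offset numbers below are made up for illustration; in practice the end offsets and committed offsets come from your monitoring tooling or the Kafka admin APIs.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = log end offset minus the group's committed offset."""
    return {p: end_offsets[p] - committed_offsets[p] for p in end_offsets}


def seconds_to_catch_up(lag, consume_rate, produce_rate):
    """Time to drain `lag` messages; None if the consumer never catches up."""
    if consume_rate <= produce_rate:
        return None
    return lag / (consume_rate - produce_rate)


end_offsets = {0: 1_000, 1: 1_200, 2: 50_000}
committed = {0: 990, 1: 1_200, 2: 10_000}

lag = partition_lag(end_offsets, committed)   # partition 2 stands out
eta = seconds_to_catch_up(lag[2], consume_rate=500, produce_rate=100)
never = seconds_to_catch_up(lag[2], consume_rate=100, produce_rate=100)
```

If the catch-up time is longer than the retention period, or there is no catch-up time at all, the consumer will start missing events.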

Rebalancing. A rebalance happens when a consumer joins or leaves a group, or when the set of partitions the group subscribes to changes. With the classic eager protocol, consumption pauses across the entire group during a rebalance; cooperative assignors pause only the partitions that move. If rebalances happen frequently - because consumers are crashing and recovering, or because they’re taking too long to process a batch and getting removed from the group - you get significant throughput degradation.

Common cause: the time between two poll() calls exceeds max.poll.interval.ms. The consumer is treated as failed and leaves the group, which triggers a rebalance. Fix: raise max.poll.interval.ms, process faster, or reduce the batch size (max.poll.records in the Java client).
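
A back-of-the-envelope check helps here. The sketch below assumes the Java client defaults (max.poll.interval.ms=300000 and max.poll.records=500) and a made-up processing time of 800 ms per record; max_safe_batch and its safety factor are illustrative, not a Kafka setting.

```python
def max_safe_batch(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Largest batch that finishes within a fraction of max.poll.interval.ms."""
    return int(max_poll_interval_ms * safety_factor // per_record_ms)


# Java-client defaults: max.poll.interval.ms=300000, max.poll.records=500.
worst_case_ms = 500 * 800            # 500 records at 800 ms each
exceeds = worst_case_ms > 300_000    # the consumer would be removed from the group
safe = max_safe_batch(300_000, per_record_ms=800)
```

Under those assumptions a full batch takes 400 seconds against a 300-second limit, so either the batch size has to come down or the interval has to go up.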

Ordering guarantees. Kafka guarantees ordering within a partition, not across partitions. If you need all events for a given entity to be in order (all events for user ID 42, all events for order ID 123), use the entity ID as the message key. Same key always routes to the same partition, preserving order.

If you use a key, be aware that hot keys create hot partitions. If 20% of your traffic is for a handful of entity IDs, that 20% lands on a handful of partitions, and the few consumers that own them carry far more than their share of the work while the rest sit underused. Monitor partition-level lag, not just group-level lag.
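
The skew is easy to reproduce with synthetic traffic. The numbers and key names below are invented, and crc32 again stands in for Kafka’s murmur2 hash, so treat this as a sketch of the effect rather than a prediction for a real cluster.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Stable stand-in hash; Kafka's default partitioner uses murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical traffic: one hot customer produces 2,000 events, while
# 800 other customers produce 10 events each (8,000 in total).
events = ["customer-hot"] * 2_000
events += [f"customer-{i}" for i in range(800) for _ in range(10)]

load = Counter(partition_for(key, 8) for key in events)
hot_partition = partition_for("customer-hot", 8)
hot_share = load[hot_partition] / len(events)
even_share = 1 / 8
```

With eight partitions an even spread would put 12.5% of events on each; the partition that owns the hot key gets at least 20%, and adding consumers does not help because one partition is still read by one consumer.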

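Offset management in at-least-once delivery. Auto-commit (enable.auto.commit=true) commits offsets on a timer (auto.commit.interval.ms, 5 seconds by default). If your consumer processes messages but crashes before the next auto-commit, the messages get reprocessed after restart. This is usually fine if your processing is idempotent. If it’s not, disable auto-commit and commit offsets manually after successful processing. That narrows the window for duplicates but does not close it: a crash between processing and the commit still causes a redelivery.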
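
A manual-commit loop might look like the sketch below. run_once is written against the poll() and commit() methods of confluent-kafka’s Consumer, on the assumption that the consumer was created with enable.auto.commit set to false. FakeConsumer and FakeMessage are stand-ins so the example runs without a broker; they are not part of any library.

```python
def run_once(consumer, handle, timeout=1.0):
    """Poll one message, process it, then commit its offset synchronously.

    Returns True if a message was processed, False if none arrived.
    """
    msg = consumer.poll(timeout)
    if msg is None:
        return False
    if msg.error():
        raise RuntimeError(str(msg.error()))
    handle(msg.value())           # if this raises, the offset is not committed
    consumer.commit(message=msg, asynchronous=False)
    return True


class FakeMessage:
    def __init__(self, value):
        self._value = value

    def value(self):
        return self._value

    def error(self):
        return None


class FakeConsumer:
    """Stand-in exposing only the two consumer methods used above."""

    def __init__(self, values):
        self._pending = [FakeMessage(v) for v in values]
        self.committed = []

    def poll(self, timeout=None):
        return self._pending.pop(0) if self._pending else None

    def commit(self, message=None, asynchronous=True):
        self.committed.append(message.value())


consumer = FakeConsumer([b"a", b"b"])
seen = []
while run_once(consumer, seen.append):
    pass
```

Committing synchronously after every message is the simplest form and also the slowest; committing once per batch is a common compromise, at the cost of a larger replay window after a crash.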

When to Use Kafka

Kafka is the right tool when:

  • You have high-volume event data (thousands of events per second or more)
  • Multiple independent consumers need to process the same events
  • You need replay capability - consumers that can re-read historical events
  • You’re building event sourcing or audit log patterns
  • You need to decouple services without losing events if a consumer is temporarily down

It’s probably the wrong tool when:

  • You need request-reply semantics (use HTTP or gRPC)
  • Your message volume is low (a simple database queue or RabbitMQ is easier to operate)
  • You need complex routing or priority queues (RabbitMQ handles these better)
  • You need messages to be deleted immediately after consumption (Kafka’s log retention adds complexity)

The operational overhead of running Kafka is real. If your use case doesn’t require the specific properties Kafka offers, something simpler will serve you better. But when you need durable, replayable, high-throughput event streaming with independent consumers - Kafka is genuinely the right choice.


