
Consumer lag in Kafka

Introduction

Consumer lag is the difference between the latest offset in a partition (the log end offset) and the offset committed by a consumer group for that partition.

Lag = Latest Offset - Consumer Offset

Example:

Latest Offset    = 1000
Consumer Offset  = 800
Lag = 200

The consumer is 200 messages behind the producer.
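
The same arithmetic can be written as a small helper. This is a minimal Python sketch; the function name `partition_lag` is illustrative and not part of any Kafka client API.

```python
def partition_lag(latest_offset: int, committed_offset: int) -> int:
    """Return how many messages the consumer group is behind on one partition."""
    # Clamp at zero: offsets sampled at slightly different moments can
    # briefly make the committed offset appear ahead of the latest one.
    return max(latest_offset - committed_offset, 0)


print(partition_lag(1000, 800))  # 200
```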

Why Lag Exists

Producers and consumers run independently.

Producer ---> Kafka ---> Consumer

If the producer writes faster than the consumer can process (Producer Rate > Consumer Rate), lag starts increasing. Some amount of lag is normal because Kafka is designed to decouple producers and consumers.

When Is Lag A Problem?

Lag itself is not a problem. Many systems intentionally operate with some lag, such as:

The real concern is when consumers are unable to catch up for an extended period. If lag keeps increasing over time, a bottleneck exists somewhere in the system.

Producer Rate > Consumer Processing Rate
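
One way to separate a temporary backlog from a sustained bottleneck is to compare successive lag samples. The sketch below assumes the samples were collected at regular intervals by some monitoring job; `lag_is_growing` and its `min_increase` parameter are hypothetical names chosen for illustration.

```python
def lag_is_growing(samples: list[int], min_increase: int = 0) -> bool:
    """Return True if lag rose between every pair of consecutive samples."""
    if len(samples) < 2:
        return False  # not enough data to call it a trend
    return all(b - a > min_increase for a, b in zip(samples, samples[1:]))


print(lag_is_growing([200, 450, 900, 1600]))  # True: lag rises every interval
print(lag_is_growing([200, 450, 300, 100]))   # False: the consumer caught up
```

A stricter or looser rule (for example, a moving average) may suit noisy workloads better; this version only flags strictly monotonic growth.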

How To Investigate Consumer Lag

The first step is to determine whether the issue originates from Kafka, the consumer application, or a downstream dependency such as a database or external API.

Useful command:

kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--group order-consumer \
--describe

Example output:

TOPIC     PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders    0         1000           5000           4000
orders    1         4990           5000           10
orders    2         5000           5000           0

Notice that only partition 0 has significant lag. This often indicates a problem local to that partition, such as a skewed partition key or a slow consumer instance.

It does not necessarily mean the entire consumer group is unhealthy. Always inspect lag at the partition level before looking at total lag.
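
Partition-level inspection can be scripted. The sketch below parses text in the simplified column layout shown above and returns the partitions whose lag exceeds a threshold. The function name and the threshold are assumptions, and real tool output typically carries extra columns (group, consumer id, host), which the header-based lookup tolerates as long as values contain no spaces.

```python
SAMPLE_OUTPUT = """\
TOPIC     PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders    0         1000           5000           4000
orders    1         4990           5000           10
orders    2         5000           5000           0
"""


def lagging_partitions(describe_output: str, threshold: int) -> list[tuple[str, int, int]]:
    """Return (topic, partition, lag) for rows whose lag exceeds the threshold."""
    lines = describe_output.strip().splitlines()
    header = lines[0].split()
    results = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        lag = row["LAG"]
        if not lag.isdigit():  # skip rows with a non-numeric lag value
            continue
        if int(lag) > threshold:
            results.append((row["TOPIC"], int(row["PARTITION"]), int(lag)))
    return results


print(lagging_partitions(SAMPLE_OUTPUT, threshold=1000))  # [('orders', 0, 4000)]
```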

Common Causes

1. Consumer Processing Is Slow

A consumer rarely just reads messages. It usually performs additional work such as calling APIs, writing to databases, or invoking other services.

Example:

processMessage(message) {
    callExternalApi(message);   // blocks on a network round trip
    saveToDatabase(message);    // blocks on a database write
}

If the API or database becomes slow, consumer throughput drops and lag starts accumulating.
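
To find out which step is responsible, it can help to time each stage separately. The following Python sketch uses hypothetical stand-ins (`call_external_api`, `save_to_database`) with simulated delays; in a real consumer the timings would more likely be exported as metrics than printed.

```python
import time


def timed(stage_timings: dict[str, float], name: str, func, *args):
    """Run func(*args) and add its wall-clock duration to stage_timings[name]."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        elapsed = time.perf_counter() - start
        stage_timings[name] = stage_timings.get(name, 0.0) + elapsed


# Hypothetical stand-ins for the real dependencies.
def call_external_api(message):
    time.sleep(0.05)  # simulate a slow API


def save_to_database(message):
    time.sleep(0.001)  # simulate a fast write


def process_message(message, stage_timings):
    timed(stage_timings, "api", call_external_api, message)
    timed(stage_timings, "db", save_to_database, message)


timings: dict[str, float] = {}
for msg in range(5):
    process_message(msg, timings)

slowest = max(timings, key=timings.get)
print(slowest, timings)  # the stage with the largest accumulated time comes first
```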

Fixes:

2. Too Few Consumers

Suppose a topic has 12 partitions but only 2 consumers. Each consumer is responsible for processing 6 partitions.

Fixes:

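Adding more consumers than partitions provides no benefit. Partitions are the unit of parallelism in Kafka: within a consumer group each partition is assigned to at most one consumer, so any extra consumers sit idle.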
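
The relationship between partitions and consumers can be sketched numerically. This example assumes a single topic and an evenly balanced assignment, which real assignors only approximate; the function name is illustrative.

```python
def consumer_utilization(partitions: int, consumers: int) -> dict[str, int]:
    """Summarize how partitions spread across the consumers of one group."""
    active = min(partitions, consumers)
    # Ceiling division gives the load of the busiest consumer.
    max_per_consumer = -(-partitions // active) if active else 0
    return {
        "active_consumers": active,
        "idle_consumers": consumers - active,
        "max_partitions_per_consumer": max_per_consumer,
    }


print(consumer_utilization(12, 2))   # 2 active, 0 idle, 6 partitions each
print(consumer_utilization(12, 16))  # 12 active, 4 idle, 1 partition each
```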

3. Rebalancing

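Under the eager rebalance protocol, every consumer in the group stops processing messages while partitions are reassigned. Cooperative rebalancing narrows the pause to the partitions that actually move. Either way, frequent rebalancing can cause temporary lag spikes.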

Common triggers:

Symptoms:

Lag spike
Consumer group state = PreparingRebalance or CompletingRebalance

Fixes:

4. Data Skew

A poor partition key can cause uneven traffic distribution across partitions.

Example:

Partition 0 -> 50000 lag
Partition 1 -> 0 lag
Partition 2 -> 0 lag
Partition 3 -> 0 lag

One partition becomes overloaded while the others remain mostly idle.
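
The effect of a hot key can be reproduced with a toy partitioner. The sketch below uses `zlib.crc32` as a stand-in hash (Kafka's default partitioner uses murmur2 for keyed records, so the partition numbers here will not match a real cluster), and the tenant keys are made-up sample data.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing (crc32 stands in for murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys: list[str], num_partitions: int) -> Counter:
    """Count how many messages land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


# Hypothetical traffic: one tenant produces 90% of the messages.
skewed_keys = ["tenant-big"] * 900 + [f"tenant-{i}" for i in range(100)]
counts = partition_counts(skewed_keys, num_partitions=4)
print(counts.most_common(1))  # the hot partition holds at least 900 messages
```

Because every message with the same key hashes to the same partition, the dominant key pins its whole volume to one partition regardless of how many partitions exist.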

Fixes:

Metrics To Monitor

Consumer Lag

Represents how many messages the consumer group is behind.

Example:

Lag = 10000 messages

Consumer Throughput

Represents how many messages the consumer can process per second.

Example:

2000 msg/sec

Producer Throughput

Represents how many messages are produced per second.

Example:

5000 msg/sec

If Producer Throughput > Consumer Throughput, lag will continue increasing.
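
With the example figures above, the growth can be projected directly. This is a rough sketch that assumes both rates stay constant; `lag_after` is an illustrative name.

```python
def lag_after(initial_lag: int, produce_rate: float, consume_rate: float, seconds: float) -> float:
    """Project lag after `seconds`, assuming both rates stay constant."""
    projected = initial_lag + (produce_rate - consume_rate) * float(seconds)
    return max(projected, 0.0)  # lag cannot drop below zero


# 10000 messages behind, producing 5000 msg/sec, consuming 2000 msg/sec.
print(lag_after(10_000, produce_rate=5000, consume_rate=2000, seconds=60))  # 190000.0
```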

Rebalance Count

Frequent rebalances often indicate consumer instability and can lead to lag spikes.

Lag Recovery Time

Raw lag numbers can be misleading.

Consider a lag of 10000 messages.

If the consumer processes 10000 messages per second, recovery takes approximately 1 second. If the consumer processes only 100 messages per second, recovery takes approximately 100 seconds.

Both scenarios have the same lag but vastly different operational impact.

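A useful approximation, assuming no new messages arrive while the consumer catches up, is:

Recovery Time = Lag / Consumer Throughput

If producers keep writing during recovery, divide by the net drain rate (Consumer Throughput - Producer Throughput) instead.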

Monitoring recovery time often provides more insight than monitoring lag alone.
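
A minimal Python sketch of this estimate follows. The function name and the optional `produce_rate` parameter are assumptions added for illustration; the parameter covers the case where producers keep writing while the consumer catches up, so the backlog drains at the difference of the two rates.

```python
import math


def recovery_time_seconds(lag: int, consume_rate: float, produce_rate: float = 0.0) -> float:
    """Estimate seconds until lag reaches zero at constant rates."""
    if lag <= 0:
        return 0.0
    drain_rate = consume_rate - produce_rate
    if drain_rate <= 0:
        return math.inf  # the consumer never catches up
    return lag / drain_rate


print(recovery_time_seconds(10_000, consume_rate=10_000))  # 1.0
print(recovery_time_seconds(10_000, consume_rate=100))     # 100.0
print(recovery_time_seconds(10_000, consume_rate=2000, produce_rate=5000))  # inf
```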

Quick Troubleshooting Checklist

When lag starts increasing, check the following:

Most consumer lag incidents are caused by slow consumers or downstream dependencies rather than Kafka itself.

Key Takeaways

