Introduction
Consumer lag is the difference between the latest offset in a partition and the offset committed by a consumer group.
Lag = Latest Offset - Consumer Offset
Example:
Latest Offset = 1000
Consumer Offset = 800
Lag = 200
The consumer is 200 messages behind the producer.
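As a minimal sketch of the formula above, the lag calculation can be written in a few lines of Python. The function names are illustrative and not part of any Kafka client, and treating a missing committed offset as 0 is a simplifying assumption.

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition; clamped at zero in case the two offsets are read at different moments."""
    return max(log_end_offset - committed_offset, 0)


def group_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag keyed by partition number.

    Assumption: a partition with no committed offset is treated as committed at 0.
    """
    return {
        partition: partition_lag(end, committed.get(partition, 0))
        for partition, end in end_offsets.items()
    }


print(partition_lag(1000, 800))  # 200, matching the example above
```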
Why Lag Exists
Producer and consumer run independently.
Producer ---> Kafka ---> Consumer
If the producer writes faster than the consumer can process (Producer Rate > Consumer Rate), lag starts increasing. Some amount of lag is normal because Kafka is designed to decouple producers and consumers.
When Is Lag A Problem?
Lag itself is not a problem. Many systems intentionally operate with some lag, such as:
- Analytics pipelines
- Batch processing systems
- Report generation workloads
The real concern is when consumers are unable to catch up for an extended period. If lag keeps increasing over time, a bottleneck exists somewhere in the system.
Producer Rate > Consumer Processing Rate
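One way to tell "some lag" apart from "lag that keeps increasing" is to look at a window of lag samples rather than a single reading. The sketch below is a hypothetical helper, not a standard metric: it assumes the samples are taken at regular intervals and uses a deliberately strict rule (lag never drops and ends higher than it started).

```python
def lag_is_growing(samples: list[int], min_increase: int = 0) -> bool:
    """True if lag never decreased across the window and grew by more than min_increase overall."""
    if len(samples) < 2:
        return False  # a single reading says nothing about the trend
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] > min_increase


print(lag_is_growing([100, 250, 400, 900]))  # True: consumers are falling behind
print(lag_is_growing([900, 400, 950, 300]))  # False: lag fluctuates but recovers
```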
How To Investigate Consumer Lag
The first step is to determine whether the issue originates from Kafka, the consumer application, or a downstream dependency such as a database or external API.
Useful command:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--group order-consumer \
--describe
Example output:
TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
orders  0          1000            5000            4000
orders  1          4990            5000            10
orders  2          5000            5000            0
Notice that only partition 0 has significant lag. This often indicates:
- Data skew
- A hot partition
- A slow consumer instance
It does not necessarily mean the entire consumer group is unhealthy. Always inspect lag at the partition level before looking at total lag.
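The partition-level check can be automated by parsing the describe output. The sketch below assumes the simplified five-column layout shown in the example; real output from the tool contains additional columns and may show "-" where no offset has been committed, which this parser does not handle. The function names and the threshold are illustrative.

```python
SAMPLE = """\
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders 0 1000 5000 4000
orders 1 4990 5000 10
orders 2 5000 5000 0
"""


def parse_lag(text: str) -> dict[tuple[str, int], int]:
    """Map (topic, partition) -> lag from whitespace-separated rows with a header line."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    topic_i = header.index("TOPIC")
    part_i = header.index("PARTITION")
    lag_i = header.index("LAG")
    return {(row[topic_i], int(row[part_i])): int(row[lag_i]) for row in rows}


def lagging_partitions(lags: dict[tuple[str, int], int], threshold: int) -> list:
    """Partitions whose lag exceeds the threshold, worst first."""
    over = [(tp, lag) for tp, lag in lags.items() if lag > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


lags = parse_lag(SAMPLE)
print(sum(lags.values()))                        # 4010 total, which hides the skew
print(lagging_partitions(lags, threshold=1000))  # [(('orders', 0), 4000)]
```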
Common Causes
1. Consumer Processing Is Slow
A consumer rarely just reads messages. It usually performs additional work such as calling APIs, writing to databases, or invoking other services.
Example:
processMessage(message) {
    callExternalApi(message);  // blocking network call
    saveToDatabase(message);   // blocking write, runs once per message
}
If the API or database becomes slow, consumer throughput drops and lag starts accumulating.
Fixes:
- Optimize processing logic
- Use batching where possible
- Reduce blocking operations
- Move expensive work off the message-handling path and run it asynchronously
- Scale downstream dependencies
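To illustrate the batching fix, here is a minimal sketch that groups messages and hands each group to a bulk writer, so the downstream system sees one call per batch instead of one per message. `save_batch` is a hypothetical callback standing in for whatever bulk API your database offers, and the batch size of 500 is arbitrary.

```python
from typing import Callable, Iterable, Iterator


def batched(messages: Iterable, size: int) -> Iterator[list]:
    """Yield lists of up to `size` messages, preserving order."""
    batch: list = []
    for message in messages:
        batch.append(message)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def process_in_batches(messages: Iterable, save_batch: Callable[[list], None], size: int = 500) -> int:
    """Call the (hypothetical) bulk writer once per batch; return the number of calls made."""
    calls = 0
    for batch in batched(messages, size):
        save_batch(batch)
        calls += 1
    return calls


written: list = []
calls = process_in_batches([{"id": i} for i in range(1200)], written.extend)
print(calls, len(written))  # 3 1200
```

In a real consumer, offsets would normally be committed only after a batch has been saved, so a crash replays the batch rather than losing it.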
2. Too Few Consumers
Suppose a topic has 12 partitions but only 2 consumers. Each consumer is responsible for processing 6 partitions.
Fixes:
- Increase consumer count
- Ensure Consumers <= Partitions
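Adding more consumers than partitions provides no benefit because partitions are the unit of parallelism in Kafka. Within a group, each partition is assigned to at most one consumer, so the extra consumers sit idle.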
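The arithmetic can be sketched as follows. This assumes a single topic and a perfectly even spread, which real assignors only approximate; the function name is illustrative.

```python
def partitions_per_consumer(partitions: int, consumers: int) -> list[int]:
    """Partition count per consumer under an even spread; surplus consumers get 0."""
    if consumers <= 0:
        raise ValueError("need at least one consumer")
    base, extra = divmod(partitions, consumers)
    return [base + (1 if i < extra else 0) for i in range(consumers)]


print(partitions_per_consumer(12, 2))             # [6, 6]
print(partitions_per_consumer(12, 16).count(0))   # 4 consumers receive no partitions
```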
3. Rebalancing
With the default eager rebalance protocol, every consumer in the group stops processing messages while a rebalance is in progress. Frequent rebalancing can cause temporary lag spikes.
Common triggers:
- Consumer crash
- Rolling deployment
- Heartbeat timeout
- New consumer joining the group
- Consumer leaving the group
Symptoms:
Lag spike
Consumer group state = PreparingRebalance or CompletingRebalance
Fixes:
- Avoid unnecessary deployments
- Tune heartbeat settings
- Investigate unstable consumers
- Use CooperativeStickyAssignor
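A configuration sketch for the last two fixes is shown below. It assumes a librdkafka-based Python client such as confluent-kafka, where the cooperative assignor is selected with the value `cooperative-sticky`; the Java client instead takes the class name `org.apache.kafka.clients.consumer.CooperativeStickyAssignor`. The timeout values are illustrative, not recommendations, and `heartbeat_settings_ok` is a hypothetical helper.

```python
consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-consumer",
    # Incremental rebalancing: only the partitions that actually move are revoked.
    "partition.assignment.strategy": "cooperative-sticky",
    # Illustrative values; tune them for your own workload.
    "session.timeout.ms": 45000,
    "heartbeat.interval.ms": 3000,
    "max.poll.interval.ms": 300000,
}


def heartbeat_settings_ok(config: dict) -> bool:
    """Check the usual guideline: heartbeat interval at most one third of the session timeout."""
    return config["heartbeat.interval.ms"] * 3 <= config["session.timeout.ms"]


print(heartbeat_settings_ok(consumer_config))  # True
```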
4. Data Skew
A poor partition key can cause uneven traffic distribution across partitions.
Example:
Partition 0 -> 50000 lag
Partition 1 -> 0 lag
Partition 2 -> 0 lag
Partition 3 -> 0 lag
One partition becomes overloaded while the others remain mostly idle.
Fixes:
- Revisit the partition key
- Increase partition key cardinality
- Avoid hot keys
- Distribute traffic more evenly
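The effect of a hot key, and of spreading it with a suffix ("salting"), can be sketched with a toy partitioner. Note the assumptions: `toy_partition` is a stand-in that sums the key bytes, not Kafka's real murmur2-based default partitioner, and the key name and suffix scheme are made up for illustration.

```python
from collections import Counter


def toy_partition(key: str, num_partitions: int) -> int:
    """Stand-in for Kafka's key hash (NOT murmur2): byte sum modulo partition count."""
    return sum(key.encode("utf-8")) % num_partitions


def distribution(keys, num_partitions: int) -> Counter:
    """Count how many messages land on each partition."""
    return Counter(toy_partition(key, num_partitions) for key in keys)


hot = ["customer-42"] * 900                            # one hot key
salted = [f"customer-42#{i % 8}" for i in range(900)]  # same key plus a rotating suffix

print(max(distribution(hot, 4).values()))     # 900: every message on one partition
print(sorted(distribution(salted, 4).values()))  # [225, 225, 225, 225]
```

Salting has a cost: messages for the same logical key now span several partitions, so per-key ordering is no longer guaranteed.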
Metrics To Monitor
Consumer Lag
Represents how many messages the consumer group is behind.
Example:
Lag = 10000 messages
Consumer Throughput
Represents how many messages the consumer can process per second.
Example:
2000 msg/sec
Producer Throughput
Represents how many messages are produced per second.
Example:
5000 msg/sec
If Producer Throughput > Consumer Throughput, lag will continue increasing.
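Using the example rates above, a short sketch shows how quickly lag builds when the producer outpaces the consumer. It assumes both rates stay constant over the interval; the function name is illustrative.

```python
def projected_lag(current_lag: int, producer_rate: float, consumer_rate: float, seconds: float) -> float:
    """Lag after `seconds` if both rates stay constant; never below zero."""
    growth_per_second = producer_rate - consumer_rate
    return max(float(current_lag + growth_per_second * seconds), 0.0)


# 5000 msg/sec in, 2000 msg/sec out: lag grows by 3000 messages every second.
print(projected_lag(10_000, 5000, 2000, seconds=60))  # 190000.0
```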
Rebalance Count
Frequent rebalances often indicate consumer instability and can lead to lag spikes.
Lag Recovery Time
Raw lag numbers can be misleading.
Consider a lag of 10000 messages.
If the consumer processes 10000 messages per second, recovery takes approximately 1 second. If the consumer processes only 100 messages per second, recovery takes approximately 100 seconds.
Both scenarios have the same lag but vastly different operational impact.
A useful approximation, assuming no new messages arrive while the consumer catches up, is:
Recovery Time = Lag / Consumer Throughput
Monitoring recovery time often provides more insight than monitoring lag alone.
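A small sketch of the recovery-time estimate follows. It reproduces the two scenarios above and, as an extension not stated in the formula, accepts an optional producer rate so the estimate uses the net drain rate when messages keep arriving. The function name and parameters are illustrative.

```python
import math


def recovery_seconds(lag: int, consumer_rate: float, producer_rate: float = 0.0) -> float:
    """Seconds to drain `lag`; infinite if the consumer is not outpacing the producer."""
    net_drain = consumer_rate - producer_rate
    if net_drain <= 0:
        return math.inf
    return lag / net_drain


print(recovery_seconds(10_000, 10_000))                    # 1.0
print(recovery_seconds(10_000, 100))                       # 100.0
print(recovery_seconds(10_000, 2000, producer_rate=5000))  # inf
```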
Quick Troubleshooting Checklist
When lag starts increasing, check the following:
- Is the consumer alive?
- Is a rebalance occurring?
- Is the database responding slowly?
- Is an external API causing delays?
- Is there data skew?
- Are there enough consumers?
- Are there enough partitions?
- Has producer traffic increased suddenly?
Most consumer lag incidents are caused by slow consumers or downstream dependencies rather than Kafka itself.
Key Takeaways
- Lag = Latest Offset - Consumer Offset
- Some lag is normal and expected
- Continuously growing lag indicates a bottleneck
- Always investigate lag at the partition level
- Slow consumers and downstream systems are common causes
- Adding consumers only helps if partitions are available
- Recovery time is often more useful than raw lag count
- Most lag issues are application problems rather than Kafka problems