Consumer lag in Kafka

Table of contents

Introduction
Why Lag Exists
When Is Lag A Problem?
How To Investigate Consumer Lag
Common Causes
Metrics To Monitor
Lag Recovery Time
Quick Troubleshooting Checklist
Key Takeaways

Introduction

Consumer lag is the difference between the latest offset in a partition (the log end offset) and the offset committed by a consumer group for that partition.

Lag = Latest Offset - Consumer Offset

Example:

Latest Offset    = 1000
Consumer Offset  = 800
Lag = 200

The consumer is 200 messages behind the producer.
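As a minimal sketch of the formula above, lag can be computed per partition and then summed for the group. The function name `partition_lag` and the dictionary inputs are hypothetical, chosen only for illustration. Treating a missing committed offset as 0 is a simplifying assumption.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return lag per partition: latest (log end) offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts as fully unread.
        # Real behaviour depends on the consumer's auto.offset.reset setting.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero because the two offsets are sampled at different moments.
        lag[partition] = max(end_offset - committed, 0)
    return lag


lag = partition_lag({0: 1000}, {0: 800})
print(lag)                 # {0: 200}
print(sum(lag.values()))   # 200
```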

Why Lag Exists

Producers and consumers run independently.

Producer ---> Kafka ---> Consumer

If the producer writes faster than the consumer can process (Producer Rate > Consumer Rate), lag starts increasing. Some amount of lag is normal because Kafka is designed to decouple producers and consumers.

When Is Lag A Problem?

Lag itself is not a problem. Many systems intentionally operate with some lag, such as:

Analytics pipelines
Batch processing systems
Report generation workloads

The real concern is when consumers are unable to catch up for an extended period. If lag keeps increasing over time, a bottleneck exists somewhere in the system, and the following condition holds for a sustained period rather than for a short burst:

Producer Rate > Consumer Processing Rate
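One way to tell a temporary spike from sustained growth is to look at several lag readings in a row. The sketch below assumes the readings are taken at a fixed interval; the function name `lag_is_growing` and the `min_points` parameter are hypothetical, and a production alert would likely tolerate some noise instead of requiring strictly increasing values.

```python
def lag_is_growing(samples, min_points=3):
    """Return True if every lag reading is larger than the one before it.

    `samples` is a list of total-lag readings taken at a fixed interval.
    """
    if len(samples) < min_points:
        return False  # too few readings to call it a trend
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


print(lag_is_growing([200, 450, 900, 1600]))  # True: consumers are falling behind
print(lag_is_growing([200, 450, 300, 120]))   # False: a spike that is draining
```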

How To Investigate Consumer Lag

The first step is to determine whether the issue originates from Kafka, the consumer application, or a downstream dependency such as a database or external API.

Useful command:

kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--group order-consumer \
--describe

Example output (abbreviated to the columns relevant for lag):

TOPIC     PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders    0         1000           5000           4000
orders    1         4990           5000           10
orders    2         5000           5000           0

Notice that only partition 0 has significant lag. This often indicates:

Data skew
A hot partition
A slow consumer instance

It does not necessarily mean the entire consumer group is unhealthy. Always inspect lag at the partition level before looking at total lag.
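The partition-level check can be scripted. The sketch below parses the example output shown above and lists the partitions whose lag exceeds a threshold. It assumes whitespace-separated columns with numeric LAG values; the helper names `parse_lag` and `lagging_partitions` are hypothetical, and real tool output may need extra handling (for example, rows without a committed offset).

```python
def parse_lag(describe_output):
    """Parse TOPIC/PARTITION/.../LAG rows into {(topic, partition): lag}."""
    lines = [line.split() for line in describe_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Locate columns by name so extra columns in the header do not matter.
    topic_i = header.index("TOPIC")
    part_i = header.index("PARTITION")
    lag_i = header.index("LAG")
    return {(row[topic_i], int(row[part_i])): int(row[lag_i]) for row in rows}


def lagging_partitions(lags, threshold):
    """Return the partitions whose lag exceeds `threshold`, worst first."""
    over = [(key, lag) for key, lag in lags.items() if lag > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


OUTPUT = """
TOPIC     PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
orders    0         1000           5000           4000
orders    1         4990           5000           10
orders    2         5000           5000           0
"""

lags = parse_lag(OUTPUT)
print(sum(lags.values()))               # 4010 in total
print(lagging_partitions(lags, 1000))   # [(('orders', 0), 4000)]
```

The total of 4010 hides the fact that almost all of it sits on one partition, which is why the per-partition view comes first.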

Common Causes

1. Consumer Processing Is Slow

A consumer rarely just reads messages. It usually performs additional work such as calling APIs, writing to databases, or invoking other services.

Example:

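// Each message triggers two blocking calls before the next one is handled.
function processMessage(message) {
    callExternalApi(message);   // network round trip to another service
    saveToDatabase(message);    // synchronous database write
}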

If the API or database becomes slow, consumer throughput drops and lag starts accumulating.

Fixes:

Optimize processing logic
Use batching where possible
Reduce blocking operations
Move expensive work off the polling thread and process it asynchronously
Scale downstream dependencies
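To see why batching helps, it can be useful to estimate the throughput ceiling of a single consumer thread. The sketch below is a back-of-the-envelope model, not a benchmark: the function `max_throughput` and the latency figures are hypothetical, and it assumes processing is strictly sequential.

```python
def max_throughput(per_message_ms, per_batch_ms=0.0, batch_size=1):
    """Estimate messages/second for one single-threaded consumer.

    per_message_ms: work done for every message (e.g. an API call).
    per_batch_ms:   work done once per batch (e.g. one database round trip).
    """
    batch_time_ms = per_message_ms * batch_size + per_batch_ms
    return batch_size * 1000.0 / batch_time_ms


# Hypothetical numbers: 5 ms of per-message work and a 20 ms database write.
print(max_throughput(5, per_batch_ms=20, batch_size=1))    # 40.0 msg/sec
print(max_throughput(5, per_batch_ms=20, batch_size=100))  # about 192.3 msg/sec
```

Under these assumed numbers, amortizing the database write over 100 messages raises the ceiling almost fivefold, after which the per-message work dominates.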

2. Too Few Consumers

Suppose a topic has 12 partitions but only 2 consumers. Each consumer is responsible for processing 6 partitions.

Fixes:

Increase consumer count
Ensure Consumers <= Partitions

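Adding more consumers than partitions to the same group provides no benefit. Partitions are the unit of parallelism in Kafka, so each partition is read by at most one consumer in the group and the extra consumers sit idle.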
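The effect of consumer count on partition ownership can be illustrated with a simplified round-robin assignment. This is a sketch for intuition only; `assign_partitions` is a hypothetical helper and does not reproduce the behaviour of any specific Kafka assignor.

```python
def assign_partitions(num_partitions, num_consumers):
    """Spread partitions over consumers round-robin; returns one list per consumer."""
    assignment = [[] for _ in range(num_consumers)]
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


print([len(p) for p in assign_partitions(12, 2)])   # [6, 6]
idle = [p for p in assign_partitions(12, 15) if not p]
print(len(idle))                                     # 3 consumers receive nothing
```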

3. Rebalancing

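With eager rebalancing (used by assignors such as the range assignor), all consumers in the group stop processing messages until the rebalance completes. Frequent rebalancing can therefore cause repeated lag spikes.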

Common triggers:

Consumer crash
Rolling deployment
Heartbeat timeout
New consumer joining the group
Consumer leaving the group

Symptoms:

Lag spike
Consumer group state = PreparingRebalance or CompletingRebalance

Fixes:

Avoid unnecessary deployments
Tune heartbeat and poll settings (session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms)
Investigate unstable consumers
Use CooperativeStickyAssignor
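As an illustration of the settings involved, the sketch below shows a consumer configuration as a plain dictionary. The key names follow the librdkafka / confluent-kafka style, which is an assumption about the client in use; the Java client expects the assignor's full class name for `partition.assignment.strategy`. The values are examples, not recommendations, and `heartbeat_settings_ok` is a hypothetical helper.

```python
# Assumption: librdkafka-style keys, as used by the confluent-kafka Python client.
consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-consumer",
    # Cooperative rebalancing revokes only the partitions that must move.
    "partition.assignment.strategy": "cooperative-sticky",
    "session.timeout.ms": 45000,     # silence longer than this evicts the consumer
    "heartbeat.interval.ms": 15000,  # how often the consumer pings the coordinator
    "max.poll.interval.ms": 300000,  # longest allowed gap between two polls
}


def heartbeat_settings_ok(config):
    """Check the common guidance: heartbeat interval <= 1/3 of session timeout."""
    return config["heartbeat.interval.ms"] * 3 <= config["session.timeout.ms"]


print(heartbeat_settings_ok(consumer_config))  # True
```

If message processing regularly takes longer than `max.poll.interval.ms`, the consumer is removed from the group even though its heartbeats are healthy, which shows up as a rebalance.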

4. Data Skew

A poor partition key can cause uneven traffic distribution across partitions.

Example:

Partition 0 -> 50000 lag
Partition 1 -> 0 lag
Partition 2 -> 0 lag
Partition 3 -> 0 lag

One partition becomes overloaded while the others remain mostly idle.

Fixes:

Revisit the partition key
Increase key cardinality so that traffic hashes to more partitions
Avoid hot keys
Distribute traffic more evenly
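The effect of a hot key can be simulated by hashing keys to partitions. The sketch below uses md5 as a stand-in for a stable hash; it is not Kafka's default partitioner, which uses murmur2, so the exact partition numbers will differ. The key names and the `#` suffix scheme for salting are hypothetical.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition via a stable hash (md5 here, not Kafka's murmur2)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def distribution(keys, num_partitions):
    """Count how many messages land on each partition."""
    return Counter(partition_for(key, num_partitions) for key in keys)


# 1000 messages that all share one hot key land on a single partition.
hot = ["customer-42"] * 1000
print(distribution(hot, 4))

# Salting the hot key with a small suffix spreads the load across partitions.
salted = [f"customer-42#{i % 16}" for i in range(1000)]
print(distribution(salted, 4))
```

Salting trades away strict per-key ordering, so it only fits workloads where messages for the same key do not need to be processed in order.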

Metrics To Monitor

Consumer Lag

Represents how many messages the consumer group is behind.

Example:

Lag = 10000 messages

Consumer Throughput

Represents how many messages the consumer can process per second.

Example:

2000 msg/sec

Producer Throughput

Represents how many messages are produced per second.

Example:

5000 msg/sec

If Producer Throughput > Consumer Throughput for a sustained period, lag will continue increasing. With the example figures above, the gap is 3000 msg/sec.
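The relationship between the two throughput metrics and lag can be written as a one-line calculation. The function name `lag_growth` is hypothetical, and the sketch assumes both rates stay constant over the interval.

```python
def lag_growth(producer_rate, consumer_rate, seconds):
    """Messages added to lag over `seconds`, given steady rates in msg/sec.

    A negative result means the consumer is draining existing lag.
    """
    return (producer_rate - consumer_rate) * seconds


print(lag_growth(5000, 2000, 60))   # 180000 more messages of lag after one minute
print(lag_growth(2000, 5000, 60))   # -180000: lag shrinks by the same amount
```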

Rebalance Count

Frequent rebalances often indicate consumer instability and can lead to lag spikes.

Lag Recovery Time

Raw lag numbers can be misleading.

Consider a lag of 10000 messages.

If the consumer processes 10000 messages per second, recovery takes approximately 1 second. If the consumer processes only 100 messages per second, recovery takes approximately 100 seconds.

Both scenarios have the same lag but vastly different operational impact.

A useful approximation is:

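Recovery Time = Lag / Consumer Throughput

This assumes no new messages arrive while the consumer catches up. If the producer is still writing, divide by the spare capacity instead:

Recovery Time = Lag / (Consumer Throughput - Producer Throughput)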

Monitoring recovery time often provides more insight than monitoring lag alone.
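The recovery-time estimate is easy to compute alongside the lag metric. The sketch below also handles the case where the producer is still writing, by dividing by the consumer's spare capacity. The function name `recovery_time` and the choice to return infinity when the consumer cannot catch up are illustrative assumptions.

```python
import math


def recovery_time(lag, consumer_rate, producer_rate=0.0):
    """Seconds to drain `lag`, or math.inf if the consumer cannot catch up."""
    drain_rate = consumer_rate - producer_rate
    if drain_rate <= 0:
        return math.inf
    return lag / drain_rate


print(recovery_time(10000, 10000))        # 1.0
print(recovery_time(10000, 100))          # 100.0
print(recovery_time(10000, 2000, 1500))   # 20.0 (only 500 msg/sec of headroom)
print(recovery_time(10000, 2000, 5000))   # inf
```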

Quick Troubleshooting Checklist

When lag starts increasing, check the following:

Is the consumer alive?
Is a rebalance occurring?
Is the database responding slowly?
Is an external API causing delays?
Is there data skew?
Are there enough consumers?
Are there enough partitions?
Has producer traffic increased suddenly?

In practice, consumer lag incidents are more often caused by slow consumers or downstream dependencies than by Kafka itself.

Key Takeaways

Lag = Latest Offset - Consumer Offset
Some lag is normal and expected
Continuously growing lag indicates a bottleneck
Always investigate lag at the partition level
Slow consumers and downstream systems are common causes
Adding consumers only helps while the group has fewer consumers than partitions
Recovery time is often more useful than raw lag count
Most lag issues are application problems rather than Kafka problems