I was following the book "Kafka: The Definitive Guide" First Edition to understand when log segments are deleted by the broker.
From the text I understood that a segment does not become eligible for deletion until it is closed, and that a segment is closed only once it reaches log.segment.bytes (assuming log.segment.ms is not set). Once a segment becomes eligible for deletion, the log.retention.ms policy then determines when it is actually deleted.
However, this seems to contradict the behaviour I see in our production cluster (Kafka 2.5): log segments are deleted as soon as log.retention.ms is satisfied, even when the segment is smaller than log.segment.bytes.
[2020-12-24 15:51:17,808] INFO [Log partition=Topic-2, dir=/Folder/Kafka_data/kafka] Found deletable segments with base offsets [165828] due to retention time 604800000ms breach (kafka.log.Log)
[2020-12-24 15:51:17,808] INFO [Log partition=Topic-2, dir=/Folder/Kafka_data/kafka] Scheduling segments for deletion List(LogSegment(baseOffset=165828, size=895454171, lastModifiedTime=1608220234000, largestTime=1608220234478)) (kafka.log.Log)
The segment is still smaller than 1 GB (the log.segment.bytes default), yet it was deleted.
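For reference, this is roughly how I inspect the effective settings on the topic with the Java AdminClient; a minimal sketch, where the bootstrap address and the topic name "my-topic" are placeholders for our actual cluster:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class DescribeTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get().get(topic);

            // Print the settings that govern segment rolling and retention,
            // plus where each value comes from (topic override, broker default, ...).
            for (String name : new String[] {"retention.ms", "retention.bytes",
                                             "segment.bytes", "segment.ms"}) {
                System.out.printf("%s = %s (%s)%n",
                        name, config.get(name).value(), config.get(name).source());
            }
        }
    }
}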
The book mentions that the current Kafka version at the time of writing was 0.9.0.1. So was this behaviour changed in a later Kafka version? (I could not find any specific mention of such a change in the Kafka docs.) Below is the snippet from the book.
Broker Configs: log.retention.ms and log.retention.bytes
The most common configuration for how long the Kafka broker will retain messages (actually, “log segments”) is by time (in ms), specified using the log.retention.ms parameter (default: 1 week). If set to -1, no time limit is applied.

Another way to expire messages is based on the total number of bytes retained. This value is set using the log.retention.bytes parameter, and it is applied per partition; its default value is -1, which allows for infinite retention. Because the limit is per partition, if you have a topic with 8 partitions and log.retention.bytes is set to 1 GB, the amount of data retained for the topic will be at most 8 GB. If you have specified both log.retention.bytes and log.retention.ms, messages may be removed when either criterion is met.
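(As an aside, this is not from the book: a minimal sketch of how such retention overrides could be applied per topic with the Java AdminClient; the bootstrap address, topic name and values are placeholders.)

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder

            // Keep data for 7 days OR until each partition holds ~1 GB,
            // whichever limit is reached first.
            AlterConfigOp setRetentionMs = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            AlterConfigOp setRetentionBytes = new AlterConfigOp(
                    new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Collections.singletonMap(
                    topic, Arrays.asList(setRetentionMs, setRetentionBytes)))
                 .all().get();
        }
    }
}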
Broker Configs: log.segment.bytes and log.roll.ms
As messages are produced to the Kafka broker, they are appended to the current log segment for the partition. Once the log segment has reached the size specified by the log.segment.bytes parameter (default: 1 GB), the log segment is closed and a new one is opened. Only once a log segment has been closed can it be considered for expiration (by log.retention.ms or log.retention.bytes).

Another way to control when log segments are closed is the log.roll.ms parameter (default: 1 week), which specifies the amount of time after which a log segment should be closed. Kafka will close a log segment either when the size limit is reached or when the time limit is reached, whichever comes first.
A smaller log-segment size means that files must be closed and allocated more often, which reduces the overall efficiency of disk writes. Adjusting the segment size can be important if topics have a low produce rate. For example, if a topic receives only 100 megabytes per day of messages and log.segment.bytes is set to the default, it will take 10 days to fill one segment. As messages cannot be expired until the log segment is closed, if log.retention.ms is set to 1 week there may actually be up to 17 days of messages retained until the closed segment expires. This is because once the log segment is closed with the current 10 days of messages, that segment must be retained for 7 days before it expires under the time policy.
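(Again, not from the book: a quick sanity check of that 17-day figure, using the numbers from the example and treating 1 GB as 1,000 MB to keep the arithmetic round.)

public class SegmentRetentionMath {
    public static void main(String[] args) {
        double produceMbPerDay = 100;    // topic produce rate from the example
        double segmentMb       = 1_000;  // log.segment.bytes, treated as 1,000 MB
        int    retentionDays   = 7;      // log.retention.ms expressed in days

        double daysToFillSegment = segmentMb / produceMbPerDay;       // 10 days until the segment closes
        double worstCaseDays     = daysToFillSegment + retentionDays; // oldest data can be ~17 days old

        System.out.printf("Segment closes after ~%.0f days; oldest messages may be "
                + "up to ~%.0f days old when the segment finally expires.%n",
                daysToFillSegment, worstCaseDays);
    }
}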
Note on Topic Configs: Both the retention limits and the segment roll-over behavior can also be overridden by topic-level properties. The names of these topic properties are slightly different: retention.ms, retention.bytes, segment.bytes, and segment.ms.