I'm having a problem using kstream joins. What i do is from one topic i seperate 3 different types of messages to new streams. Then do one innerjoin with two of the streams which creates another stream, finally i do a last leftjoin with the new stream and the last remaining stream.
The joins have a window time of 30 seconds.
This is done to filter out some of the messages which are overridden by others.
Im running this application on kubernetes and the disk space for the pods are growing indefinitely until the pod crashes.
I've realized that this is because of the joins store data locally in the tmp/kafka-streams directory.
The directories are called: KSTREAM-JOINTHIS... KSTREAM-OUTEROTHER..
Which stores sst files from rocksDb and these grow indefinitely.
My understanding is since im using a window time of 30 seconds these should be flushed out after the certain time but is not.
I also changed the WINDOW_STORE_CHANGE_LOG_ADDITIONAL_RETENTION_MS_CONFIG to 10 mins to see if that makes a change which is not the case.
I need to understand how this can be changed.
The window size does not determine your storage requirement, but the join's grace period. To handle out-of-order records, data is stored longer than the window size. In newer version, it's required to always specify the grace period via JoinWindows. ofTimeDifferenceAndGrace(...)
. In older versions, you can set grace period via JoinWindows.of(...).grace(...)
-- if not set, it defaults to 24 hours.
The config WINDOW_STORE_CHANGE_LOG_ADDITIONAL_RETENTION_MS_CONFIG
configures how long data is store in the cluster. Thus, you might want to reduce it, too, but it does not help to reduce client side storage requirements.