Let's write a simple wordcount job
DataStream<Tuple2<String, Integer>> counts =
text.flatMap(new Tokenizer())
.keyBy(value -> value.f0)
.sum(1);
(source and other details are irrilevant) Suppose into the pipeline arrives the string
"the cat is on the table"
Result is:
<the - 1>
<cat - 1>
<is - 1>
<on - 1>
<the - 2>
<table - 1>
The only word found twice is "the".
It seems that sum()
function is stateful, mainteing at least the last <word - count> tuple updates when a new tuple <word, 1> arrives (obviosly partitioned by word value).
If it is true, and checkpointing is enabled, is this "state" saved into checkpoint and recovered in case of failures?
Yes it is.
Source code
uses an AggregateFunction
which is stateful