
Is Flink KeyedStream sum() function stateful?


Let's write a simple word count job:

DataStream<Tuple2<String, Integer>> counts =
        text.flatMap(new Tokenizer())
                .keyBy(value -> value.f0)
                .sum(1);

(The source and other details are irrelevant.) Suppose the following string arrives in the pipeline:

"the cat is on the table"

The result is:

<the - 1>
<cat - 1>
<is - 1>
<on - 1>
<the - 2>
<table - 1>

The only word that appears twice is "the". It seems that the sum() function is stateful: it keeps at least the last <word - count> tuple and updates it when a new <word, 1> tuple arrives (partitioned by word, of course).
If that is true and checkpointing is enabled, is this "state" saved in the checkpoint and recovered after a failure?


Solution

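  • Yes, it is.

    sum() is implemented by SumAggregator:

    https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/aggregation/SumAggregator.java

    SumAggregator is a ReduceFunction (via AggregationFunction). The function object does not hold the running total itself: Flink applies it as a rolling reduce and keeps the current aggregate for each key in keyed state. The AggregateFunction interface works on the same principle, with its accumulator kept in state by the operator that runs it:

    https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/common/functions/AggregateFunction.html

    Keyed state is part of every checkpoint, so with checkpointing enabled the per-word sums are saved and restored after a failure.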
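
To make the behaviour concrete, here is a minimal plain-Java sketch of a rolling per-key sum. It is not Flink code and not how Flink implements sum(): the class RollingSumSketch and its methods process, snapshot, restore and run are hypothetical names, and a HashMap stands in for keyed state. It only illustrates why "the" is emitted with 1 and then 2, and what restoring the state from a checkpoint means for the counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation only; all names here are hypothetical.
public class RollingSumSketch {

    // Stand-in for keyed state: one running count per word.
    private Map<String, Integer> state = new HashMap<>();

    // Adds 1 to the count for this word and emits the updated tuple,
    // so every input record produces exactly one output record.
    public String process(String word) {
        int updated = state.merge(word, 1, Integer::sum);
        return "<" + word + " - " + updated + ">";
    }

    // Stand-in for a checkpoint: a copy of the current per-key values.
    public Map<String, Integer> snapshot() {
        return new HashMap<>(state);
    }

    // Stand-in for recovery: replace the state with a saved copy.
    public void restore(Map<String, Integer> saved) {
        state = new HashMap<>(saved);
    }

    // Tokenizes a line and feeds each word through a fresh instance.
    public static List<String> run(String line) {
        RollingSumSketch op = new RollingSumSketch();
        List<String> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(op.process(word));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the six tuples from the question, including <the - 2>.
        run("the cat is on the table").forEach(System.out::println);

        // Checkpoint after the first "the", then simulate a failure.
        RollingSumSketch op = new RollingSumSketch();
        op.process("the");
        Map<String, Integer> checkpoint = op.snapshot();
        op.process("the"); // processed after the checkpoint, lost on failure

        // A recovered instance resumes from the saved count of 1.
        // Assuming the source replays the lost record, the count is 2 again.
        RollingSumSketch recovered = new RollingSumSketch();
        recovered.restore(checkpoint);
        System.out.println(recovered.process("the"));
    }
}
```

In a real job the map, the snapshot and the restore are all handled by Flink's state backend and checkpointing mechanism; nothing in the user code has to manage them.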