apache-flinkflink-streamingpyflink

Performance difference between Table- and DataStream-API


Let us assume that I have two operations which I can write easily using both APIs in PyFlink (e.g. a sum of a column over a TumblingWindow). Are there any performance differences when I use the predefined Table-API commands vs manually implementing the count in Python as a ProcessWindowFunction?

To be more precise, I want to compare

table_from_stream \
    .window(Tumble.over(lit(15).minutes)).on(col('time')).alias('w'))
    .groupby(col('w'), col('a'))
    .select(col('w').end, col('a'), col('b').sum)

vs

datastream \
    .key_by('a') \
    .window(TumblingEventTimeWindows.of(Time.minutes(15))) \ 
    .process(MyProcessFunctionThatManuallySums)

Solution

  • The version using the Table API will be more efficient because