apache-flinkflink-streamingflink-statefun

Need advice on migrating from Flink DataStream Job to Flink Stateful Functions 3.1


I have a working Flink job built on Flink Data Stream. I want to REWRITE the entire job based on the Flink stateful functions 3.1.

The functions of my current Flink Job are:

  1. Read message from Kafka
  2. Each message is in format a slice of data packets, e.g.(s for slice):
    • s-0, s-1 are for packet 0
    • s-4, s-5, s-6 are for packet 1
  3. The job merges slices into several data packets and then sink packets to HBase
  4. Window functions are applied to deal with disorder of slice arrival

My Objectives

My current plan

I have read the doc and got some ideas. My plans are:

My Questions

I don't feel confident with my plan. Is there anything wrong with my understandings/plan?

Are there any best practice I should refer to?

Update:

windows were used to assemble results

  1. get a slice, inspect its metadata and know it is the last one of the packet
  2. also knows the packet should contains 10 slices
  3. if there are already 10 slices, merge them
  4. if there are not enough slices yet, wait for sometime (e.g. 10 minutes) and then either merge or record packet errors.

I want to get rid of windows during the rewrite, but I don't know how


Solution

  • Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events

    With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.

    In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.

    Doing this with the Statefun API

    You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer: