apache-sparkpysparkdatabricksspark-streamingspark-structured-streaming

Why to use Spark Structured streaming AvailableNow and not just normal batch dataframes?


I'm learning about spark Structured Streaming and things are a little bit nebulous yet...One of the thing that I didn't get is the advantage of using batch mode (AvailableNow=True) over just plain old spark dataframes, without streaming. Could you guys help me understand?


Solution

  • Let's say you are joining pairs of events that arrive within an hour of each other. You can run a batch job every day that covers the last 24 hours, midnight to midnight. But there may be some events at the end of the 24-hour period, whose pair event did not arrive yet before midnight. And there may be some events at the beginning of the 24-hour period whose pair event was in the previous 24-hour period. You will have to add some logic to account for that.

    If you read the events as a stream, you can just collect the pairs of events continually, without worrying about crossing any 24-hour period boundary. You need to keep a maximum of one hour's worth of events in "memory", to combine them as they arrive, so there is a one hour lag before the event pairs get written as they wait for each other. I put "memory" in quotes because they can wait in actual memory, or they can wait in the persisted streaming checkpoint files.

    You can run this streaming job 24x7, but you might not see a lot of traffic and feel this is a waste of resources. So, instead, you can start a streaming job every hour. The job wakes up, reads the incoming event stream for "available now", and incrementally catches up on the work it left off last time. This is when the state will sleep in the streaming checkpoint files between the job invocations. The checkpoint keeps track of the incremental progress.

    This is a key advantage to AvailableNow, IMHO -- the ability to make incremental progress without a constant streaming job running 24x7.

    And the advantage of streaming vs. traditional batch, in this case, is that you avoid the complexity of crossing the 24-hour demarcations (in the case of a batch job that runs once every 24-hours).