apache-spark, pyspark, spark-streaming, spark-structured-streaming

How does a Spark Structured Streaming job handle a stream-static DataFrame join?


I have a Spark Structured Streaming job that reads a mapping table from Cassandra and Delta Lake and joins it with the streaming DataFrame. I would like to understand the exact mechanism here: does Spark hit these data sources (Cassandra and Delta Lake) for every micro-batch cycle? If so, why does the Spark web UI show these tables being read only once? Please help me understand this. Thanks in advance.


Solution

  • "Does spark hit these data sources(cassandra and deltalake) for every cycle of microbatch?"

    According to the book "Learning Spark, 2nd Edition" (O'Reilly), in a stream-static join the static DataFrame is read in every micro-batch.
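
    A minimal sketch of such a stream-static join, using the built-in `rate` source in place of the real streaming input and an in-memory DataFrame in place of the Cassandra/Delta mapping table (all names here are illustrative, not from the question):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("stream-static-join-sketch")
             .getOrCreate())

    # Hypothetical static mapping table; stands in for the Cassandra/Delta read.
    static_df = spark.createDataFrame(
        [(1, "EMEA"), (2, "APAC")], ["region_id", "region_name"])

    # The built-in rate source plays the role of the real streaming input.
    stream_df = (spark.readStream
                 .format("rate")
                 .option("rowsPerSecond", 5)
                 .load()
                 .withColumn("region_id", (col("value") % 2) + 1))

    # Stream-static join: Spark plans a scan of static_df for each micro-batch.
    joined = stream_df.join(static_df, "region_id")

    # The result is still a streaming DataFrame.
    print(joined.isStreaming)
    ```

    Even though the join is declared once, the physical plan that executes it is instantiated per micro-batch, which is why the static side is (logically) re-read each cycle.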

    To be more precise, I find the following section in the book quite helpful:

    A stream-static join assumes that the static side does not change at all, or changes only slowly. If you plan to join two rapidly changing data sources, you need to switch to a stream-stream join instead.

    If you want to refresh the "static" data regularly, you may check my answer to How to refresh static dataframe periodically.
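
    One common pattern for such a refresh (sketched here under assumptions: a Parquet-backed mapping table standing in for Delta/Cassandra, and hypothetical names throughout) is to move the static read inside `foreachBatch`, so each micro-batch picks up the latest version of the table:

    ```python
    import os
    import tempfile

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("refresh-static-sketch")
             .getOrCreate())

    # Hypothetical mapping table stored as Parquet (stands in for Delta/Cassandra).
    mapping_path = os.path.join(tempfile.mkdtemp(), "mapping")
    spark.createDataFrame(
        [(1, "EMEA"), (2, "APAC")], ["region_id", "region_name"]
    ).write.parquet(mapping_path)

    def enrich_batch(batch_df, batch_id):
        """Re-read the mapping on every micro-batch so updates are picked up."""
        mapping_df = spark.read.parquet(mapping_path)
        return batch_df.join(mapping_df, "region_id")

    # In a real job this would be wired up via:
    #   stream_df.writeStream.foreachBatch(enrich_batch).start()
    # Here the function is called directly with a plain batch DataFrame
    # to show the effect.
    batch = spark.createDataFrame([(1, "order-42")], ["region_id", "order"])
    print(enrich_batch(batch, 0).collect())
    ```

    In a production job the `foreachBatch` function would write the joined output to a sink rather than return it (return values are ignored by `foreachBatch`); returning it here just keeps the sketch inspectable.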