pyspark  real-time  near-real-time

Why is Spark Streaming called near real time?


I know that Spark Streaming uses micro-batches to process data, but in some cases the processing finishes in less than a second. My question is: can't it be called pure real-time processing rather than near-real-time processing in that scenario?


Solution

  • I'd say we can only talk about real time for metrics, alerts and optimization when data is gathered and pushed directly to a dashboard or system, without any kind of ETL process. The purpose of real time is, mainly, speed.

    Whenever a process works in batches, for example to extract historical trends or benchmarks, it is not real time even if it takes less than a second. It is close to real time, which is why people talk about near real time.

    So, to answer your question: I'd say no. It is near real time because you are batching and then processing.

    I hope it helps.

    Juan
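
To make the batching point concrete, here is a rough sketch in plain Python (not Spark code). It assumes a fixed batch interval, as in the classic micro-batch model, and a constant processing time; the function names and numbers are made up for illustration. Each event has to wait for its batch to close before any processing starts, so the delay an event sees is the wait plus the processing time, even when the processing itself is fast.

```python
import math


def micro_batch_latencies(arrival_times, batch_interval, processing_time):
    """Latency per event when events are grouped into fixed-interval batches.

    Simplified model: a batch closes at the next interval boundary after the
    event arrives, and processing the batch takes a constant processing_time.
    """
    latencies = []
    for t in arrival_times:
        # The batch that contains t closes at the next interval boundary.
        batch_close = (math.floor(t / batch_interval) + 1) * batch_interval
        finished = batch_close + processing_time
        latencies.append(finished - t)
    return latencies


def per_event_latencies(arrival_times, processing_time):
    """Latency per event when each event is processed as soon as it arrives."""
    return [processing_time for _ in arrival_times]


# Hypothetical numbers: four events in one second, 1 s batches, 0.5 s processing.
arrivals = [0.0, 0.25, 0.5, 0.75]
batch = micro_batch_latencies(arrivals, batch_interval=1.0, processing_time=0.5)
single = per_event_latencies(arrivals, processing_time=0.5)

for t, b, s in zip(arrivals, batch, single):
    print(f"arrived at {t:.2f}s  micro-batch latency {b:.2f}s  per-event latency {s:.2f}s")
```

In this toy model every event in the batch sees more delay than the processing time alone, and the earliest event in the batch waits the longest. A real job also has scheduling and queueing overhead that this sketch ignores.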