apache-spark · bigdata · apache-storm

Aggregations on Realtime Streaming data


Can someone explain how aggregations are performed on real-time streaming data using big data technologies like Storm, Spark, etc.? It seems meaningless to aggregate over streaming data, since the data keeps flowing.


Solution

  • Most streaming frameworks support 'windows', which collect tuples (events) into a window and present them for aggregation. Tumbling windows and sliding windows are widely supported, and windows can be defined by count (number of tuples) or by time.

    You can refer to the links below to get an idea of the windowing concepts:

    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

    You can aggregate tuples over the latest N minutes (or seconds, hours, and so on) via windowing. This may feel like batching, and indeed you could also push the tuples to external storage and run the aggregation with a batch framework.

    Normally, aggregation in a batch framework works more efficiently (aggregation is a batch-oriented operation), but aggregating on the fly in a streaming framework requires no external storage (as long as the window fits in memory) and no additional batch framework.
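    To make the window idea concrete, here is a minimal framework-agnostic sketch in plain Python (not the actual Storm or Spark API) of count-based tumbling and sliding windows feeding an aggregation; the function names and window sizes are illustrative:

    ```python
    from collections import deque

    def tumbling_windows(stream, size):
        """Group the stream into non-overlapping windows of `size` tuples."""
        window = []
        for event in stream:
            window.append(event)
            if len(window) == size:
                yield list(window)   # emit the full window for aggregation
                window.clear()       # tumbling: start the next window fresh

    def sliding_windows(stream, size, slide):
        """Emit a window of the latest `size` tuples every `slide` tuples."""
        window = deque(maxlen=size)  # old tuples fall out automatically
        for i, event in enumerate(stream, start=1):
            window.append(event)
            if i >= size and (i - size) % slide == 0:
                yield list(window)

    events = [3, 1, 4, 1, 5, 9, 2, 6]

    # Tumbling windows of 4 tuples: [3,1,4,1], [5,9,2,6]
    print([sum(w) for w in tumbling_windows(events, 4)])      # → [9, 22]

    # Sliding windows of 4 tuples, advancing by 2: overlapping windows
    print([sum(w) for w in sliding_windows(events, 4, 2)])    # → [9, 19, 22]
    ```

    Real frameworks add time-based windows, watermarks for late data, and fault-tolerant state, but the aggregation itself is the same: compute over the current window's contents rather than over the unbounded stream.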