apache-spark, apache-flink, apache-beam

What are the benefits of Apache Beam over Spark/Flink for batch processing?


Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing.
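
To make that portability concrete, here is a minimal sketch (using Beam's Java SDK, assuming the Flink runner artifact is on the classpath; the option values are placeholders): the runner is chosen through pipeline options at launch time, not by changing the pipeline code.

    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSelection {
      public static void main(String[] args) {
        // Options come from the command line, e.g. --runner=FlinkRunner or --runner=SparkRunner;
        // the pipeline definition below stays the same for every backend.
        FlinkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
        options.setRunner(FlinkRunner.class); // could equally be left to the --runner flag

        Pipeline p = Pipeline.create(options);
        // ... apply the same transforms here regardless of the chosen runner ...
        p.run().waitUntilFinish();
      }
    }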

Looking at the Beam word count example, it feels very similar to the native Spark/Flink equivalents, maybe with slightly more verbose syntax.
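
For reference, a minimal batch word count in Beam's Java SDK looks roughly like the sketch below; it closely follows Beam's published MinimalWordCount example, with placeholder input and output paths.

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("/tmp/input.txt"))      // placeholder path
         .apply("SplitWords", FlatMapElements
             .into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
         .apply("DropEmpty", Filter.by((String word) -> !word.isEmpty()))
         .apply("CountWords", Count.perElement())
         .apply("FormatResults", MapElements
             .into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("WriteCounts", TextIO.write().to("/tmp/wordcounts"));   // placeholder prefix

        p.run().waitUntilFinish();
      }
    }

Much of the extra verbosity here comes from the explicit into(TypeDescriptors...) calls, which Beam needs because Java lambdas erase type information and coders for intermediate collections must still be inferred.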

I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far:

  • Pro: the abstraction over different execution backends.
  • Con: this abstraction comes at the cost of less control over what exactly is executed on Spark/Flink.

Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance?

Note that I'm not asking for differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated due to Spark 1.X).


Solution

  • There are a few things that Beam adds over many of the existing engines.

    Designing the Beam model to be a useful abstraction over many different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.