Apache Beam supports multiple runner backends, including Apache Spark and Apache Flink. I'm familiar with Spark and Flink, and I'm trying to understand the pros and cons of Beam for batch processing.
Looking at the Beam word count example, it feels very similar to the native Spark/Flink equivalents, perhaps with slightly more verbose syntax.
I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far:
- Pro: Abstraction over different execution backends.
- Con: This abstraction comes at the price of having less control over what exactly is executed in Spark/Flink.
Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance?
Note that I'm not asking about differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated, since it is based on Spark 1.X).