Context
I am reading a file from Google Cloud Storage in Beam using a process that looks something like this:

```python
data = pipeline | beam.Create(['gs://my/file.pkl']) | beam.ParDo(LoadFileDoFn())
```
where `LoadFileDoFn` loads the file and creates a Python list of objects from it, which `ParDo` then returns as a `PCollection`.
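For reference, `LoadFileDoFn` boils down to something like the sketch below (simplified; the use of `FileSystems.open` and `pickle.load` here just stands in for however the file is actually read and parsed):

```python
import pickle

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class LoadFileDoFn(beam.DoFn):
    # Reads a single pickled file from GCS and emits each object it
    # contains as a separate element of the output PCollection.
    def process(self, file_path):
        with FileSystems.open(file_path) as f:
            objects = pickle.load(f)  # a Python list of objects
        for obj in objects:
            yield obj
```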
I know I could probably implement a custom source to achieve something similar, but this answer and Beam's own documentation indicate that this approach of reading via a pseudo-dataset and `ParDo` is not uncommon, and that custom sources may be overkill.
It also works: I get a `PCollection` with the correct number of elements, which I can process as I like. However...
Autoscaling problems
The resulting `PCollection` does not autoscale at all on Cloud Dataflow. I first have to transform it via:

```python
shuffled_data = data | beam.Reshuffle()
```
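Putting the pieces together, the shape of the working pipeline is roughly this (using the `LoadFileDoFn` sketch from above; the transform labels and the default options are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# In practice the options point at the Dataflow runner; defaults shown here.
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    data = (
        pipeline
        | 'CreateFileName' >> beam.Create(['gs://my/file.pkl'])
        | 'LoadFile' >> beam.ParDo(LoadFileDoFn())
    )

    # Without this step everything downstream runs on a single worker;
    # with it, Dataflow distributes the elements and autoscales.
    shuffled_data = data | 'Reshuffle' >> beam.Reshuffle()

    # ... downstream transforms consume shuffled_data ...
```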
I know this answer, which I also linked above, explains pretty much this process, but it doesn't give any insight into why it is necessary. As far as I can see at Beam's very high level of abstraction, I have a `PCollection` with N elements before the shuffle and a similar `PCollection` after it. Why does one scale, but the other not?
The documentation is not very helpful in this case (or in general, but that's another matter). What hidden attribute does the first `PCollection` have, which the second lacks, that prevents it from being distributed across multiple workers?