I have a very basic Python Dataflow job that reads some data from Pub/Sub, applies a FixedWindow and writes to Google Cloud Storage.
transformed = ...
transformed | beam.io.WriteToText(known_args.output)
The output is written to the location specified in --output, but it only gets as far as the temporary staging directory, i.e.
gs://MY_BUCKET/MY_DIR/beam-temp-2a5c0e1eec1c11e8b98342010a800004/...some_UUID...
The file never gets placed into the correctly named location with the sharding template.
Tested on both the local runner and the Dataflow runner.
On further testing, I noticed that the streaming_wordcount example has the same issue, whereas the standard wordcount example writes fine. Perhaps the issue is to do with windowing, or with reading from Pub/Sub?
It appears WriteToText is not compatible with the streaming source of PubSub. There are likely workarounds, or the Java version may be compatible, but I have opted to use a different solution altogether.
fileio.WriteToFiles. – Lanie