Google Cloud Platform: accumulate data from Pub/Sub to files in Cloud Storage without Dataflow

I'm trying to figure out whether there is a service on GCP that can consume a stream from Pub/Sub and batch the accumulated data into files in Cloud Storage (e.g. every X minutes). I know this can be implemented with Dataflow, but I'm looking for a more "out of the box" solution, if one exists.

As an example, this is something one can do with AWS Kinesis Firehose purely at the configuration level: one can tell AWS to dump whatever has accumulated in the stream to files on S3, either periodically or when the accumulated data reaches some size.

The reason is that, when no stream processing is required and I only need to accumulate data, I would like to minimize the additional costs of:

  • building a custom piece of software, even a simple one, if it can be avoided completely
  • consuming additional compute resources to execute it

To avoid confusion: I'm not looking for a free-of-charge solution, but for the optimal one.

Monocoque asked 19/10, 2018 at 20:36
I'm imagining messages arriving via Pub/Sub over time and that you want to batch-write those to Cloud Storage objects. In your question, you spoke of aggregated content. Can you expand on the nature of the aggregation of the messages received via Pub/Sub before they are manifested as object data? – Alienist
I'm sorry if my description is not clear. I do mean accumulation, not aggregation: simply collecting whatever is in the stream and batching it into files. For example, if every message is a line of text, I would expect files to appear in gs://, say every N minutes or every N MiB, containing the lines of text accumulated in the stream before each dump. – Monocoque
Is the Dataflow template for this "out of the box" enough? cloud.google.com/dataflow/docs/templates/… You just fill in some parameters and fire it off. – Currency
Hi @ChrisSainty. Thank you, this is really nice indeed. It certainly solves half of the issue, since the template and its implementation are maintained by GCP itself; still, it has to run on VM instances dedicated to this pipeline. Do you mind posting this as an answer? – Monocoque

Google maintains a set of templates for Dataflow to perform common tasks between their services.

You can use the "Cloud Pub/Sub to GCS Text" template by simply plugging in a few config values: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloudpubsubtogcstext
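For illustration, launching this template from the gcloud CLI might look something like the sketch below. The project, topic, and bucket names are placeholders, and the template path and parameter names (inputTopic, outputDirectory, outputFilenamePrefix, windowDuration) are assumed from the template documentation at the time of writing, so check the linked page for the current set:

    # Launch Google's provided "Cloud Pub/Sub to GCS Text" template as a
    # streaming Dataflow job. Names in UPPERCASE are placeholders.
    gcloud dataflow jobs run pubsub-to-gcs-example \
        --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text \
        --region=us-central1 \
        --parameters="inputTopic=projects/MY_PROJECT/topics/MY_TOPIC,outputDirectory=gs://MY_BUCKET/output/,outputFilenamePrefix=accumulated-,windowDuration=5m"

Here windowDuration would control how often accumulated messages are flushed to a new file, which matches the "every N minutes" behaviour asked about. Note that the job still runs on worker VMs billed to your project, so this removes the development cost rather than the compute cost.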

Currency answered 22/10, 2018 at 18:14