apache-beam Questions

3

Solved

I want to understand in which scenarios I should use FlatMap or Map. The documentation did not seem clear to me. I still do not understand in which scenario I should use the transformation of ...
Poult asked 14/8, 2017 at 9:1
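A minimal sketch of the distinction, assuming the Python SDK: Map emits exactly one output per input, while FlatMap returns an iterable whose elements are flattened into the output PCollection.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(["a b", "c"])
    # Map is strictly 1:1 - each input element yields exactly one output.
    lengths = words | "Lengths" >> beam.Map(len)          # -> 3, 1
    # FlatMap is 1:N - the callable returns an iterable, which is flattened,
    # so one input may yield zero, one, or many outputs.
    tokens = words | "Tokens" >> beam.FlatMap(str.split)  # -> "a", "b", "c"
```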

0

I'm building some data streaming pipelines that read from Kafka and write to various sinks using Google Cloud Dataflow. The pipeline looks something like this (simplified): // Example pipeline tha...
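For context, a minimal Python sketch of that shape; the broker address, topic, table, and schema below are placeholders, not details from the question:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    records = (
        p
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["events"])
        # ReadFromKafka yields (key, value) byte pairs; decode the payload.
        | beam.Map(lambda kv: {"payload": kv[1].decode("utf-8")})
    )
    # Fan out: applying several sink transforms to the same PCollection
    # sends every element to each sink.
    records | "ToBigQuery" >> beam.io.WriteToBigQuery(
        "project:dataset.events", schema="payload:STRING")
```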

2

Could somebody please clarify the expected behavior when using save_main_session with custom modules imported in __main__? My Dataflow pipeline imports two non-standard modules - one via requirements....
Lobo asked 12/7, 2018 at 17:21
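The short version of the documented behavior: save_main_session pickles the __main__ namespace so module-level names are available on workers. A minimal sketch (the json import stands in for any module imported at top level):

```python
import json  # a module imported at the top level of __main__

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

def run():
    options = PipelineOptions()
    # Pickle the __main__ namespace so module-level imports such as `json`
    # above are visible inside callables executed on remote workers.
    options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=options) as p:
        p | beam.Create(['{"a": 1}']) | beam.Map(json.loads)

if __name__ == "__main__":
    run()
```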

1

Solved

It's evident that preemptible instances are cheaper than non-preemptible instances. On a daily basis, 400-500 Dataflow jobs run in my organisation's project. Some of these jobs are time-sens...
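Dataflow's answer to preemptible pricing for batch jobs is FlexRS, which runs on a mix of preemptible and regular VMs at a discount in exchange for flexible (possibly delayed) scheduling, so it suits the non-time-sensitive jobs. A sketch with placeholder project, region, and bucket names:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# FlexRS (flexrs_goal) trades scheduling flexibility for a discount; all
# names below are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",
)
```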

1

Solved

We're trying to build a pipeline that takes data from BigQuery and runs it through TensorFlow Transform before training in TensorFlow. The pipeline is up and running, but we're having difficulty with n...

1

Solved

I am trying to implement CDC in Apache Beam, deployed on Google Cloud Dataflow. I have unloaded the master data and the new data, which is expected to come in daily. The join is not working as e...
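One common way to express such a join in the Python SDK is CoGroupByKey over the two keyed collections; a minimal sketch with hypothetical keys and records:

```python
import apache_beam as beam

def merge(element):
    key, groups = element
    daily = list(groups["daily"])
    master = list(groups["master"])
    # A simple CDC upsert: the daily record, if present, supersedes master.
    return key, (daily or master)[-1]

with beam.Pipeline() as p:
    master = p | "Master" >> beam.Create([("k1", {"v": "old"})])
    daily = p | "Daily" >> beam.Create([("k1", {"v": "new"})])
    joined = (
        {"master": master, "daily": daily}
        | beam.CoGroupByKey()  # (key, {"master": [...], "daily": [...]})
        | beam.Map(merge)
    )
```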

2

Solved

I have a very basic Python Dataflow job that reads some data from Pub/Sub, applies a FixedWindow, and writes to Google Cloud Storage: transformed = ... transformed | beam.io.WriteToText(known_args....
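The usual catch here is that the plain WriteToText sink does not handle unbounded input in older Python SDKs; windowing plus fileio.WriteToFiles is one workaround. A sketch with a placeholder subscription and bucket:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(subscription="projects/p/subscriptions/s")
     | beam.Map(lambda b: b.decode("utf-8"))
     | beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     # fileio.WriteToFiles supports windowed, unbounded input, unlike the
     # plain text sink in older SDK versions.
     | fileio.WriteToFiles(path="gs://my-bucket/output/"))
```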

2

Solved

I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retryi...
Helle asked 29/7, 2019 at 10:1
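For reference, the basic table-to-table shape in a recent Python SDK (table names and schema are placeholders); the retry warning itself usually wraps an underlying BigQuery error that surfaces later in the log:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromBigQuery(table="project:source_dataset.source_table")
     | beam.io.WriteToBigQuery(
         "project:dest_dataset.dest_table",
         schema="field:STRING",  # placeholder schema
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```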

1

Solved

I am writing a data pipeline in Apache Beam that reads from Pub/Sub, deserializes the messages into JSONObjects, and passes them to some other pipeline stages. The issue is, when I try to submit my cod...
Seraphina asked 10/12, 2019 at 22:12
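The submission error in this family of questions is typically a serialization problem: parse inside a DoFn so nothing non-serializable is captured at construction time. A Python-flavored sketch of that pattern:

```python
import json
import apache_beam as beam

class ParseJson(beam.DoFn):
    def process(self, message: bytes):
        # Building the parsed object here, on the worker, avoids capturing
        # any non-serializable state when the pipeline graph is constructed.
        yield json.loads(message.decode("utf-8"))

with beam.Pipeline() as p:
    (p
     | beam.Create([b'{"id": 1}'])  # stand-in for a Pub/Sub read
     | beam.ParDo(ParseJson())
     | beam.Map(print))
```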

1

Solved

EDIT: I got this to work using beam.io.WriteToBigQuery with the sink experimental option turned on. I actually had it on, but my issue was that I was trying to "build" the full table reference from two v...
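A sketch of the working pattern described in the edit, assuming a templated pipeline: expose the full project:dataset.table reference as a single runtime parameter rather than assembling it from parts (the option name below is hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # One runtime parameter carrying the complete table reference.
        parser.add_value_provider_argument(
            "--output_table", type=str,
            help="Full reference: project:dataset.table")

options = PipelineOptions().view_as(MyOptions)
with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([{"name": "a"}])
     | beam.io.WriteToBigQuery(options.output_table, schema="name:STRING"))
```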

1

Solved

In my pipeline I use WriteToBigQuery something like this: | beam.io.WriteToBigQuery( 'thijs:thijsset.thijstable', schema=table_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND...

2

Solved

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is based on YYYY/MM/DD. This allows files to be generated i...
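One way to get dated folders in the Python SDK is fileio.WriteToFiles with a destination function; a sketch assuming each element starts with an ISO timestamp (a placeholder scheme):

```python
import apache_beam as beam
from apache_beam.io import fileio

def date_prefix(line):
    # Turn a leading "2019-05-01..." timestamp into a "2019/05/01" prefix.
    return line[:10].replace("-", "/")

with beam.Pipeline() as p:
    (p
     | beam.Create(["2019-05-01T12:00:00,hello"])
     | fileio.WriteToFiles(
         path="gs://my-bucket/output",
         destination=date_prefix,
         # The destination string becomes the file-name prefix, so the
         # slashes in it yield dated subfolders under the output path.
         file_naming=fileio.destination_prefix_naming()))
```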

2

Solved

I have a use case where I read newline-delimited JSON elements stored in Google Cloud Storage and start processing each JSON object. While processing each one, I have to call an external API for doing de-...
Twenty asked 17/11, 2019 at 17:28
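Calling the API once per element is usually the bottleneck; grouping elements with BatchElements amortizes the round trips. A sketch where call_api is a hypothetical stand-in for the external de-identification service:

```python
import json
import apache_beam as beam

def call_api(batch):
    # Hypothetical external call: one request for a whole batch of records.
    return [{**record, "deidentified": True} for record in batch]

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/input.json")  # newline-delimited JSON
     | beam.Map(json.loads)
     | beam.BatchElements(min_batch_size=10, max_batch_size=100)
     | beam.FlatMap(call_api))
```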

1

I need some help, please. I'm trying to use Apache Beam with the RabbitMqIO source (version 2.11.0) and the AfterWatermark.pastEndOfWindow trigger. It seems like RabbitMqIO's watermark doesn't advance a...
Cassiecassil asked 17/4, 2019 at 22:4

1

Solved

Right now I'm only able to grab the RunTime value inside a class using a ParDo. Is there another way to use the runtime parameter, for example in my functions? This is the code I have right now: c...
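The usual pattern is to hand the ValueProvider itself to the DoFn (or function) and call .get() only at execution time; a sketch with a hypothetical --suffix option:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument("--suffix", type=str)

class AddSuffix(beam.DoFn):
    def __init__(self, suffix):
        self.suffix = suffix  # store the ValueProvider, not its value

    def process(self, element):
        # .get() is only valid at execution time, once the runtime
        # parameter has been bound.
        yield element + self.suffix.get()

options = PipelineOptions().view_as(MyOptions)
with beam.Pipeline(options=options) as p:
    p | beam.Create(["a"]) | beam.ParDo(AddSuffix(options.suffix))
```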

3

Solved

I'm very new to GCP and Dataflow. However, I would like to start testing and deploying a few flows harnessing Dataflow on GCP. According to the documentation, everything around Dataflow is imper...

1

Solved

Beam's great power comes from its advanced windowing capabilities, but they're also a bit confusing. Having seen some oddities in local tests (I use RabbitMQ as an input source) where messages were n...
Allonym asked 3/10, 2019 at 14:32

2

I am building a pipeline that reads Avro generic records. To pass GenericRecord between stages I need to register an AvroCoder. The documentation says that if I use a generic record, the schema argument...
Dotson asked 13/12, 2018 at 14:53

1

Solved

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
Sural asked 26/9, 2019 at 6:2

2

Solved

When trying to run a pipeline on the Dataflow service, I specify the staging and temp buckets (in GCS) on the command line. When the program executes, I get a RuntimeException before my pipeline ru...
Alary asked 31/5, 2017 at 17:36

3

The Dataflow pipelines developed by my team suddenly started getting stuck, stopping processing our events. Their worker logs became full of warning messages saying that one specific step got stuck...
Coinstantaneous asked 4/3, 2019 at 19:39

1

Does Apache Beam 2.12.0 support Java 11, or should I still stick with the stable Java 8 SDK for now? I see the site recommends Python 3.5 with Beam 2.12.0 per the documentation, compared to...
Gal asked 28/8, 2019 at 21:9

2

Solved

I want to use Dataflow to move data from Pub/Sub to GCS. Basically, I want Dataflow to accumulate messages for a fixed amount of time (15 minutes, for example), then write that data as text f...

1

I am trying to write a Parquet file as follows in Apache Beam using Snappy compression: records.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema)).to(options.getOutput())); I se...
Lillalillard asked 29/11, 2018 at 16:28

1

I am trying to merge two JSON inputs (this example is from a file, but it will be from a Google Pub/Sub input later) from these: orderID.json: {"orderID":"test1","orderPacked":"Yes","orderSubmitted...
Damiandamiani asked 7/8, 2019 at 5:52
