apache-beam Questions

3

Solved

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a...
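
A minimal sketch of that pattern (a psycopg2 query inside a custom ParDo feeding WriteToBigQuery), assuming placeholder connection settings, query, table, and schema rather than the asker's actual values:

import apache_beam as beam


class ReadFromCloudSQL(beam.DoFn):
    """Runs a SQL query with psycopg2 and emits one dict per row."""

    def __init__(self, dsn):
        self.dsn = dsn

    def setup(self):
        import psycopg2  # imported on the worker
        self.conn = psycopg2.connect(self.dsn)

    def process(self, query):
        with self.conn.cursor() as cur:
            cur.execute(query)
            columns = [desc[0] for desc in cur.description]
            for row in cur:
                yield dict(zip(columns, row))

    def teardown(self):
        self.conn.close()


with beam.Pipeline() as p:
    (p
     | beam.Create(['SELECT id, name FROM users'])  # one query per element
     | beam.ParDo(ReadFromCloudSQL(dsn='host=... dbname=... user=... password=...'))
     | beam.io.WriteToBigQuery(
         'my-project:my_dataset.users',
         schema='id:INTEGER,name:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))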

1

Solved

We have a Beam data pipeline running on GCP Dataflow, written using both Python and Java. In the beginning, we had some simple and straightforward Python Beam jobs that worked very well. So most recent...
Bronchial asked 20/1, 2022 at 15:50

3

I was testing my Dataflow pipeline using DirectRunner from my Mac and got lots of "WARNING" messages like this; how can I get rid of them? There are so many that I cannot even see my de...
Jair asked 5/4, 2018 at 21:27
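
Assuming the noise comes through Python logging (which is how the DirectRunner surfaces most of its warnings), one way to quiet it is to raise the log level before running the pipeline; a sketch, not the only possible fix:

import logging

import apache_beam as beam

# Raise the threshold so WARNING-level chatter is suppressed during local runs.
logging.getLogger().setLevel(logging.ERROR)
logging.getLogger('apache_beam').setLevel(logging.ERROR)

with beam.Pipeline(runner='DirectRunner') as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)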

1

Solved

I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with the main input from Pub/Sub and a side input from BigQuery, and store processed data ...
Rescript asked 3/1, 2022 at 4:48
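
A rough sketch of that shape (streaming main input from Pub/Sub plus a bounded BigQuery read used as a side input); the topic, query, table, and field names are invented for illustration:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Bounded side input: read the lookup table once from BigQuery.
    lookup = (p
              | 'ReadLookup' >> beam.io.ReadFromBigQuery(
                  query='SELECT key, value FROM `my-project.my_dataset.lookup`',
                  use_standard_sql=True)
              | 'ToKV' >> beam.Map(lambda row: (row['key'], row['value'])))

    # Unbounded main input from Pub/Sub, enriched via the side input.
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
         topic='projects/my-project/topics/events')
     | 'Parse' >> beam.Map(json.loads)
     | 'Enrich' >> beam.Map(
         lambda msg, lk: {**msg, 'value': lk.get(msg['key'])},
         lk=beam.pvalue.AsDict(lookup))
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.enriched',
         schema='key:STRING,value:STRING'))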

2

I have run the code below for 522 gzip files of size 100 GB; after decompressing it will be around 320 GB of data in protobuf format, and the output is written to GCS. I have used n1 standard m...
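
For reference, reading gzip-compressed files from GCS and writing results back can be sketched as below; the file pattern and the parse step are placeholders (the real job decodes protobuf records, which this sketch does not attempt):

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText(
         'gs://my-bucket/input/*.gz',
         compression_type=CompressionTypes.GZIP)
     | beam.Map(lambda line: line)  # replace with real record parsing
     | beam.io.WriteToText('gs://my-bucket/output/records'))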

1

I am using the Go SDK with Apache Beam to build a simple Dataflow pipeline that will get data from a query and publish the data to Pub/Sub with the following code: package main import ( "con...
Cornstarch asked 20/10, 2021 at 19:0

2

Solved

I'm trying to run my Python Dataflow job with a Flex Template. The job works fine locally when I run it with the direct runner (without the Flex Template); however, when I try to run it with the Flex Template, the job gets stuck ...

1

Solved

I am writing a Splittable DoFn to read a MongoDB change stream. It allows me to observe events describing changes to a collection, and I can start reading at an arbitrary cluster timestamp I want, ...
Burkhart asked 27/9, 2021 at 9:40
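
The structural pieces of a Splittable DoFn in the Python SDK, reduced to a toy offset range rather than MongoDB cluster timestamps; every name and field below is invented, so read it only as a skeleton of the API:

import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider


class ChangeStreamRestrictionProvider(RestrictionProvider):
    def initial_restriction(self, element):
        return OffsetRange(element['start'], element['stop'])

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()


class ReadEvents(beam.DoFn):
    def process(
            self,
            element,
            tracker=beam.DoFn.RestrictionParam(ChangeStreamRestrictionProvider())):
        pos = tracker.current_restriction().start
        while tracker.try_claim(pos):
            # A real implementation would pull the next change-stream event here.
            yield {'position': pos}
            pos += 1


with beam.Pipeline() as p:
    (p
     | beam.Create([{'start': 0, 'stop': 10}])
     | beam.ParDo(ReadEvents())
     | beam.Map(print))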

3

I am running a streaming Apache Beam pipeline in Google Dataflow. It reads data from Kafka and does streaming inserts into BigQuery. But in the BigQuery streaming insert step it throws a large numb...
Trodden asked 1/6, 2021 at 8:58
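
Not a fix for the error itself, but a hedged sketch of how the streaming-insert step is often configured in the Python SDK so that rejected rows land on a dead-letter output instead of failing the bundle; the table, schema, and input data are placeholders:

import apache_beam as beam
from apache_beam.io.gcp.bigquery import RetryStrategy

with beam.Pipeline() as p:
    events = p | beam.Create([{'id': 'a', 'ts': '2021-06-01T00:00:00'}])

    result = events | beam.io.WriteToBigQuery(
        'my-project:my_dataset.events',
        schema='id:STRING,ts:TIMESTAMP',
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR)

    # Rows BigQuery rejected are emitted on this output for logging or re-routing.
    _ = result['FailedRows'] | beam.Map(lambda failed: print('failed row:', failed))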

2

Before seeing: RuntimeError: IOError: [Errno 2] No such file or directory: '/beam-temp-andrew_mini_vocab-..../......andrew_mini_vocab' [while running .....] in my Apache Beam Python Dataflow job...

1

I'm testing a pipeline on a small set of data, and suddenly my pipeline breaks down during one of the test runs with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in lo...
Fatuity asked 16/2, 2020 at 10:22

1

While working to adapt Java's KafkaIOIT to work with a large dataset, I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness, and at the same time check t...

2

I am using zsh, and I have installed gcloud in order to interact with GCP via a local terminal on my Mac. I am encountering this error: “zsh: no matches found: apache-beam[gcp]”. However, when I run t...
Scaffold asked 11/3, 2020 at 14:21

2

My Apache Beam pipeline implements custom Transforms and ParDo Python modules which in turn import other modules written by me. On the local runner this works fine, as all the available files are ava...
Tonneau asked 10/7, 2018 at 9:45
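
The usual remedy is to package the local modules and hand the pipeline a setup.py via the setup_file option so Dataflow installs them on every worker; the package name and version below are placeholders:

# setup.py (at the project root, next to the pipeline's main module)
import setuptools

setuptools.setup(
    name='my_pipeline',
    version='0.1.0',
    packages=setuptools.find_packages(),
)

# In the pipeline's main module, point the runner at that setup.py.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
options.view_as(SetupOptions).setup_file = './setup.py'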

3

I'm using Flink 1.4.1 and Beam 2.3.0, and would like to know: is it possible to have metrics available in the Flink WebUI (or anywhere at all), as in the Dataflow WebUI? I've used a counter like: import or...
Leeward asked 27/2, 2018 at 16:51
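
The question uses the Java SDK; purely for orientation, the equivalent user counter in the Python SDK looks roughly like this (namespace and names are arbitrary), and whether a given runner's UI actually displays it is a separate matter:

import apache_beam as beam
from apache_beam.metrics import Metrics


class CountElements(beam.DoFn):
    def __init__(self):
        # A user-defined counter; the runner decides where (or whether) it is shown.
        self.elements = Metrics.counter(self.__class__, 'elements_seen')

    def process(self, element):
        self.elements.inc()
        yield element


with beam.Pipeline() as p:
    p | beam.Create([1, 2, 3]) | beam.ParDo(CountElements())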

1

Solved

I want to publish messages to a Pub/Sub topic with some attributes via a Dataflow job in batch mode. My Dataflow pipeline is written with Python 3.8 and apache-beam 2.27.0. It works with the @Anku...
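
A sketch of building PubsubMessage objects with attributes and writing them with with_attributes=True; the topic and attribute names are invented, and whether batch-mode Dataflow accepts the Pub/Sub sink at all is the crux of the original question, so treat this only as the shape of the code:

import apache_beam as beam
from apache_beam.io.gcp.pubsub import PubsubMessage, WriteToPubSub

with beam.Pipeline() as p:
    (p
     | beam.Create([b'hello', b'world'])
     | beam.Map(lambda data: PubsubMessage(data=data,
                                           attributes={'source': 'batch-job'}))
     | WriteToPubSub(topic='projects/my-project/topics/my-topic',
                     with_attributes=True))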

2

I have 2 PCollections: PCollection<List<String>> ListA = pipeline.apply("getListA", ParDo.of(new getListA())) PCollection<List<String>> ListB = pipeline.apply("...
Prytaneum asked 5/2, 2021 at 15:20

1

Solved

This question might seem like a duplicate of this one. I am trying to run an Apache Beam Python pipeline using Flink on an offline instance of Kubernetes. However, since I have user code with external de...
Civility asked 26/2, 2020 at 9:48
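
For reference, pointing a Python pipeline at a Flink job service with a custom SDK container (the usual route when workers cannot pull dependencies from the internet) is sketched below; the job endpoint and image name are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=flink-jobservice.beam.svc.cluster.local:8099',
    '--environment_type=DOCKER',
    '--environment_config=registry.local/my-beam-python-sdk:latest',
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['smoke-test']) | beam.Map(print)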

2

Solved

I am pretty new to Apache Beam, and I am trying to write a pipeline to extract data from Google BigQuery and write it to GCS in CSV format using Python. Using beam.io.read(b...
Intermixture asked 22/10, 2018 at 12:27
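
A small sketch of the BigQuery-to-CSV shape using the newer ReadFromBigQuery transform (the question itself uses the older beam.io.Read source); the query, column list, and output path are invented:

import apache_beam as beam

COLUMNS = ['id', 'name']


def to_csv_line(row):
    return ','.join(str(row[c]) for c in COLUMNS)


with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromBigQuery(
         query='SELECT id, name FROM `my-project.my_dataset.my_table`',
         use_standard_sql=True)
     | beam.Map(to_csv_line)
     | beam.io.WriteToText('gs://my-bucket/export/data',
                           file_name_suffix='.csv',
                           header=','.join(COLUMNS)))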

4

Solved

I am running my Google Dataflow job on Google Cloud Platform (GCP). When I ran this job locally it worked well, but when running it on GCP I got this error: "java.lang.IllegalArgumentException: No...

0

I'm trying out a simple example of reading data off a Kafka topic into Apache Beam. Here's the relevant snippet: with beam.Pipeline(options=pipeline_options) as pipeline: _ = ( pipeline | 'Read...
Vasoinhibitor asked 11/2, 2021 at 9:23
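
A hedged sketch of the cross-language Kafka read in the Python SDK; the broker address and topic are placeholders, and the transform spins up a Java expansion service under the hood, so a Java runtime must be available:

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | ReadFromKafka(
         consumer_config={'bootstrap.servers': 'localhost:9092'},
         topics=['my-topic'])
     | beam.Map(print))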

2

My workflow: Kafka -> Dataflow streaming -> BigQuery. Given that low latency isn't important in my case, I use FILE_LOADS to reduce costs. I'm using BigQueryIO.Write with a DynamicDesti...
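
The question concerns the Java SDK's BigQueryIO.Write; purely for orientation, the analogous write step in the Python SDK (FILE_LOADS with a periodic trigger and a table function standing in for DynamicDestinations) might look like this, with table naming and schema as assumptions:

import apache_beam as beam


def destination(row):
    # Hypothetical routing: pick the destination table from a field on each record.
    return 'my-project:my_dataset.events_{}'.format(row['type'])


write = beam.io.WriteToBigQuery(
    table=destination,
    schema='type:STRING,payload:STRING',
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # run a load job every five minutes
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)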

3

Does anybody know how to run Beam Python pipelines with Flink when Flink is running as pods in Kubernetes? I have successfully managed to run a Beam Python pipeline using the Portable runner and t...
Sailfish asked 9/9, 2019 at 9:22

2

Solved

I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since ...

2

Solved

I am quite experienced with Spark cluster configuration and running PySpark pipelines, but I'm just starting with Beam. So, I am trying to do an apples-to-apples comparison between PySpark and the Be...
Amie asked 17/11, 2020 at 16:6
