apache-beam Questions

3

Solved

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a...
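
A minimal sketch of that pattern (a psycopg2 query inside a custom ParDo feeding WriteToBigQuery), assuming placeholder connection settings, query, table, and schema rather than the asker's actual values:

import apache_beam as beam


class ReadFromCloudSQL(beam.DoFn):
    """Runs a SQL query with psycopg2 and emits one dict per row."""

    def __init__(self, dsn):
        self.dsn = dsn

    def setup(self):
        import psycopg2  # imported on the worker
        self.conn = psycopg2.connect(self.dsn)

    def process(self, query):
        with self.conn.cursor() as cur:
            cur.execute(query)
            columns = [desc[0] for desc in cur.description]
            for row in cur:
                yield dict(zip(columns, row))

    def teardown(self):
        self.conn.close()


with beam.Pipeline() as p:
    (p
     | beam.Create(['SELECT id, name FROM users'])  # one query per element
     | beam.ParDo(ReadFromCloudSQL(dsn='host=... dbname=... user=... password=...'))
     | beam.io.WriteToBigQuery(
         'my-project:my_dataset.users',
         schema='id:INTEGER,name:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))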

1

Solved

We have a Beam data pipeline running on GCP Dataflow, written using both Python and Java. In the beginning, we had some simple and straightforward Python Beam jobs that worked very well. So most recent...
Bronchial asked 20/1, 2022 at 15:50

3

I was testing my Dataflow pipeline using DirectRunner from my Mac and got lots of "WARNING" messages like this; how can I get rid of them? There are so many that I cannot even see my de...
Jair asked 5/4, 2018 at 21:27
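
Assuming the noise comes through Python logging (which is how the DirectRunner surfaces most of its warnings), one way to quiet it is to raise the log level before running the pipeline; a sketch, not the only possible fix:

import logging

import apache_beam as beam

# Raise the threshold so WARNING-level chatter is suppressed during local runs.
logging.getLogger().setLevel(logging.ERROR)
logging.getLogger('apache_beam').setLevel(logging.ERROR)

with beam.Pipeline(runner='DirectRunner') as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)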

1

Solved

I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with the main input from Pub/Sub and a side input from BigQuery, and store processed data ...
Rescript asked 3/1, 2022 at 4:48
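
A rough sketch of that shape (streaming main input from Pub/Sub plus a bounded BigQuery read used as a side input); the topic, query, table, and field names are invented for illustration:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Bounded side input: read the lookup table once from BigQuery.
    lookup = (p
              | 'ReadLookup' >> beam.io.ReadFromBigQuery(
                  query='SELECT key, value FROM `my-project.my_dataset.lookup`',
                  use_standard_sql=True)
              | 'ToKV' >> beam.Map(lambda row: (row['key'], row['value'])))

    # Unbounded main input from Pub/Sub, enriched via the side input.
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
         topic='projects/my-project/topics/events')
     | 'Parse' >> beam.Map(json.loads)
     | 'Enrich' >> beam.Map(
         lambda msg, lk: {**msg, 'value': lk.get(msg['key'])},
         lk=beam.pvalue.AsDict(lookup))
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.enriched',
         schema='key:STRING,value:STRING'))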

2

I have run the code below for 522 gzip files of size 100 GB; after decompressing it will be around 320 GB of data in protobuf format, and the output is written to GCS. I have used n1 standard m...
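
For reference, reading gzip-compressed files from GCS and writing results back can be sketched as below; the file pattern and the parse step are placeholders (the real job decodes protobuf records, which this sketch does not attempt):

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText(
         'gs://my-bucket/input/*.gz',
         compression_type=CompressionTypes.GZIP)
     | beam.Map(lambda line: line)  # replace with real record parsing
     | beam.io.WriteToText('gs://my-bucket/output/records'))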

1

I am using the Go SDK with Apache Beam to build a simple Dataflow pipeline that will get data from a query and publish the data to Pub/Sub with the following code: package main import ( "con...
Cornstarch asked 20/10, 2021 at 19:0

2

Solved

I'm trying to run my Python Dataflow job with a Flex Template. The job works fine locally when I run it with the direct runner (without the Flex Template); however, when I try to run it with the Flex Template, the job gets stuck ...

1

Solved

I am writing a Splittable DoFn to read a MongoDB change stream. It allows me to observe events describing changes to a collection, and I can start reading at an arbitrary cluster timestamp I want, ...
Burkhart asked 27/9, 2021 at 9:40
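
The structural pieces of a Splittable DoFn in the Python SDK, reduced to a toy offset range rather than MongoDB cluster timestamps; every name and field below is invented, so read it only as a skeleton of the API:

import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider


class ChangeStreamRestrictionProvider(RestrictionProvider):
    def initial_restriction(self, element):
        return OffsetRange(element['start'], element['stop'])

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()


class ReadEvents(beam.DoFn):
    def process(
            self,
            element,
            tracker=beam.DoFn.RestrictionParam(ChangeStreamRestrictionProvider())):
        pos = tracker.current_restriction().start
        while tracker.try_claim(pos):
            # A real implementation would pull the next change-stream event here.
            yield {'position': pos}
            pos += 1


with beam.Pipeline() as p:
    (p
     | beam.Create([{'start': 0, 'stop': 10}])
     | beam.ParDo(ReadEvents())
     | beam.Map(print))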

3

I am running a streaming Apache Beam pipeline in Google Dataflow. It reads data from Kafka and does streaming inserts into BigQuery. But in the BigQuery streaming insert step it throws a large numb...
Trodden asked 1/6, 2021 at 8:58
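
Not a fix for the error itself, but a hedged sketch of how the streaming-insert step is often configured in the Python SDK so that rejected rows land on a dead-letter output instead of failing the bundle; the table, schema, and input data are placeholders:

import apache_beam as beam
from apache_beam.io.gcp.bigquery import RetryStrategy

with beam.Pipeline() as p:
    events = p | beam.Create([{'id': 'a', 'ts': '2021-06-01T00:00:00'}])

    result = events | beam.io.WriteToBigQuery(
        'my-project:my_dataset.events',
        schema='id:STRING,ts:TIMESTAMP',
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR)

    # Rows BigQuery rejected are emitted on this output for logging or re-routing.
    _ = result['FailedRows'] | beam.Map(lambda failed: print('failed row:', failed))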

2

Before seeing: RuntimeError: IOError: [Errno 2] No such file or directory: '/beam-temp-andrew_mini_vocab-..../......andrew_mini_vocab' [while running .....] in my Apache Beam Python Dataflow job...

1

I'm testing a pipeline on a small set of data, and suddenly my pipeline breaks down during one of the test runs with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in lo...
Fatuity asked 16/2, 2020 at 10:22

1

While working to adapt Java's KafkaIOIT to work with a large dataset, I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness, and at the same time check t...

2

I am using zsh, and I have installed gcloud in order to interact with GCP via a local terminal on my Mac. I am encountering this error: “zsh: no matches found: apache-beam[gcp]”. However, when I run t...
Scaffold asked 11/3, 2020 at 14:21

2

My Apache Beam pipeline implements custom Transforms and ParDo Python modules which in turn import other modules written by me. On the local runner this works fine, as all the available files are ava...
Tonneau asked 10/7, 2018 at 9:45
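
The usual remedy is to package the local modules and hand the pipeline a setup.py via the setup_file option so Dataflow installs them on every worker; the package name and version below are placeholders:

# setup.py (at the project root, next to the pipeline's main module)
import setuptools

setuptools.setup(
    name='my_pipeline',
    version='0.1.0',
    packages=setuptools.find_packages(),
)

# In the pipeline's main module, point the runner at that setup.py.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
options.view_as(SetupOptions).setup_file = './setup.py'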

3

I'm using Flink 1.4.1 and Beam 2.3.0, and would like to know: is it possible to have metrics available in the Flink WebUI (or anywhere at all), as in the Dataflow WebUI? I've used a counter like: import or...
Leeward asked 27/2, 2018 at 16:51
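
The question uses the Java SDK; purely for orientation, the equivalent user counter in the Python SDK looks roughly like this (namespace and names are arbitrary), and whether a given runner's UI actually displays it is a separate matter:

import apache_beam as beam
from apache_beam.metrics import Metrics


class CountElements(beam.DoFn):
    def __init__(self):
        # A user-defined counter; the runner decides where (or whether) it is shown.
        self.elements = Metrics.counter(self.__class__, 'elements_seen')

    def process(self, element):
        self.elements.inc()
        yield element


with beam.Pipeline() as p:
    p | beam.Create([1, 2, 3]) | beam.ParDo(CountElements())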

1

Solved

I want to publish messages to a Pub/Sub topic with some attributes via a Dataflow job in batch mode. My Dataflow pipeline is written with Python 3.8 and apache-beam 2.27.0. It works with the @Anku...
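
A sketch of building PubsubMessage objects with attributes and writing them with with_attributes=True; the topic and attribute names are invented, and whether batch-mode Dataflow accepts the Pub/Sub sink at all is the crux of the original question, so treat this only as the shape of the code:

import apache_beam as beam
from apache_beam.io.gcp.pubsub import PubsubMessage, WriteToPubSub

with beam.Pipeline() as p:
    (p
     | beam.Create([b'hello', b'world'])
     | beam.Map(lambda data: PubsubMessage(data=data,
                                           attributes={'source': 'batch-job'}))
     | WriteToPubSub(topic='projects/my-project/topics/my-topic',
                     with_attributes=True))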

2

I have 2 PCollections: PCollection<List<String>> ListA = pipeline.apply("getListA", ParDo.of(new getListA())) PCollection<List<String>> ListB = pipeline.apply("...
Prytaneum asked 5/2, 2021 at 15:20

1

Solved

This question might seem like a duplicate of this one. I am trying to run an Apache Beam Python pipeline using Flink on an offline instance of Kubernetes. However, since I have user code with external de...
Civility asked 26/2, 2020 at 9:48
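
For reference, pointing a Python pipeline at a Flink job service with a custom SDK container (the usual route when workers cannot pull dependencies from the internet) is sketched below; the job endpoint and image name are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=flink-jobservice.beam.svc.cluster.local:8099',
    '--environment_type=DOCKER',
    '--environment_config=registry.local/my-beam-python-sdk:latest',
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['smoke-test']) | beam.Map(print)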

2

Solved

I am pretty new to Apache Beam, and I am trying to write a pipeline to extract data from Google BigQuery and write it to GCS in CSV format using Python. Using beam.io.read(b...
Intermixture asked 22/10, 2018 at 12:27
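
A small sketch of the BigQuery-to-CSV shape using the newer ReadFromBigQuery transform (the question itself uses the older beam.io.Read source); the query, column list, and output path are invented:

import apache_beam as beam

COLUMNS = ['id', 'name']


def to_csv_line(row):
    return ','.join(str(row[c]) for c in COLUMNS)


with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromBigQuery(
         query='SELECT id, name FROM `my-project.my_dataset.my_table`',
         use_standard_sql=True)
     | beam.Map(to_csv_line)
     | beam.io.WriteToText('gs://my-bucket/export/data',
                           file_name_suffix='.csv',
                           header=','.join(COLUMNS)))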

4

Solved

I am running my Google Dataflow job on Google Cloud Platform (GCP). When I ran this job locally it worked well, but when running it on GCP I got this error: "java.lang.IllegalArgumentException: No...

0

I'm trying out a simple example of reading data off a Kafka topic into Apache Beam. Here's the relevant snippet: with beam.Pipeline(options=pipeline_options) as pipeline: _ = ( pipeline | 'Read...
Vasoinhibitor asked 11/2, 2021 at 9:23
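
A hedged sketch of the cross-language Kafka read in the Python SDK; the broker address and topic are placeholders, and the transform spins up a Java expansion service under the hood, so a Java runtime must be available:

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | ReadFromKafka(
         consumer_config={'bootstrap.servers': 'localhost:9092'},
         topics=['my-topic'])
     | beam.Map(print))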

2

My workflow: Kafka -> Dataflow streaming -> BigQuery. Given that low latency isn't important in my case, I use FILE_LOADS to reduce costs. I'm using BigQueryIO.Write with a DynamicDesti...
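
The question concerns the Java SDK's BigQueryIO.Write; purely for orientation, the analogous write step in the Python SDK (FILE_LOADS with a periodic trigger and a table function standing in for DynamicDestinations) might look like this, with table naming and schema as assumptions:

import apache_beam as beam


def destination(row):
    # Hypothetical routing: pick the destination table from a field on each record.
    return 'my-project:my_dataset.events_{}'.format(row['type'])


write = beam.io.WriteToBigQuery(
    table=destination,
    schema='type:STRING,payload:STRING',
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # run a load job every five minutes
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)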

3

Does anybody know how to run Beam Python pipelines with Flink when Flink is running as pods in Kubernetes? I have successfully managed to run a Beam Python pipeline using the Portable runner and t...
Sailfish asked 9/9, 2019 at 9:22

2

Solved

I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since ...

2

Solved

I am quite experienced with Spark cluster configuration and running PySpark pipelines, but I'm just starting with Beam. So, I am trying to do an apples-to-apples comparison between PySpark and the Be...
Amie asked 17/11, 2020 at 16:6
