apache-beam Questions

2

I have a Google Cloud Dataflow job that I'm running from IntelliJ IDEA using the following command string: compile exec:java -Dexec.mainClass=com.mygroup.mainclass "-Dexec.args=--..." It runs fi...

4

Solved

I'm trying to give useful information, but I am far from being a data engineer. I am currently using the Python library pandas to execute a long series of transformations on my data, which has ...
Injured asked 9/5, 2018 at 9:19
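
The usual translation, sketched below, is to express each pandas column operation as a Beam Map/ParDo over dict-shaped elements. This is an illustrative sketch, not the asker's code; the field names are made up.

```python
import apache_beam as beam

# pandas equivalent of the step below: df["total"] = df["price"] * df["qty"]
def add_total(row):
    row = dict(row)  # copy, so the input element is left unmutated
    row["total"] = row["price"] * row["qty"]
    return row

with beam.Pipeline() as p:
    (p
     | beam.Create([{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 4}])
     | beam.Map(add_total)
     | beam.Map(print))
```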

2

Solved

I have a simple pipeline in the Dataflow 2.1 SDK that reads data from Pub/Sub and then applies a DoFn to it. PCollection<MyClass> e = streamData.apply("ToE", ParDo.of(new MyDoFNClass())); Gettin...
Lewak asked 8/12, 2017 at 0:34

3

Solved

How do I implement pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not well organized. I checke...
Citron asked 15/2, 2018 at 12:0
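
Beam's Python SDK has no built-in multi-column left join, but one common workaround is to key both PCollections by a composite tuple and use CoGroupByKey. A minimal sketch with hypothetical column names:

```python
import apache_beam as beam

left = [{"a": 1, "b": "x", "v": 10}, {"a": 2, "b": "y", "v": 20}]
right = [{"a": 1, "b": "x", "w": 100}]

def key_by(cols):
    # Build a composite key from several columns.
    return lambda row: (tuple(row[c] for c in cols), row)

def left_join(element):
    _, grouped = element
    for l in grouped["left"]:
        # Emit the bare left row when there is no match (left join semantics).
        for r in (list(grouped["right"]) or [None]):
            yield {**l, **(r or {})}

with beam.Pipeline() as p:
    lpc = p | "CreateL" >> beam.Create(left) | "KeyL" >> beam.Map(key_by(["a", "b"]))
    rpc = p | "CreateR" >> beam.Create(right) | "KeyR" >> beam.Map(key_by(["a", "b"]))
    ({"left": lpc, "right": rpc}
     | beam.CoGroupByKey()
     | beam.FlatMap(left_join)
     | beam.Map(print))
```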

2

Solved

I have a Beam application that runs successfully locally with the DirectRunner and gives me all the log information I have in my code on my local console. But when I tried running it in the Google Clou...
Exmoor asked 16/9, 2017 at 17:38
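
In the Python SDK, the standard answer is to log through the stdlib logging module, which Dataflow forwards to Cloud Logging; a minimal sketch (the pipeline content here is made up):

```python
import logging
import apache_beam as beam

class LogFn(beam.DoFn):
    def process(self, element):
        # On Dataflow workers, records written via the logging module
        # show up in Cloud Logging; print() output is less reliable there.
        logging.info("processing element: %s", element)
        yield element

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)  # keep INFO visible locally too
    with beam.Pipeline() as p:
        p | beam.Create([1, 2, 3]) | beam.ParDo(LogFn())
```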

1

I am running an apache beam workload on Spark. I initialized the workers with 32GB of memory (slave run with -c 2 -m 32G). Spark submit sets driver memory to 30g and executor memory to 16g. However...
Nature asked 22/10, 2020 at 18:3

3

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains one line. I am trying to make Apache Beam read the input as one line, but cannot ...
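
ReadFromText splits input on newlines, so it cannot keep a quoted multi-line CSV record together. One common workaround, sketched below with a hypothetical path, is to match whole files and parse them with the stdlib csv module:

```python
import csv
import io
import apache_beam as beam
from apache_beam.io import fileio

def parse_csv(readable_file):
    # csv.reader understands quoted newlines inside fields, which
    # line-based splitting cannot handle.
    yield from csv.reader(io.StringIO(readable_file.read_utf8()))

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles("gs://my-bucket/input.csv")  # hypothetical path
     | fileio.ReadMatches()
     | beam.FlatMap(parse_csv)
     | beam.Map(print))
```

The trade-off is that each file is read as one unit, so there is no parallelism within a single large file.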

2

Solved

I'm trying to set up my development environment. Instead of using Google Cloud Pub/Sub in production, I've been using the Pub/Sub emulator for development and testing. To achieve this I set the follow...
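
For the Google Cloud clients, the switch is driven by an environment variable; a minimal sketch (the project id and port are illustrative):

```python
import os

# The Pub/Sub client libraries check this variable and, when set, send
# traffic to the local emulator instead of the production service.
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"  # default emulator port
os.environ["PUBSUB_PROJECT_ID"] = "my-test-project"    # hypothetical project
```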

1

I experience unexpected performance issues when writing to BigQuery with streaming inserts and Python SDK 2.23. Without the write step the pipeline runs on one worker with ~20-30% CPU. Adding the B...
Die asked 9/9, 2020 at 7:28
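
One lever worth trying in such cases is switching WriteToBigQuery from streaming inserts to batched load jobs, which trades latency for per-element cost. A sketch with an illustrative table and parameters (not taken from the question):

```python
import apache_beam as beam

bq_write = beam.io.WriteToBigQuery(
    "my-project:my_dataset.my_table",  # hypothetical table
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # seconds between load jobs on a streaming pipeline
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
)
```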

2

Solved

We want to improve the costs of running a specific Apache Beam pipeline (Python SDK) in GCP Dataflow. We have built a memory-intensive Apache Beam pipeline, which requires approximately 8.5 GB of R...
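
For memory-bound Dataflow pipelines, one standard cost lever is a custom machine type with a high RAM-to-vCPU ratio, so fewer vCPUs are billed per GB of memory. A sketch with illustrative option values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # hypothetical project
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",    # hypothetical bucket
    "--worker_machine_type=custom-2-20480",  # 2 vCPUs, 20 GB RAM
])
```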

0

The structure of my_dir is:
├── README.md
├── main
│   ├── functions
│   │   ├── __pycache__
│   │   ├── my_function.py
│   ├── pipeline.py
│   ├── options
│   │   └── pipeline_options.py
│   └── tr...
Ambrogio asked 2/9, 2020 at 14:8
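
Assuming this is the usual question about local modules not being importable on Dataflow workers, the standard fix is to package the project and point the pipeline at a setup.py; a sketch (a setup.py at the repo root is assumed to exist):

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
setup = options.view_as(SetupOptions)
setup.setup_file = "./setup.py"  # ships main/functions, main/options, ... to the workers
setup.save_main_session = True   # pickle the main session's imports and globals
```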

0

Is there a python library that converts Avro schemas to BigQuery schemas? I noticed that the Java SDK for Apache Beam has a utility that converts from Avro to BigQuery. However, the python SDK for ...
Cementum asked 17/8, 2020 at 16:27
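
The Python SDK has no public equivalent of the Java converter, so a hand-rolled mapping is common. A minimal sketch that only covers flat records with primitive and nullable-union fields:

```python
# Illustrative type table; extend for logical types, records, arrays, etc.
AVRO_TO_BQ = {
    "string": "STRING", "int": "INTEGER", "long": "INTEGER",
    "float": "FLOAT", "double": "FLOAT", "boolean": "BOOLEAN",
    "bytes": "BYTES",
}

def avro_to_bq_schema(avro_schema):
    fields = []
    for f in avro_schema["fields"]:
        ftype, mode = f["type"], "REQUIRED"
        if isinstance(ftype, list):  # a union such as ["null", "string"]
            ftype = next(t for t in ftype if t != "null")
            mode = "NULLABLE"
        fields.append({"name": f["name"], "type": AVRO_TO_BQ[ftype], "mode": mode})
    return {"fields": fields}
```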

4

I want to write to a gs file but I don’t know the file name at compile time. Its name is based on behavior that is defined at runtime. How can I proceed?
Yehudi asked 30/1, 2018 at 11:3
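
In the Python SDK, the analogous mechanism is fileio.WriteToFiles, whose destination callback is evaluated per element at runtime; a sketch with a hypothetical bucket and made-up records:

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | beam.Create(["alice,hello", "bob,hi"])
     | fileio.WriteToFiles(
         path="gs://my-bucket/out",                    # hypothetical bucket
         destination=lambda line: line.split(",")[0],  # name decided at runtime
         sink=lambda dest: fileio.TextSink(),
         file_naming=fileio.destination_prefix_naming()))
```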

2

Solved

Say we have one worker with 4 CPU cores. How is parallelism configured on Dataflow worker machines? Do we parallelize beyond the number of cores?
Chere asked 12/12, 2017 at 16:48

2

Solved

See the code snippet below: I want ["metric1", "metric2"] to be the input of RunTask.process. However, it was run twice, with "metric1" and "metric2" respectively ...
Ungotten asked 23/7, 2020 at 8:38
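
Beam fans a PCollection out element by element, so passing the list to Create directly yields one call per string. Wrapping the list in a single element, as sketched below, makes the whole list the input of RunTask.process:

```python
import apache_beam as beam

metrics = ["metric1", "metric2"]

class RunTask(beam.DoFn):
    def process(self, element):
        print("got:", element)  # the whole list, in one call
        yield element

with beam.Pipeline() as p:
    (p
     | beam.Create([metrics])  # note [metrics]: a one-element PCollection
     | beam.ParDo(RunTask()))
```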

1

My Beam pipeline is writing to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. BigQueryIO apparently creates a temp file for every single record in the Bi...
Wittie asked 10/8, 2017 at 14:59

3

Solved

I'm specifying the Dataflow runner in my beamSql program below: DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class); options.setStagingLocation("gs://gcpbucket...
Kapoor asked 12/3, 2018 at 2:25

2

Solved

I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a sim...
Aesthetic asked 29/1, 2019 at 17:0
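
The usual pattern for alternative flows is a dead-letter output: catch the exception inside the DoFn and route failures to a tagged side output. A self-contained sketch (Create stands in for the Pub/Sub source; the sinks are only indicated in comments):

```python
import json
import apache_beam as beam
from apache_beam import pvalue

class ParseFn(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as err:
            # Route the failure to a side output instead of failing the bundle.
            yield pvalue.TaggedOutput("dead_letter", {"raw": element, "error": str(err)})

with beam.Pipeline() as p:
    results = (p
               | beam.Create(['{"a": 1}', "not json"])  # stand-in for Pub/Sub
               | beam.ParDo(ParseFn()).with_outputs("dead_letter", main="parsed"))
    results.parsed | "Good" >> beam.Map(print)      # would go to the main BigQuery table
    results.dead_letter | "Bad" >> beam.Map(print)  # would go to an error table or GCS
```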

2

Solved

I wanted to do something like: PCollection<String> a = whatever; PCollection<KV<String, User>> b = a.apply( MapElements.into(TypeDescriptor.of(KV<String, User>.class)) .v...
Moonwort asked 10/11, 2018 at 2:35

2

Solved

I didn't configure the project and I get this error whenever I run my job 'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with...
Blaine asked 30/7, 2019 at 10:3

2

I am using the Python SDK of Apache Beam. I have a few transform steps and want to make them reusable, which points me toward writing a custom composite transform like this: class MyCompositeTransform...
Blakney asked 19/12, 2018 at 6:36
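
The reusable unit in the Python SDK is a PTransform subclass whose expand method chains the steps; a minimal sketch (the transform body here is illustrative):

```python
import apache_beam as beam

class MyCompositeTransform(beam.PTransform):
    """Bundles several reusable steps behind a single apply."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def expand(self, pcoll):
        return (pcoll
                | beam.Map(lambda x: x * self.factor)
                | beam.Filter(lambda x: x > 0))

with beam.Pipeline() as p:
    p | beam.Create([1, -2, 3]) | MyCompositeTransform(10) | beam.Map(print)
```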

2

I have a pipeline that I can execute locally without any errors. I used to get this error in my locally run pipeline: 'Clients have non-trivial state that is local and unpickleable.' PicklingErr...
Cyanide asked 30/5, 2018 at 19:9
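
That error typically means a live client object ended up in the DoFn's pickled state; the common fix is to create it in setup(), which runs on the worker after unpickling. A generic sketch (http.client stands in for whatever unpickleable client is involved):

```python
import apache_beam as beam

class CallApiFn(beam.DoFn):
    def __init__(self):
        self._conn = None  # nothing unpickleable is held at submission time

    def setup(self):
        # Runs once per DoFn instance on the worker, after unpickling,
        # so the connection never has to be serialized.
        import http.client  # stand-in for any unpickleable client library
        self._conn = http.client.HTTPSConnection("example.com")

    def process(self, element):
        # self._conn is available here
        yield element
```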

0

I need to process some values in a data pipeline and need to use the value later somewhere in the program. Here is a simple example: import apache_beam as beam p = beam.Pipeline() resu=( p | b...
Arturo asked 21/3, 2020 at 11:6
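
A PCollection is a deferred value, so it cannot be used directly as a Python object; the usual route is to materialize it and read the result back after the run finishes. A small local-runner sketch:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * 10)
     | beam.io.WriteToText("out", shard_name_template=""))  # one file named "out"

# The with-block waits for the pipeline; now the result is an ordinary file.
with open("out") as f:
    print([int(line) for line in f])
```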

2

I'm trying to launch a Dataflow job on GCP using Apache Beam 0.6.0. I am compiling an uber jar using the shade plugin because I cannot launch the job using "mvn exec:java". I'm including this depend...
Appurtenant asked 21/3, 2017 at 20:13

5

Solved

Apache Beam seems to be refusing to recognise Kotlin's Iterable. Here is some sample code: @ProcessElement fun processElement( @Element input: KV<String, Iterable<String>>, receiver: Out...
Cowitch asked 29/4, 2019 at 18:32
