apache-beam Questions

2

I have a Google Cloud Dataflow job that I'm running from IntelliJ IDEA using the following command string: compile exec:java -Dexec.mainClass=com.mygroup.mainclass "-Dexec.args=--..." It runs fi...

4

Solved

I'm trying to give useful information, but I am far from being a data engineer. I am currently using the Python library pandas to execute a long series of transformations on my data, which has ...
Injured asked 9/5, 2018 at 9:19
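
The usual translation, sketched below, is to express each pandas column operation as a Beam Map/ParDo over dict-shaped elements. This is an illustrative sketch, not the asker's code; the field names are made up.

```python
import apache_beam as beam

# pandas equivalent of the step below: df["total"] = df["price"] * df["qty"]
def add_total(row):
    row = dict(row)  # copy, so the input element is left unmutated
    row["total"] = row["price"] * row["qty"]
    return row

with beam.Pipeline() as p:
    (p
     | beam.Create([{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 4}])
     | beam.Map(add_total)
     | beam.Map(print))
```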

2

Solved

I have a simple pipeline in the Dataflow 2.1 SDK that reads data from Pub/Sub and then applies a DoFn to it. PCollection<MyClass> e = streamData.apply("ToE", ParDo.of(new MyDoFNClass())); Gettin...
Lewak asked 8/12, 2017 at 0:34

3

Solved

How do I implement pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not well organized. I checke...
Citron asked 15/2, 2018 at 12:0
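
Beam's Python SDK has no built-in multi-column left join, but one common workaround is to key both PCollections by a composite tuple and use CoGroupByKey. A minimal sketch with hypothetical column names:

```python
import apache_beam as beam

left = [{"a": 1, "b": "x", "v": 10}, {"a": 2, "b": "y", "v": 20}]
right = [{"a": 1, "b": "x", "w": 100}]

def key_by(cols):
    # Build a composite key from several columns.
    return lambda row: (tuple(row[c] for c in cols), row)

def left_join(element):
    _, grouped = element
    for l in grouped["left"]:
        # Emit the bare left row when there is no match (left join semantics).
        for r in (list(grouped["right"]) or [None]):
            yield {**l, **(r or {})}

with beam.Pipeline() as p:
    lpc = p | "CreateL" >> beam.Create(left) | "KeyL" >> beam.Map(key_by(["a", "b"]))
    rpc = p | "CreateR" >> beam.Create(right) | "KeyR" >> beam.Map(key_by(["a", "b"]))
    ({"left": lpc, "right": rpc}
     | beam.CoGroupByKey()
     | beam.FlatMap(left_join)
     | beam.Map(print))
```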

2

Solved

I have a Beam application that runs successfully locally with the DirectRunner and gives me all the log information I have in my code on my local console. But when I tried running it in the Google Clou...
Exmoor asked 16/9, 2017 at 17:38
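
In the Python SDK, the standard answer is to log through the stdlib logging module, which Dataflow forwards to Cloud Logging; a minimal sketch (the pipeline content here is made up):

```python
import logging
import apache_beam as beam

class LogFn(beam.DoFn):
    def process(self, element):
        # On Dataflow workers, records written via the logging module
        # show up in Cloud Logging; print() output is less reliable there.
        logging.info("processing element: %s", element)
        yield element

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)  # keep INFO visible locally too
    with beam.Pipeline() as p:
        p | beam.Create([1, 2, 3]) | beam.ParDo(LogFn())
```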

1

I am running an apache beam workload on Spark. I initialized the workers with 32GB of memory (slave run with -c 2 -m 32G). Spark submit sets driver memory to 30g and executor memory to 16g. However...
Nature asked 22/10, 2020 at 18:3

3

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains one line. I am trying to make Apache Beam read the input as one line, but cannot ...
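
ReadFromText splits input on newlines, so it cannot keep a quoted multi-line CSV record together. One common workaround, sketched below with a hypothetical path, is to match whole files and parse them with the stdlib csv module:

```python
import csv
import io
import apache_beam as beam
from apache_beam.io import fileio

def parse_csv(readable_file):
    # csv.reader understands quoted newlines inside fields, which
    # line-based splitting cannot handle.
    yield from csv.reader(io.StringIO(readable_file.read_utf8()))

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles("gs://my-bucket/input.csv")  # hypothetical path
     | fileio.ReadMatches()
     | beam.FlatMap(parse_csv)
     | beam.Map(print))
```

The trade-off is that each file is read as one unit, so there is no parallelism within a single large file.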

2

Solved

I'm trying to set up my development environment. Instead of using Google Cloud Pub/Sub in production, I've been using the Pub/Sub emulator for development and testing. To achieve this I set the follow...
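
For the Google Cloud clients, the switch is driven by an environment variable; a minimal sketch (the project id and port are illustrative):

```python
import os

# The Pub/Sub client libraries check this variable and, when set, send
# traffic to the local emulator instead of the production service.
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"  # default emulator port
os.environ["PUBSUB_PROJECT_ID"] = "my-test-project"    # hypothetical project
```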

1

I experience unexpected performance issues when writing to BigQuery with streaming inserts and Python SDK 2.23. Without the write step the pipeline runs on one worker with ~20-30% CPU. Adding the B...
Die asked 9/9, 2020 at 7:28
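
One lever worth trying in such cases is switching WriteToBigQuery from streaming inserts to batched load jobs, which trades latency for per-element cost. A sketch with an illustrative table and parameters (not taken from the question):

```python
import apache_beam as beam

bq_write = beam.io.WriteToBigQuery(
    "my-project:my_dataset.my_table",  # hypothetical table
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # seconds between load jobs on a streaming pipeline
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
)
```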

2

Solved

We want to improve the costs of running a specific Apache Beam pipeline (Python SDK) in GCP Dataflow. We have built a memory-intensive Apache Beam pipeline, which requires approximately 8.5 GB of R...
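
For memory-bound Dataflow pipelines, one standard cost lever is a custom machine type with a high RAM-to-vCPU ratio, so fewer vCPUs are billed per GB of memory. A sketch with illustrative option values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # hypothetical project
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",    # hypothetical bucket
    "--worker_machine_type=custom-2-20480",  # 2 vCPUs, 20 GB RAM
])
```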

0

The structure of my_dir is:
├── README.md
├── main
│   ├── functions
│   │   ├── __pycache__
│   │   ├── my_function.py
│   ├── pipeline.py
│   ├── options
│   │   └── pipeline_options.py
│   └── tr...
Ambrogio asked 2/9, 2020 at 14:8
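
Assuming this is the usual question about local modules not being importable on Dataflow workers, the standard fix is to package the project and point the pipeline at a setup.py; a sketch (a setup.py at the repo root is assumed to exist):

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
setup = options.view_as(SetupOptions)
setup.setup_file = "./setup.py"  # ships main/functions, main/options, ... to the workers
setup.save_main_session = True   # pickle the main session's imports and globals
```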

0

Is there a python library that converts Avro schemas to BigQuery schemas? I noticed that the Java SDK for Apache Beam has a utility that converts from Avro to BigQuery. However, the python SDK for ...
Cementum asked 17/8, 2020 at 16:27
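
The Python SDK has no public equivalent of the Java converter, so a hand-rolled mapping is common. A minimal sketch that only covers flat records with primitive and nullable-union fields:

```python
# Illustrative type table; extend for logical types, records, arrays, etc.
AVRO_TO_BQ = {
    "string": "STRING", "int": "INTEGER", "long": "INTEGER",
    "float": "FLOAT", "double": "FLOAT", "boolean": "BOOLEAN",
    "bytes": "BYTES",
}

def avro_to_bq_schema(avro_schema):
    fields = []
    for f in avro_schema["fields"]:
        ftype, mode = f["type"], "REQUIRED"
        if isinstance(ftype, list):  # a union such as ["null", "string"]
            ftype = next(t for t in ftype if t != "null")
            mode = "NULLABLE"
        fields.append({"name": f["name"], "type": AVRO_TO_BQ[ftype], "mode": mode})
    return {"fields": fields}
```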

4

I want to write to a gs file but I don’t know the file name at compile time. Its name is based on behavior that is defined at runtime. How can I proceed?
Yehudi asked 30/1, 2018 at 11:3
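
In the Python SDK, the analogous mechanism is fileio.WriteToFiles, whose destination callback is evaluated per element at runtime; a sketch with a hypothetical bucket and made-up records:

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | beam.Create(["alice,hello", "bob,hi"])
     | fileio.WriteToFiles(
         path="gs://my-bucket/out",                    # hypothetical bucket
         destination=lambda line: line.split(",")[0],  # name decided at runtime
         sink=lambda dest: fileio.TextSink(),
         file_naming=fileio.destination_prefix_naming()))
```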

2

Solved

Say we have one worker with 4 CPU cores. How is parallelism configured on Dataflow worker machines? Do we parallelize beyond the number of cores?
Chere asked 12/12, 2017 at 16:48

2

Solved

See the code snippet below: I want ["metric1", "metric2"] to be the input of RunTask.process. However, it was run twice, with "metric1" and "metric2" respectively ...
Ungotten asked 23/7, 2020 at 8:38
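
Beam fans a PCollection out element by element, so passing the list to Create directly yields one call per string. Wrapping the list in a single element, as sketched below, makes the whole list the input of RunTask.process:

```python
import apache_beam as beam

metrics = ["metric1", "metric2"]

class RunTask(beam.DoFn):
    def process(self, element):
        print("got:", element)  # the whole list, in one call
        yield element

with beam.Pipeline() as p:
    (p
     | beam.Create([metrics])  # note [metrics]: a one-element PCollection
     | beam.ParDo(RunTask()))
```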

1

My Beam pipeline is writing to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. BigQueryIO apparently creates a temp file for every single record in the Bi...
Wittie asked 10/8, 2017 at 14:59

3

Solved

I'm specifying the Dataflow runner in my beamSql program below: DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class); options.setStagingLocation("gs://gcpbucket...
Kapoor asked 12/3, 2018 at 2:25

2

Solved

I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a sim...
Aesthetic asked 29/1, 2019 at 17:0
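
The usual pattern for alternative flows is a dead-letter output: catch the exception inside the DoFn and route failures to a tagged side output. A self-contained sketch (Create stands in for the Pub/Sub source; the sinks are only indicated in comments):

```python
import json
import apache_beam as beam
from apache_beam import pvalue

class ParseFn(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as err:
            # Route the failure to a side output instead of failing the bundle.
            yield pvalue.TaggedOutput("dead_letter", {"raw": element, "error": str(err)})

with beam.Pipeline() as p:
    results = (p
               | beam.Create(['{"a": 1}', "not json"])  # stand-in for Pub/Sub
               | beam.ParDo(ParseFn()).with_outputs("dead_letter", main="parsed"))
    results.parsed | "Good" >> beam.Map(print)      # would go to the main BigQuery table
    results.dead_letter | "Bad" >> beam.Map(print)  # would go to an error table or GCS
```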

2

Solved

I wanted to do something like: PCollection<String> a = whatever; PCollection<KV<String, User>> b = a.apply( MapElements.into(TypeDescriptor.of(KV<String, User>.class)) .v...
Moonwort asked 10/11, 2018 at 2:35

2

Solved

I didn't configure the project and I get this error whenever I run my job 'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with...
Blaine asked 30/7, 2019 at 10:3

2

I am using the Python SDK of Apache Beam. I have a few transform steps and want to make them reusable, which points me toward writing a custom composite transform like this: class MyCompositeTransform...
Blakney asked 19/12, 2018 at 6:36
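
The reusable unit in the Python SDK is a PTransform subclass whose expand method chains the steps; a minimal sketch (the transform body here is illustrative):

```python
import apache_beam as beam

class MyCompositeTransform(beam.PTransform):
    """Bundles several reusable steps behind a single apply."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def expand(self, pcoll):
        return (pcoll
                | beam.Map(lambda x: x * self.factor)
                | beam.Filter(lambda x: x > 0))

with beam.Pipeline() as p:
    p | beam.Create([1, -2, 3]) | MyCompositeTransform(10) | beam.Map(print)
```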

2

I have a pipeline that I can execute locally without any errors. I used to get this error in my locally run pipeline: 'Clients have non-trivial state that is local and unpickleable.' PicklingErr...
Cyanide asked 30/5, 2018 at 19:9
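
That error typically means a live client object ended up in the DoFn's pickled state; the common fix is to create it in setup(), which runs on the worker after unpickling. A generic sketch (http.client stands in for whatever unpickleable client is involved):

```python
import apache_beam as beam

class CallApiFn(beam.DoFn):
    def __init__(self):
        self._conn = None  # nothing unpickleable is held at submission time

    def setup(self):
        # Runs once per DoFn instance on the worker, after unpickling,
        # so the connection never has to be serialized.
        import http.client  # stand-in for any unpickleable client library
        self._conn = http.client.HTTPSConnection("example.com")

    def process(self, element):
        # self._conn is available here
        yield element
```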

0

I need to process some values in a data pipeline and need to use the value later somewhere in the program. Here is a simple example: import apache_beam as beam p = beam.Pipeline() resu=( p | b...
Arturo asked 21/3, 2020 at 11:6
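
A PCollection is a deferred value, so it cannot be used directly as a Python object; the usual route is to materialize it and read the result back after the run finishes. A small local-runner sketch:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * 10)
     | beam.io.WriteToText("out", shard_name_template=""))  # one file named "out"

# The with-block waits for the pipeline; now the result is an ordinary file.
with open("out") as f:
    print([int(line) for line in f])
```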

2

I'm trying to launch a Dataflow job on GCP using Apache Beam 0.6.0. I am compiling an uber jar using the shade plugin because I cannot launch the job using "mvn exec:java". I'm including this depend...
Appurtenant asked 21/3, 2017 at 20:13

5

Solved

Apache Beam seems to be refusing to recognise Kotlin's Iterable. Here is some sample code: @ProcessElement fun processElement( @Element input: KV<String, Iterable<String>>, receiver: Out...
Cowitch asked 29/4, 2019 at 18:32
