apache-beam Questions
2
I have a Google Cloud Dataflow job that I'm running from IntelliJ IDEA using the following command string:
compile exec:java -Dexec.mainClass=com.mygroup.mainclass "-Dexec.args=--..."
It runs fi...
Nichani asked 27/7, 2017 at 17:12
4
Solved
I'm trying to give useful information but I am far from being a data engineer.
I am currently using the Python library pandas to execute a long series of transformations on my data, which has ...
Injured asked 9/5, 2018 at 9:19
2
Solved
I have a simple pipeline in the Dataflow 2.1 SDK, which reads data from Pub/Sub and then applies a DoFn to it.
PCollection<MyClass> e = streamData.apply("ToE", ParDo.of(new MyDoFNClass()));
Gettin...
Lewak asked 8/12, 2017 at 0:34
3
Solved
How do I implement Pandas-style operations in Apache Beam?
I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not properly framed. I checke...
Citron asked 15/2, 2018 at 12:0
2
Solved
I have a Beam application that runs successfully locally with the DirectRunner and gives me all the log information I have in my code on my local console. But when I tried running it in the Google Clou...
Exmoor asked 16/9, 2017 at 17:38
1
(Apache Beam) Cannot increase executor memory - it is fixed at 1024M despite using multiple settings
I am running an Apache Beam workload on Spark. I initialized the workers with 32 GB of memory (slaves run with -c 2 -m 32G). spark-submit sets driver memory to 30g and executor memory to 16g. However...
Nature asked 22/10, 2020 at 18:3
3
Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains one line. I am trying to make Apache Beam read the input as one line, but cannot ...
Cabinetwork asked 19/4, 2018 at 5:7
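ReadFromText splits input purely on newline characters, so a quoted CSV field containing a newline arrives as two separate elements. One workaround is to bypass ReadFromText for such files (e.g. read whole files and parse them yourself), since Python's csv module understands quoted newlines. A minimal illustration of the parsing half:

```python
import csv
import io

# One logical CSV record whose quoted field spans two physical lines.
# ReadFromText would hand the pipeline two separate elements here, because
# it splits purely on newlines; the csv module honours the quotes instead.
data = 'id,comment\n1,"line one\nline two"\n'

rows = list(csv.reader(io.StringIO(data)))
# rows[1] == ['1', 'line one\nline two']: the record survives as one row.
```

In a pipeline this parsing would typically happen inside a DoFn that receives whole file contents (for instance via fileio matching), rather than per-line elements.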
2
Solved
I'm trying to set up my development environment. Instead of using Google Cloud Pub/Sub in production, I've been using the Pub/Sub emulator for development and testing. To achieve this I set the follow...
Penstock asked 11/4, 2017 at 19:25
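The Google Cloud Pub/Sub client libraries switch to the emulator when the `PUBSUB_EMULATOR_HOST` environment variable is set, so development code typically exports it before creating any client. A sketch; the host and port must match whatever `gcloud beta emulators pubsub start` printed:

```python
import os

# Point every subsequently created Pub/Sub client at the local emulator
# instead of the production endpoint; unset the variable to go back to
# production. "localhost:8085" is the emulator's default address here.
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"
```

Setting the variable in the shell (or via `gcloud beta emulators pubsub env-init`) before launching the pipeline achieves the same effect without touching code.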
1
I experience unexpected performance issues when writing to BigQuery with streaming inserts and Python SDK 2.23.
Without the write step the pipeline runs on one worker with ~20-30% CPU. Adding the B...
Die asked 9/9, 2020 at 7:28
2
Solved
We want to reduce the cost of running a specific Apache Beam pipeline (Python SDK) in GCP Dataflow.
We have built a memory-intensive Apache Beam pipeline, which requires approximately 8.5 GB of R...
Chuff asked 2/9, 2020 at 12:35
0
The structure of my_dir is
├── README.md
├── main
│ ├── functions
│ │ ├── __pycache__
│ │ ├── my_function.py
│ ├── pipeline.py
│ ├── options
│ │ └── pipeline_options.py
│ └── tr...
Ambrogio asked 2/9, 2020 at 14:8
0
Is there a Python library that converts Avro schemas to BigQuery schemas?
I noticed that the Java SDK for Apache Beam has a utility that converts from Avro to BigQuery. However, the Python SDK for ...
Cementum asked 17/8, 2020 at 16:27
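For primitive types the mapping is mechanical, so even without a library a small converter can be sketched by hand. This is a hypothetical, minimal version that handles only primitives and nullable unions, ignoring records, arrays, and logical types:

```python
# Minimal mapping from Avro primitive types to BigQuery field types.
AVRO_TO_BQ = {
    "string": "STRING", "bytes": "BYTES", "int": "INTEGER", "long": "INTEGER",
    "float": "FLOAT", "double": "FLOAT", "boolean": "BOOLEAN",
}

def avro_field_to_bq(field):
    """Convert one Avro record field to a BigQuery schema field dict."""
    ftype = field["type"]
    if isinstance(ftype, list):  # e.g. ["null", "string"] means NULLABLE
        nonnull = [t for t in ftype if t != "null"]
        return {"name": field["name"], "type": AVRO_TO_BQ[nonnull[0]],
                "mode": "NULLABLE"}
    return {"name": field["name"], "type": AVRO_TO_BQ[ftype], "mode": "REQUIRED"}

def avro_schema_to_bq(avro_schema):
    """Convert an Avro record schema dict to a list of BigQuery field dicts."""
    return [avro_field_to_bq(f) for f in avro_schema["fields"]]
```

A production version would also need to recurse into nested records (BigQuery `RECORD` fields) and map Avro logical types such as `timestamp-millis`.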
4
I want to write to a GCS file, but I don't know the file name at compile time. Its name is based on behavior that is defined at runtime. How can I proceed?
Yehudi asked 30/1, 2018 at 11:3
2
Solved
Say we have one worker with 4 CPU cores. How is parallelism configured on Dataflow worker machines? Do we parallelize beyond the number of cores?
Chere asked 12/12, 2017 at 16:48
2
Solved
See the code snippet below.
I want ["metric1", "metric2"] to be the input to RunTask.process. However, it is run twice, with "metric1" and "metric2" respectively.
...
Ungotten asked 23/7, 2020 at 8:38
1
My Beam pipeline is writing to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. BigQueryIO apparently creates a temp file for every single record in the Bi...
Wittie asked 10/8, 2017 at 14:59
3
Solved
I'm specifying the Dataflow runner in my BeamSql program below:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setStagingLocation("gs://gcpbucket...
Kapoor asked 12/3, 2018 at 2:25
2
Solved
I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows.
On a sim...
Aesthetic asked 29/1, 2019 at 17:0
2
Solved
I wanted to do something like:
PCollection<String> a = whatever;
PCollection<KV<String, User>> b = a.apply(
MapElements.into(TypeDescriptor.of(KV<String, User>.class))
.v...
Moonwort asked 10/11, 2018 at 2:35
2
Solved
I didn't configure the project and I get this error whenever I run my job 'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with...
Blaine asked 30/7, 2019 at 10:3
2
I am using the Python SDK of Apache Beam.
I have a few transform steps and want to make them reusable, which points me toward writing a custom composite transform like this:
class MyCompositeTransform...
Blakney asked 19/12, 2018 at 6:36
2
I have a pipeline that I can execute locally without any errors. I used to get this error in my locally run pipeline
'Clients have non-trivial state that is local and unpickleable.'
PicklingErr...
Cyanide asked 30/5, 2018 at 19:9
0
I need to process some values in a data pipeline and need to use the value later somewhere in the program.
Here is a simple example
import apache_beam as beam
p = beam.Pipeline()
resu=(
p
| b...
Arturo asked 21/3, 2020 at 11:6
2
I'm trying to launch a Dataflow job on GCP using Apache Beam 0.6.0. I am compiling an uber jar using the shade plugin because I cannot launch the job using "mvn exec:java". I'm including this depend...
Appurtenant asked 21/3, 2017 at 20:13
5
Solved
Apache Beam seems to be refusing to recognise Kotlin's Iterable. Here is some sample code:
@ProcessElement
fun processElement(
@Element input: KV<String, Iterable<String>>, receiver: Out...
Cowitch asked 29/4, 2019 at 18:32
© 2022 - 2024 — McMap. All rights reserved.