apache-beam Questions

2

Solved

I am trying to execute a pipeline using Apache Beam, but I get an error when trying to add some output tags: import com.google.cloud.Tuple; import com.google.gson.Gson; import com.google.gson.refle...
Hester asked 17/10, 2017 at 16:5
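The Java snippet above is cut off, but errors like this typically come from multi-output ParDos with TupleTags. As a rough, SDK-free sketch of what tagged outputs do (all names here are illustrative, not Beam API):

```python
def partition_by_tag(elements, main_tag, route):
    """Toy model of a multi-output ParDo: route(x) returns the tag name
    for each element, and each tag accumulates its own output list."""
    outputs = {main_tag: []}
    for x in elements:
        outputs.setdefault(route(x), []).append(x)
    return outputs

result = partition_by_tag(
    [1, 2, 3, 4], "even",
    lambda x: "even" if x % 2 == 0 else "odd")
```

In Beam Java the real mechanism is `ParDo.of(fn).withOutputTags(mainTag, TupleTagList.of(otherTag))`; note that tags are usually declared as anonymous subclasses (`new TupleTag<T>(){}`) so the element type survives erasure.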

3

Is there a way to dynamically scale the memory size of a Pod based on the size of a data job (my use case)? Currently we have Jobs and Pods that are defined with fixed memory amounts, but we wouldn't know how b...

2

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, it represents records that are off by 10-15 minutes from "real time" (example...
Radnorshire asked 29/5, 2018 at 7:16

2

I am trying to read and parse a JSON file in Apache Beam code. PipelineOptions options = PipelineOptionsFactory.create(); options.setRunner(SparkRunner.class); Pipeline p = Pipeline.create(options)...
Turban asked 31/5, 2018 at 15:48

1

Solved

While reading about processing streaming elements in Apache Beam using Java, I came across DoFn<InputT, OutputT> and then SimpleFunction<InputT, OutputT>. Both of these look si...
Dogcart asked 25/5, 2018 at 9:22
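The short answer: a SimpleFunction maps each input to exactly one output and has no lifecycle, while a DoFn can emit zero, one, or many outputs per element and gets setup/bundle hooks. A toy, SDK-free sketch of the contrast (class names are illustrative):

```python
class SimpleFunctionLike:
    """One output per input, nothing else -- the MapElements contract."""
    def apply(self, x):
        return x * 2

class DoFnLike:
    """Zero, one, or many outputs per input -- the ParDo contract."""
    def process(self, x):
        if x % 2 == 0:
            yield x        # a DoFn may emit...
            yield x * 10   # ...more than once, or not at all

def run_map(fn, xs):
    return [fn.apply(x) for x in xs]

def run_pardo(fn, xs):
    return [y for x in xs for y in fn.process(x)]
```

So a DoFn can also filter, fan out, or produce side outputs, none of which a SimpleFunction can express.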

1

Solved

I have a PCollection, and I would like to use a ParDo to filter out some elements from it. Is there a place where I can find an example for this?
Hardden asked 25/5, 2018 at 22:38
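In the Python SDK the idiomatic answer is `beam.Filter(predicate)`; in Java, a DoFn whose processElement only calls output() when the predicate holds. A toy, SDK-free sketch of that DoFn shape:

```python
class FilterDoFn:
    """ParDo-style filter: emit the element only when the predicate passes."""
    def __init__(self, predicate):
        self.predicate = predicate

    def process(self, element):
        if self.predicate(element):
            yield element   # dropped elements are simply never emitted

def apply_pardo(dofn, pcollection):
    return [out for el in pcollection for out in dofn.process(el)]

evens = apply_pardo(FilterDoFn(lambda x: x % 2 == 0), [1, 2, 3, 4, 5, 6])
```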

1

I'm getting this error: pickle.PicklingError: Pickling client objects is explicitly not supported. Clients have non-trivial state that is local and unpickleable. It happens when trying to use beam.ParD...
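The usual fix is to stop storing the client on the DoFn at construction time and create it lazily in setup() (or start_bundle()), so pickling the DoFn never has to serialize the client. A minimal sketch of the difference, using a thread lock as a stand-in for an unpicklable client (names are illustrative):

```python
import pickle
import threading

class UnpicklableClient:
    """Stand-in for a real API client: holds state pickle can't handle."""
    def __init__(self):
        self._lock = threading.Lock()   # locks cannot be pickled

class BadDoFn:
    def __init__(self):
        self.client = UnpicklableClient()   # pickled with the DoFn -> error

class GoodDoFn:
    def __init__(self):
        self.client = None    # only picklable config is stored up front

    def setup(self):          # Beam calls this on the worker, after unpickling
        self.client = UnpicklableClient()
```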

1

Solved

In the Apache Beam Python SDK, I often see the '>>' operator in pipeline code. https://beam.apache.org/documentation/programming-guide/#pipeline-io lines = p | 'ReadFromText' >> beam.io.ReadF...
Momentarily asked 24/5, 2018 at 23:46
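The '>>' here is plain Python operator overloading: Beam's PTransform defines `__rrshift__` so that `'Label' >> transform` returns the transform with a name attached, and `__or__` so that `p | transform` applies it. A toy reimplementation of the mechanism (none of these classes are the real Beam ones):

```python
class Labeled:
    """Result of 'Label' >> transform: just a (label, transform) pair."""
    def __init__(self, label, fn):
        self.label, self.fn = label, fn

class Transform:
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, label):   # invoked for: 'Label' >> transform
        return Labeled(label, self.fn)

class PColl:
    def __init__(self, values):
        self.values = values

    def __or__(self, other):        # invoked for: pcoll | transform
        fn = other.fn if isinstance(other, (Transform, Labeled)) else other
        return PColl([fn(v) for v in self.values])

out = PColl([1, 2, 3]) | 'Double' >> Transform(lambda x: x * 2)
```

In real Beam the label also becomes the step name, which is why it must be unique within a pipeline.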

0

What are the best practices for making HTTP calls from a DoFn, in a pipeline that will be running on Google Cloud Dataflow? (Java) I mean, in pure Java without Beam, I would need to think about things li...
Composition asked 14/5, 2018 at 17:2

2

Solved

I have an apache-beam based dataflow job to read using vcf source from a single text file (stored in google cloud storage), transform text lines into datastore Entities and write them into the data...

2

Solved

Airflow installation with the command sudo pip3 install apache-airflow[gcp_api] is failing. Everything was working fine yesterday. Today I see the following error: Could not find a version that sati...

3

I'm struggling to use JdbcIO with Apache Beam 2.0 (Java) to connect to a Cloud SQL instance from Dataflow within the same project. I'm getting the following error: java.sql.SQLException: Cannot c...
Gillie asked 22/6, 2017 at 12:38

1

Solved

What is the difference between these two annotations? DoFn.Setup Annotation for the method to use to prepare an instance for processing bundles of elements. Uses the word "bundle", takes zero arg...
Mills asked 31/8, 2017 at 16:2
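The practical difference: @Setup runs once per DoFn instance (the place to open expensive clients), while @StartBundle runs before every bundle of elements. A toy runner showing the firing order (method names mirror Beam's, but this is not SDK code):

```python
class LifecycleDoFn:
    def __init__(self):
        self.calls = []

    def setup(self):            # once per instance (per worker)
        self.calls.append('setup')

    def start_bundle(self):     # once per bundle of elements
        self.calls.append('start_bundle')

    def process(self, element):
        self.calls.append(f'process:{element}')

    def finish_bundle(self):
        self.calls.append('finish_bundle')

def run_bundles(dofn, bundles):
    """Toy runner: setup once, then start/finish around each bundle."""
    dofn.setup()
    for bundle in bundles:
        dofn.start_bundle()
        for el in bundle:
            dofn.process(el)
        dofn.finish_bundle()
```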

1

Solved

We are working on an Apache Beam project (version 2.4.0) where we also want to work with a bucket directly through the google-cloud-storage API. However, combining some of the beam dependencies wit...

1

Solved

So I've read both Beam's stateful processing and timely processing articles, and ran into issues implementing the functions. The problem I am trying to solve is something similar to this t...
Preconceive asked 25/4, 2018 at 21:14

1

Solved

One of the things I've noticed is that BigQueryIO.read().fromQuery() performs noticeably worse than BigQueryIO.read().from() in Apache Beam. Why does this happen? And is...
Etui asked 18/4, 2018 at 11:21

2

Solved

When starting a dataflow job (v.2.4.0) via a jar with all dependencies included, instead of using the provided GCS path, it seems that a gs:/ folder is created locally, and because of this the data...
Wanda asked 4/4, 2018 at 15:20

1

I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow The answer in the above link is in Java, whereas the language I'm working with is Python. Thus, I require some hel...
Andresandresen asked 26/3, 2018 at 15:36
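A common Python approach is a DoFn that buffers elements and emits fixed-size batches, flushing the remainder in finish_bundle. A sketch of that shape outside the SDK (a real Beam version would also need to re-timestamp and window the flushed batch):

```python
class BatchingDoFn:
    """Accumulate elements and emit them as fixed-size batches."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            yield batch

    def finish_bundle(self):
        if self.buffer:                 # flush the incomplete final batch
            batch, self.buffer = self.buffer, []
            yield batch

def run(dofn, elements):
    out = [b for el in elements for b in dofn.process(el)]
    out.extend(dofn.finish_bundle())
    return out
```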

1

Solved

I'm having huge performance issues with Datastore write speed. Most of the time it stays under 100 elements/s. I was able to achieve speeds of around 2600 elements/s when benchmarking the wr...

1

Solved

We've found experimentally that setting an explicit # of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow secretly does another...
Industrials asked 27/3, 2018 at 18:22

1

Solved

I'm creating sliding time windows 20 seconds long every 5 seconds from batched log data: rows = p | 'read events' >> beam.io.Read(beam.io.BigQuerySource(query=query)) # set timestamp fie...
Negrito asked 15/9, 2017 at 13:1
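With sliding windows of size 20s emitted every 5s, each timestamp lands in size/period = 4 overlapping windows. The assignment rule can be sketched directly (pure Python, not Beam's implementation):

```python
def sliding_windows(timestamp, size=20, period=5):
    """All (start, end) windows containing `timestamp` for sliding
    windows of `size` seconds, starting every `period` seconds."""
    last_start = timestamp - (timestamp % period)  # newest window holding t
    starts = []
    s = last_start
    while s > timestamp - size:   # walk back until windows no longer cover t
        starts.append(s)
        s -= period
    return [(s, s + size) for s in sorted(starts)]
```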

2

Solved

How do I access the elements of a side input if my class extends DoFn? For example, say I have a ParDo transform like: PCollection<String> data = myData.apply("Get data", ParDo.of(n...
Dipterous asked 2/8, 2017 at 14:1
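In Beam Python the side input is passed as an extra argument to process(); in Java you call c.sideInput(view) inside processElement. A toy model of that contract (class and function names are illustrative):

```python
class JoinWithSide:
    """Sketch: the side input is a whole, read-only view handed to each
    process() call alongside the main-input element."""
    def process(self, element, side):
        yield (element, side.get(element, 'missing'))

def apply_with_side(dofn, main, side):
    return [out for el in main for out in dofn.process(el, side)]

pairs = apply_with_side(JoinWithSide(), ['a', 'b'], {'a': 1})
```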

2

Solved

When using Apache Beam to enrich data, would it be wrong to make an API call for each data item? (I'm new to Apache Beam.)
Honourable asked 26/7, 2017 at 5:20

2

This question is a follow-up to this one. I am trying to use apache beam to read data from a google spanner table (and then do some data processing). I wrote the following minimum example using the...
Swedenborgianism asked 11/10, 2017 at 9:1

1

Solved

I'm writing a dataflow transform that uses org.apache.beam.sdk.state.MapState to implement caching functionality. However, upon introducing MapState, the unit test starts to fail. Th...
Clarettaclarette asked 14/2, 2018 at 22:40
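Conceptually, MapState is a per-key map the runner persists between calls. A dict-backed sketch of the caching pattern it enables (ignoring the keying and windowing a real MapState requires; names are illustrative):

```python
class CachingDoFn:
    """Consult a per-key cache before performing the expensive lookup."""
    def __init__(self, lookup):
        self.lookup = lookup
        self.cache = {}     # stand-in for MapState, persisted per key/window
        self.misses = 0

    def process(self, key):
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.lookup(key)
        yield (key, self.cache[key])
```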

© 2022 - 2024 — McMap. All rights reserved.