apache-beam Questions

2

Solved

I am trying to execute a pipeline using Apache Beam, but I get an error when trying to add some output tags: import com.google.cloud.Tuple; import com.google.gson.Gson; import com.google.gson.refle...
Hester asked 17/10, 2017 at 16:5
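The Java snippet above is cut off, but errors like this typically come from multi-output ParDos with TupleTags. As a rough, SDK-free sketch of what tagged outputs do (all names here are illustrative, not Beam API):

```python
def partition_by_tag(elements, main_tag, route):
    """Toy model of a multi-output ParDo: route(x) returns the tag name
    for each element, and each tag accumulates its own output list."""
    outputs = {main_tag: []}
    for x in elements:
        outputs.setdefault(route(x), []).append(x)
    return outputs

result = partition_by_tag(
    [1, 2, 3, 4], "even",
    lambda x: "even" if x % 2 == 0 else "odd")
```

In Beam Java the real mechanism is `ParDo.of(fn).withOutputTags(mainTag, TupleTagList.of(otherTag))`; note that tags are usually declared as anonymous subclasses (`new TupleTag<T>(){}`) so the element type survives erasure.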

3

Is there a way to dynamically scale the memory size of a Pod based on the size of a data job (my use case)? Currently we have Jobs and Pods that are defined with fixed memory amounts, but we wouldn't know how b...

2

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, it represents records that are off by 10-15 minutes from "real time" (example...
Radnorshire asked 29/5, 2018 at 7:16

2

I am trying to read and parse a JSON file in Apache Beam code. PipelineOptions options = PipelineOptionsFactory.create(); options.setRunner(SparkRunner.class); Pipeline p = Pipeline.create(options)...
Turban asked 31/5, 2018 at 15:48

1

Solved

While reading about processing streaming elements in Apache Beam using Java, I came across DoFn<InputT, OutputT> and then SimpleFunction<InputT, OutputT>. Both of these look si...
Dogcart asked 25/5, 2018 at 9:22
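The short answer: a SimpleFunction maps each input to exactly one output and has no lifecycle, while a DoFn can emit zero, one, or many outputs per element and gets setup/bundle hooks. A toy, SDK-free sketch of the contrast (class names are illustrative):

```python
class SimpleFunctionLike:
    """One output per input, nothing else -- the MapElements contract."""
    def apply(self, x):
        return x * 2

class DoFnLike:
    """Zero, one, or many outputs per input -- the ParDo contract."""
    def process(self, x):
        if x % 2 == 0:
            yield x        # a DoFn may emit...
            yield x * 10   # ...more than once, or not at all

def run_map(fn, xs):
    return [fn.apply(x) for x in xs]

def run_pardo(fn, xs):
    return [y for x in xs for y in fn.process(x)]
```

So a DoFn can also filter, fan out, or produce side outputs, none of which a SimpleFunction can express.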

1

Solved

I have a PCollection, and I would like to use a ParDo to filter out some elements from it. Is there a place where I can find an example for this?
Hardden asked 25/5, 2018 at 22:38
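In the Python SDK the idiomatic answer is `beam.Filter(predicate)`; in Java, a DoFn whose processElement only calls output() when the predicate holds. A toy, SDK-free sketch of that DoFn shape:

```python
class FilterDoFn:
    """ParDo-style filter: emit the element only when the predicate passes."""
    def __init__(self, predicate):
        self.predicate = predicate

    def process(self, element):
        if self.predicate(element):
            yield element   # dropped elements are simply never emitted

def apply_pardo(dofn, pcollection):
    return [out for el in pcollection for out in dofn.process(el)]

evens = apply_pardo(FilterDoFn(lambda x: x % 2 == 0), [1, 2, 3, 4, 5, 6])
```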

1

I'm getting this error: pickle.PicklingError: Pickling client objects is explicitly not supported. Clients have non-trivial state that is local and unpickleable. It happens when trying to use beam.ParD...
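The usual fix is to stop storing the client on the DoFn at construction time and create it lazily in setup() (or start_bundle()), so pickling the DoFn never has to serialize the client. A minimal sketch of the difference, using a thread lock as a stand-in for an unpicklable client (names are illustrative):

```python
import pickle
import threading

class UnpicklableClient:
    """Stand-in for a real API client: holds state pickle can't handle."""
    def __init__(self):
        self._lock = threading.Lock()   # locks cannot be pickled

class BadDoFn:
    def __init__(self):
        self.client = UnpicklableClient()   # pickled with the DoFn -> error

class GoodDoFn:
    def __init__(self):
        self.client = None    # only picklable config is stored up front

    def setup(self):          # Beam calls this on the worker, after unpickling
        self.client = UnpicklableClient()
```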

1

Solved

In the Apache Beam Python SDK, I often see the '>>' operator in pipeline code. https://beam.apache.org/documentation/programming-guide/#pipeline-io lines = p | 'ReadFromText' >> beam.io.ReadF...
Momentarily asked 24/5, 2018 at 23:46
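The '>>' here is plain Python operator overloading: Beam's PTransform defines `__rrshift__` so that `'Label' >> transform` returns the transform with a name attached, and `__or__` so that `p | transform` applies it. A toy reimplementation of the mechanism (none of these classes are the real Beam ones):

```python
class Labeled:
    """Result of 'Label' >> transform: just a (label, transform) pair."""
    def __init__(self, label, fn):
        self.label, self.fn = label, fn

class Transform:
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, label):   # invoked for: 'Label' >> transform
        return Labeled(label, self.fn)

class PColl:
    def __init__(self, values):
        self.values = values

    def __or__(self, other):        # invoked for: pcoll | transform
        fn = other.fn if isinstance(other, (Transform, Labeled)) else other
        return PColl([fn(v) for v in self.values])

out = PColl([1, 2, 3]) | 'Double' >> Transform(lambda x: x * 2)
```

In real Beam the label also becomes the step name, which is why it must be unique within a pipeline.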

0

What are the best practices for making HTTP calls from a DoFn, in a pipeline that will be running on Google Cloud Dataflow? (Java) I mean, in pure Java without Beam, I would need to think about things li...
Composition asked 14/5, 2018 at 17:2

2

Solved

I have an apache-beam based dataflow job to read using vcf source from a single text file (stored in google cloud storage), transform text lines into datastore Entities and write them into the data...

2

Solved

Airflow installation with the command sudo pip3 install apache-airflow[gcp_api] is failing. Everything was working fine yesterday. Today I see the following error: Could not find a version that sati...

3

I'm struggling to use JdbcIO with Apache Beam 2.0 (Java) to connect to a Cloud SQL instance from Dataflow within the same project. I'm getting the following error: java.sql.SQLException: Cannot c...
Gillie asked 22/6, 2017 at 12:38

1

Solved

What is the difference between these two annotations? DoFn.Setup Annotation for the method to use to prepare an instance for processing bundles of elements. Uses the word "bundle", takes zero arg...
Mills asked 31/8, 2017 at 16:2
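The practical difference: @Setup runs once per DoFn instance (the place to open expensive clients), while @StartBundle runs before every bundle of elements. A toy runner showing the firing order (method names mirror Beam's, but this is not SDK code):

```python
class LifecycleDoFn:
    def __init__(self):
        self.calls = []

    def setup(self):            # once per instance (per worker)
        self.calls.append('setup')

    def start_bundle(self):     # once per bundle of elements
        self.calls.append('start_bundle')

    def process(self, element):
        self.calls.append(f'process:{element}')

    def finish_bundle(self):
        self.calls.append('finish_bundle')

def run_bundles(dofn, bundles):
    """Toy runner: setup once, then start/finish around each bundle."""
    dofn.setup()
    for bundle in bundles:
        dofn.start_bundle()
        for el in bundle:
            dofn.process(el)
        dofn.finish_bundle()
```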

1

Solved

We are working on an Apache Beam project (version 2.4.0) where we also want to work with a bucket directly through the google-cloud-storage API. However, combining some of the beam dependencies wit...

1

Solved

So I've read both Beam's stateful processing and timely processing articles, and ran into issues implementing the functions. The problem I am trying to solve is something similar to this t...
Preconceive asked 25/4, 2018 at 21:14

1

Solved

One of the things I've noticed is that BigQueryIO.read().fromQuery() performs noticeably worse than BigQueryIO.read().from() in Apache Beam. Why does this happen? And is...
Etui asked 18/4, 2018 at 11:21

2

Solved

When starting a dataflow job (v.2.4.0) via a jar with all dependencies included, instead of using the provided GCS path, it seems that a gs:/ folder is created locally, and because of this the data...
Wanda asked 4/4, 2018 at 15:20

1

I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow The answer in the above link is in Java, whereas the language I'm working with is Python. Thus, I require some hel...
Andresandresen asked 26/3, 2018 at 15:36
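A common Python approach is a DoFn that buffers elements and emits fixed-size batches, flushing the remainder in finish_bundle. A sketch of that shape outside the SDK (a real Beam version would also need to re-timestamp and window the flushed batch):

```python
class BatchingDoFn:
    """Accumulate elements and emit them as fixed-size batches."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            yield batch

    def finish_bundle(self):
        if self.buffer:                 # flush the incomplete final batch
            batch, self.buffer = self.buffer, []
            yield batch

def run(dofn, elements):
    out = [b for el in elements for b in dofn.process(el)]
    out.extend(dofn.finish_bundle())
    return out
```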

1

Solved

I'm having huge performance issues with Datastore write speed. Most of the time it stays under 100 elements/s. I was able to achieve speeds of around 2600 elements/s when benchmarking the wr...

1

Solved

We've found experimentally that setting an explicit # of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow secretly does another...
Industrials asked 27/3, 2018 at 18:22

1

Solved

I'm creating sliding time windows 20 seconds long every 5 seconds from batched log data: rows = p | 'read events' >> beam.io.Read(beam.io.BigQuerySource(query=query)) # set timestamp fie...
Negrito asked 15/9, 2017 at 13:1
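With sliding windows of size 20s emitted every 5s, each timestamp lands in size/period = 4 overlapping windows. The assignment rule can be sketched directly (pure Python, not Beam's implementation):

```python
def sliding_windows(timestamp, size=20, period=5):
    """All (start, end) windows containing `timestamp` for sliding
    windows of `size` seconds, starting every `period` seconds."""
    last_start = timestamp - (timestamp % period)  # newest window holding t
    starts = []
    s = last_start
    while s > timestamp - size:   # walk back until windows no longer cover t
        starts.append(s)
        s -= period
    return [(s, s + size) for s in sorted(starts)]
```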

2

Solved

How do I access the elements of a side input if my class extends DoFn? For example, say I have a ParDo transform like: PCollection<String> data = myData.apply("Get data", ParDo.of(n...
Dipterous asked 2/8, 2017 at 14:1
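In Beam Python the side input is passed as an extra argument to process(); in Java you call c.sideInput(view) inside processElement. A toy model of that contract (class and function names are illustrative):

```python
class JoinWithSide:
    """Sketch: the side input is a whole, read-only view handed to each
    process() call alongside the main-input element."""
    def process(self, element, side):
        yield (element, side.get(element, 'missing'))

def apply_with_side(dofn, main, side):
    return [out for el in main for out in dofn.process(el, side)]

pairs = apply_with_side(JoinWithSide(), ['a', 'b'], {'a': 1})
```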

2

Solved

When using Apache Beam to enrich data, would it be wrong to make an API call for each data item? (I'm new to Apache Beam.)
Honourable asked 26/7, 2017 at 5:20

2

This question is a follow-up to this one. I am trying to use apache beam to read data from a google spanner table (and then do some data processing). I wrote the following minimum example using the...
Swedenborgianism asked 11/10, 2017 at 9:1

1

Solved

I'm writing a dataflow transform that uses org.apache.beam.sdk.state.MapState to implement caching functionality. However, upon introducing MapState, the unit test starts to fail. Th...
Clarettaclarette asked 14/2, 2018 at 22:40
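Conceptually, MapState is a per-key map the runner persists between calls. A dict-backed sketch of the caching pattern it enables (ignoring the keying and windowing a real MapState requires; names are illustrative):

```python
class CachingDoFn:
    """Consult a per-key cache before performing the expensive lookup."""
    def __init__(self, lookup):
        self.lookup = lookup
        self.cache = {}     # stand-in for MapState, persisted per key/window
        self.misses = 0

    def process(self, key):
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.lookup(key)
        yield (key, self.cache[key])
```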

© 2022 - 2024 — McMap. All rights reserved.