apache-beam Questions

1

Solved

When I try to run a test pipeline it raises an error. Here is the source code that creates the test pipeline: val p: TestPipeline = TestPipeline.create() and here is the error: java.lang.Illeg...
Emarginate asked 5/8, 2019 at 16:4
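
If the truncated error is the common "Is your TestPipeline declaration missing a @Rule annotation?" java.lang.IllegalStateException, the usual cause is creating the TestPipeline outside a JUnit @Rule, or never actually running it. A minimal JUnit 4 sketch in the Java SDK (the same fix applies from Kotlin):

    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.junit.Rule;
    import org.junit.Test;

    public class MyPipelineTest {
      // TestPipeline must be exposed as a public final transient @Rule;
      // without it, its enforcement fails at test time with an
      // IllegalStateException.
      @Rule
      public final transient TestPipeline p = TestPipeline.create();

      @Test
      public void pipelineRuns() {
        p.apply(Create.of("a", "b", "c"));
        // The rule also verifies the pipeline is actually executed.
        p.run().waitUntilFinish();
      }
    }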

1

Solved

I'm hoping someone can clarify the relationship between TensorFlow and its dependencies (Beam, Airflow, Flink, etc.). I'm referencing the main TFX page: https://www.tensorflow.org/tfx/guide#creating_...
Jacintajacinth asked 17/5, 2019 at 20:6

0

I have a Dataflow pipeline in which I consume Pub/Sub messages, process them, and then publish back to Pub/Sub. Whenever I have too many calculations (i.e. I increase the amount of processing for each message...
Crosslegged asked 18/7, 2019 at 13:27

1

Solved

I have a general question on side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement need to be p...
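
For reference, small lookup data is typically broadcast as a side input: materialize it as a PCollectionView and declare it on the ParDo. A sketch assuming hypothetical events and lookups collections:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    static PCollection<String> enrich(
        PCollection<String> events, PCollection<KV<String, String>> lookups) {
      // Materialize the lookup pairs as a broadcast map view.
      PCollectionView<Map<String, String>> lookupView =
          lookups.apply(View.<String, String>asMap());
      return events.apply(
          ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void process(ProcessContext c) {
              // Side inputs are read per element via the view handle.
              Map<String, String> lookup = c.sideInput(lookupView);
              c.output(c.element() + "," + lookup.getOrDefault(c.element(), "?"));
            }
          }).withSideInputs(lookupView));
    }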

5

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>, and I want to write the values to different files corresponding to their ...
Philip asked 8/4, 2017 at 6:46
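
Beam 0.6.0 predates a built-in answer, but in more recent Beam Java releases the standard approach is FileIO.writeDynamic, which partitions output files by a destination key. A sketch with hypothetical paths:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    static void writeByKey(PCollection<KV<String, String>> results) {
      results.apply(FileIO.<String, KV<String, String>>writeDynamic()
          .by(KV::getKey)                                   // destination = the element's key
          .via(Contextful.fn(KV::getValue), TextIO.sink())  // write only the value as a text line
          .to("gs://my-bucket/output")                      // hypothetical base directory
          .withDestinationCoder(StringUtf8Coder.of())
          .withNaming(key -> FileIO.Write.defaultNaming("out-" + key, ".txt")));
    }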

3

Solved

What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python. Is the best way ...
Coloration asked 28/10, 2018 at 17:40

3

I have a CSV file, and I don't know the column names ahead of time. I need to output the data in JSON after some transformations in Google Dataflow. What's the best way to take the header row and ...
Tetracycline asked 23/12, 2016 at 8:21
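
Since Beam reads text lines in arbitrary order, one common workaround is to read the header row outside the pipeline (e.g., in the launcher program) and hand the column names to a DoFn. A rough sketch; the naive comma splitting and absent JSON escaping are simplifying assumptions:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    // Joins each CSV line with header names captured at pipeline
    // construction time.
    class CsvToJsonFn extends DoFn<String, String> {
      private final List<String> headers;  // must be serializable, e.g. an ArrayList

      CsvToJsonFn(List<String> headers) { this.headers = headers; }

      @ProcessElement
      public void process(ProcessContext c) {
        String[] values = c.element().split(",", -1);
        if (Arrays.asList(values).equals(headers)) {
          return;  // skip the header row itself
        }
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < headers.size() && i < values.length; i++) {
          if (i > 0) json.append(",");
          json.append('"').append(headers.get(i)).append("\":\"").append(values[i]).append('"');
        }
        c.output(json.append('}').toString());
      }
    }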

1

Solved

I'm using TextIO for reading from Cloud Storage. As I want to have the job running continuously, I use watchForNewFiles. For completeness, the data I read works fine if I use bounded PCollecti...
Slacks asked 12/11, 2018 at 16:50
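
For reference, the continuous-read variant looks like this in the Java SDK; the glob and polling interval are assumptions. Note that the resulting PCollection becomes unbounded, so downstream aggregations need windowing:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Watch;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<String> readContinuously(Pipeline p) {
      return p.apply(TextIO.read()
          .from("gs://my-bucket/input/*.csv")      // hypothetical glob
          .watchForNewFiles(
              Duration.standardSeconds(30),        // polling interval
              Watch.Growth.never()));              // keep watching forever
    }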

2

Solved

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo. How can I do this using the Apache Beam BigQueryIO ...
Invalidity asked 19/4, 2017 at 20:32
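
One way is the BigQueryIO.Write.to(...) overload that takes a destination function (DynamicDestinations offers finer control). A sketch in which Foo, its getters, and the schema are assumptions from the question:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.PCollection;

    static void writePerFooTable(PCollection<Foo> foos, TableSchema schema) {
      foos.apply(BigQueryIO.<Foo>write()
          // The function sees each element (with its window) and picks a table.
          .to(v -> new TableDestination(
              "my-project:my_dataset.table_" + v.getValue().getKind(), null))
          .withFormatFunction(foo -> new TableRow().set("name", foo.getName()))
          .withSchema(schema)
          .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    }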

1

Solved

How should one implement the following logic located at https://beam.apache.org/documentation/pipelines/design-your-pipeline/: // Merge the two PCollections with Flatten PCollectionList<Str...
Justen asked 16/5, 2019 at 18:30
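
The linked example merges two same-type PCollections via PCollectionList and Flatten (both collections must share the same windowing). A minimal sketch with hypothetical pc1 and pc2:

    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    static PCollection<String> merge(PCollection<String> pc1, PCollection<String> pc2) {
      // Merge the two PCollections with Flatten.
      PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2);
      return collections.apply(Flatten.pCollections());
    }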

1

Solved

Context: I'm working with a streaming pipeline which has a protobuf data source in Pub/Sub. I wish to parse this protobuf into a Python dict because the data sink requires the input to be a collectio...

1

I'm getting an unexpected error when I try to use the output of beam.combiners.ToList as the input of beam.pvalue.AsSingleton or beam.pvalue.AsList in order to experiment with side inputs. I was ab...
Vivavivace asked 14/4, 2019 at 14:29

2

SDK: Apache Beam SDK for Go 0.5.0. Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version seems to be the same a...
Ambassadress asked 12/12, 2018 at 2:15

1

Solved

My goal is to be able to access PubSub message Publish Time as recorded and set by Google PubSub in Apache Beam (Dataflow). PCollection<PubsubMessage> pubsubMsg = pipeline.apply("Read Mess...
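
Unless withTimestampAttribute(...) overrides it, PubsubIO assigns each element the Pub/Sub publish time as its event timestamp, so it can be read back as the element timestamp inside a DoFn. A sketch with a hypothetical subscription:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.joda.time.Instant;

    static void readWithPublishTime(Pipeline pipeline) {
      pipeline
          .apply(PubsubIO.readMessagesWithAttributes()
              .fromSubscription("projects/my-project/subscriptions/my-sub"))
          .apply(ParDo.of(new DoFn<PubsubMessage, String>() {
            @ProcessElement
            public void process(ProcessContext c) {
              // Element timestamp = Pub/Sub publish time by default.
              Instant publishTime = c.timestamp();
              c.output(publishTime.toString());
            }
          }));
    }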

2

Solved

My objective is to read Avro file data from Cloud Storage and write it to a BigQuery table using Java. It would be good if someone could provide a code snippet/ideas to read Avro-format data and writ...
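
A common shape for this: read GenericRecords with AvroIO, map them to TableRows, and write with BigQueryIO. The Avro schema, field names, and table spec below are assumptions:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    static void avroToBigQuery(Pipeline p, Schema avroSchema, TableSchema tableSchema) {
      p.apply(AvroIO.readGenericRecords(avroSchema).from("gs://my-bucket/data/*.avro"))
          .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
              .via((GenericRecord r) -> new TableRow()
                  .set("id", r.get("id").toString())       // hypothetical fields
                  .set("name", r.get("name").toString())))
          .apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.my_table")
              .withSchema(tableSchema)
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
    }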

1

Solved

We are currently working on a streaming pipeline on Apache Beam with DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them in sliding wi...
Septarium asked 20/2, 2019 at 10:30
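
For reference, sliding windows in the Java SDK are declared with a size and a period; here, 5-minute windows starting every minute, so each element belongs to five overlapping windows:

    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<String> window(PCollection<String> messages) {
      return messages.apply(Window.<String>into(
          SlidingWindows.of(Duration.standardMinutes(5))   // window size
              .every(Duration.standardMinutes(1))));       // new window each minute
    }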

1

Task: I am to run an ETL job that will Extract TIFF images from GCS, Transform those images to text with a combination of open source computer vision tools such as OpenCV + Tesseract and ultimately...
Depredate asked 20/11, 2018 at 15:12
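
A rough DoFn skeleton for the transform step, assuming files are matched with FileIO and OCR'd with Tess4J (a Java wrapper around Tesseract); the local-copy step and library usage are untested assumptions, not a recipe:

    import java.io.File;
    import java.nio.file.Files;
    import net.sourceforge.tess4j.Tesseract;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;

    class OcrFn extends DoFn<FileIO.ReadableFile, String> {
      private transient Tesseract tesseract;

      @Setup
      public void setup() {
        tesseract = new Tesseract();  // expensive init, once per DoFn instance
      }

      @ProcessElement
      public void process(ProcessContext c) throws Exception {
        // Tesseract wants a local file, so stage the GCS bytes to a temp file.
        File local = File.createTempFile("page", ".tiff");
        Files.write(local.toPath(), c.element().readFullyAsBytes());
        try {
          c.output(tesseract.doOCR(local));  // extracted text
        } finally {
          local.delete();
        }
      }
    }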

1

I am streaming data from Kafka to BigQuery using Apache Beam with the Google Dataflow runner. I wanted to make use of insertId for deduplication, which I found described in the Google docs. But even though ins...
Evolve asked 25/9, 2017 at 8:53

1

Problem: When using Cloud Dataflow, we are presented with two metrics (see this page): system latency and data freshness. These are also available in Stackdriver under the following names (extract from here)...
Felly asked 7/3, 2019 at 17:36

2

Solved

We are building a pipeline using Apache Beam and DirectRunner as the runner. We are currently attempting a simple pipeline whereby we: Pull data from Google Cloud Pub/Sub (currently using the emu...
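
For what it's worth, the Java SDK can be pointed at the emulator by overriding the Pub/Sub root URL on the pipeline options (host, port, and resource names are assumptions; DirectRunner is the default runner):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class EmulatorPipeline {
      public static void main(String[] args) {
        PubsubOptions options =
            PipelineOptionsFactory.fromArgs(args).as(PubsubOptions.class);
        options.setPubsubRootUrl("http://localhost:8085");  // local emulator endpoint
        Pipeline p = Pipeline.create(options);
        p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/local-project/subscriptions/local-sub"));
        p.run().waitUntilFinish();
      }
    }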

2

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this: The first step is doing some basic process...
Mazel asked 19/2, 2019 at 10:31

1

Solved

I'm setting up a slow-changing lookup map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with acc...
Morelos asked 29/1, 2019 at 13:46
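
The usual shape for this is the "slowly updating side input" pattern: a ticking GenerateSequence source periodically re-reads the map, and a repeated trigger in the global window replaces the singleton view on each refresh. fetchLatestMap() is a hypothetical loader stub:

    import java.util.Collections;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.joda.time.Duration;

    class LookupSideInput {
      // Hypothetical loader; replace with the real lookup-source read.
      static Map<String, String> fetchLatestMap() {
        return Collections.singletonMap("key", "value");
      }

      static PCollectionView<Map<String, String>> lookupView(Pipeline p) {
        return p
            // Tick once every 5 minutes to trigger a refresh.
            .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
            .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                c.output(fetchLatestMap());
              }
            }))
            // Re-fire in the global window so the view is replaced each refresh.
            .apply(Window.<Map<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
            .apply(View.asSingleton());
      }
    }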

2

I have a directory on GCS or another supported filesystem to which new files are being written by an external process. I would like to write an Apache Beam streaming pipeline that continuously wat...
Abel asked 19/12, 2017 at 23:5
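
A sketch of the continuous-match shape in the Java SDK (the pattern and polling interval are assumptions):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Watch;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<String> watchDirectory(Pipeline p) {
      return p
          .apply(FileIO.match()
              .filepattern("gs://my-bucket/incoming/*")     // hypothetical path
              .continuously(Duration.standardMinutes(1),    // polling interval
                  Watch.Growth.never()))                    // watch forever
          .apply(FileIO.readMatches())
          .apply(TextIO.readFiles());
    }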

1

Overview: I followed the following guide to write TF Records, where I used tf.Transform to preprocess my features. Now, I would like to deploy my model, for which I need to apply this preprocessing fu...

2

Solved

BigQuery supports de-duplication for streaming inserts. How can I use this feature from Apache Beam? https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency To help ensur...
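
In the Java SDK, streaming inserts already carry a generated insertId per row that is reused on retries, so BigQuery's best-effort de-duplication applies without extra code (the id is not user-settable through BigQueryIO). A sketch; the table spec and schema are assumptions:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    static void streamToBigQuery(PCollection<TableRow> rows, TableSchema schema) {
      rows.apply(BigQueryIO.writeTableRows()
          .to("my-project:my_dataset.my_table")
          .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)  // insertIds attached by the sink
          .withSchema(schema)
          .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    }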
