apache-beam Questions

1

Solved

When I try to run a test pipeline it raises an error. Here is the source code that creates the test pipeline: val p: TestPipeline = TestPipeline.create() and here is the error: java.lang.Illeg...
Emarginate asked 5/8, 2019 at 16:4
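
If the truncated error is the common "Is your TestPipeline declaration missing a @Rule annotation?" java.lang.IllegalStateException, the usual cause is creating the TestPipeline outside a JUnit @Rule, or never actually running it. A minimal JUnit 4 sketch in the Java SDK (the same fix applies from Kotlin):

    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.junit.Rule;
    import org.junit.Test;

    public class MyPipelineTest {
      // TestPipeline must be exposed as a public final transient @Rule;
      // without it, its enforcement fails at test time with an
      // IllegalStateException.
      @Rule
      public final transient TestPipeline p = TestPipeline.create();

      @Test
      public void pipelineRuns() {
        p.apply(Create.of("a", "b", "c"));
        // The rule also verifies the pipeline is actually executed.
        p.run().waitUntilFinish();
      }
    }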

1

Solved

I'm hoping someone can clarify the relationship between TensorFlow and its dependencies (Beam, Airflow, Flink, etc.). I'm referencing the main TFX page: https://www.tensorflow.org/tfx/guide#creating_...
Jacintajacinth asked 17/5, 2019 at 20:6

0

I have a Dataflow pipeline in which I consume Pub/Sub messages, process them, and then publish back to Pub/Sub. Whenever I have too many calculations (i.e. I increase the amount of processing for each message...
Crosslegged asked 18/7, 2019 at 13:27

1

Solved

I have a general question on side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement need to be p...
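
For reference, small lookup data is typically broadcast as a side input: materialize it as a PCollectionView and declare it on the ParDo. A sketch assuming hypothetical events and lookups collections:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    static PCollection<String> enrich(
        PCollection<String> events, PCollection<KV<String, String>> lookups) {
      // Materialize the lookup pairs as a broadcast map view.
      PCollectionView<Map<String, String>> lookupView =
          lookups.apply(View.<String, String>asMap());
      return events.apply(
          ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void process(ProcessContext c) {
              // Side inputs are read per element via the view handle.
              Map<String, String> lookup = c.sideInput(lookupView);
              c.output(c.element() + "," + lookup.getOrDefault(c.element(), "?"));
            }
          }).withSideInputs(lookupView));
    }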

5

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>, and I want to write the values to different files corresponding to their ...
Philip asked 8/4, 2017 at 6:46
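
Beam 0.6.0 predates a built-in answer, but in more recent Beam Java releases the standard approach is FileIO.writeDynamic, which partitions output files by a destination key. A sketch with hypothetical paths:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    static void writeByKey(PCollection<KV<String, String>> results) {
      results.apply(FileIO.<String, KV<String, String>>writeDynamic()
          .by(KV::getKey)                                   // destination = the element's key
          .via(Contextful.fn(KV::getValue), TextIO.sink())  // write only the value as a text line
          .to("gs://my-bucket/output")                      // hypothetical base directory
          .withDestinationCoder(StringUtf8Coder.of())
          .withNaming(key -> FileIO.Write.defaultNaming("out-" + key, ".txt")));
    }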

3

Solved

What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python. Is the best way ...
Coloration asked 28/10, 2018 at 17:40

3

I have a CSV file, and I don't know the column names ahead of time. I need to output the data in JSON after some transformations in Google Dataflow. What's the best way to take the header row and ...
Tetracycline asked 23/12, 2016 at 8:21
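
Since Beam reads text lines in arbitrary order, one common workaround is to read the header row outside the pipeline (e.g., in the launcher program) and hand the column names to a DoFn. A rough sketch; the naive comma splitting and absent JSON escaping are simplifying assumptions:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    // Joins each CSV line with header names captured at pipeline
    // construction time.
    class CsvToJsonFn extends DoFn<String, String> {
      private final List<String> headers;  // must be serializable, e.g. an ArrayList

      CsvToJsonFn(List<String> headers) { this.headers = headers; }

      @ProcessElement
      public void process(ProcessContext c) {
        String[] values = c.element().split(",", -1);
        if (Arrays.asList(values).equals(headers)) {
          return;  // skip the header row itself
        }
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < headers.size() && i < values.length; i++) {
          if (i > 0) json.append(",");
          json.append('"').append(headers.get(i)).append("\":\"").append(values[i]).append('"');
        }
        c.output(json.append('}').toString());
      }
    }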

1

Solved

I'm using TextIO for reading from Cloud Storage. As I want to have the job running continuously, I use watchForNewFiles. For completeness, the data I read works fine if I use bounded PCollecti...
Slacks asked 12/11, 2018 at 16:50
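
For reference, the continuous-read variant looks like this in the Java SDK; the glob and polling interval are assumptions. Note that the resulting PCollection becomes unbounded, so downstream aggregations need windowing:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Watch;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<String> readContinuously(Pipeline p) {
      return p.apply(TextIO.read()
          .from("gs://my-bucket/input/*.csv")      // hypothetical glob
          .watchForNewFiles(
              Duration.standardSeconds(30),        // polling interval
              Watch.Growth.never()));              // keep watching forever
    }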

2

Solved

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo. How can I do this using the Apache Beam BigQueryIO ...
Invalidity asked 19/4, 2017 at 20:32
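
One way is the BigQueryIO.Write.to(...) overload that takes a destination function (DynamicDestinations offers finer control). A sketch in which Foo, its getters, and the schema are assumptions from the question:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.PCollection;

    static void writePerFooTable(PCollection<Foo> foos, TableSchema schema) {
      foos.apply(BigQueryIO.<Foo>write()
          // The function sees each element (with its window) and picks a table.
          .to(v -> new TableDestination(
              "my-project:my_dataset.table_" + v.getValue().getKind(), null))
          .withFormatFunction(foo -> new TableRow().set("name", foo.getName()))
          .withSchema(schema)
          .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    }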

1

Solved

How should one implement the following logic located at https://beam.apache.org/documentation/pipelines/design-your-pipeline/: // Merge the two PCollections with Flatten PCollectionList<Str...
Justen asked 16/5, 2019 at 18:30
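
The linked example merges two same-type PCollections via PCollectionList and Flatten (both collections must share the same windowing). A minimal sketch with hypothetical pc1 and pc2:

    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    static PCollection<String> merge(PCollection<String> pc1, PCollection<String> pc2) {
      // Merge the two PCollections with Flatten.
      PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2);
      return collections.apply(Flatten.pCollections());
    }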

1

Solved

Context: I'm working with a streaming pipeline which has a protobuf data source in Pub/Sub. I wish to parse this protobuf into a Python dict because the data sink requires the input to be a collectio...

1

I'm getting an unexpected error when I try to use the output of beam.combiners.ToList as the input of beam.pvalue.AsSingleton or beam.pvalue.AsList in order to experiment with side inputs. I was ab...
Vivavivace asked 14/4, 2019 at 14:29

2

SDK: Apache Beam SDK for Go 0.5.0. Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version seems to be the same a...
Ambassadress asked 12/12, 2018 at 2:15

1

Solved

My goal is to be able to access PubSub message Publish Time as recorded and set by Google PubSub in Apache Beam (Dataflow). PCollection<PubsubMessage> pubsubMsg = pipeline.apply("Read Mess...
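
Unless withTimestampAttribute(...) overrides it, PubsubIO assigns each element the Pub/Sub publish time as its event timestamp, so it can be read back as the element timestamp inside a DoFn. A sketch with a hypothetical subscription:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.joda.time.Instant;

    static void readWithPublishTime(Pipeline pipeline) {
      pipeline
          .apply(PubsubIO.readMessagesWithAttributes()
              .fromSubscription("projects/my-project/subscriptions/my-sub"))
          .apply(ParDo.of(new DoFn<PubsubMessage, String>() {
            @ProcessElement
            public void process(ProcessContext c) {
              // Element timestamp = Pub/Sub publish time by default.
              Instant publishTime = c.timestamp();
              c.output(publishTime.toString());
            }
          }));
    }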

2

Solved

My objective is to read Avro file data from Cloud Storage and write it to a BigQuery table using Java. It would be good if someone could provide a code snippet/ideas to read Avro-format data and writ...
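
A common shape for this: read GenericRecords with AvroIO, map them to TableRows, and write with BigQueryIO. The Avro schema, field names, and table spec below are assumptions:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    static void avroToBigQuery(Pipeline p, Schema avroSchema, TableSchema tableSchema) {
      p.apply(AvroIO.readGenericRecords(avroSchema).from("gs://my-bucket/data/*.avro"))
          .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
              .via((GenericRecord r) -> new TableRow()
                  .set("id", r.get("id").toString())       // hypothetical fields
                  .set("name", r.get("name").toString())))
          .apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.my_table")
              .withSchema(tableSchema)
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
    }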

1

Solved

We are currently working on a streaming pipeline on Apache Beam with DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them in sliding wi...
Septarium asked 20/2, 2019 at 10:30
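
For reference, sliding windows in the Java SDK are declared with a size and a period; here, 5-minute windows starting every minute, so each element belongs to five overlapping windows:

    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<String> window(PCollection<String> messages) {
      return messages.apply(Window.<String>into(
          SlidingWindows.of(Duration.standardMinutes(5))   // window size
              .every(Duration.standardMinutes(1))));       // new window each minute
    }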

1

Task: I am to run an ETL job that will Extract TIFF images from GCS, Transform those images to text with a combination of open source computer vision tools such as OpenCV + Tesseract and ultimately...
Depredate asked 20/11, 2018 at 15:12
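
A rough DoFn skeleton for the transform step, assuming files are matched with FileIO and OCR'd with Tess4J (a Java wrapper around Tesseract); the local-copy step and library usage are untested assumptions, not a recipe:

    import java.io.File;
    import java.nio.file.Files;
    import net.sourceforge.tess4j.Tesseract;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;

    class OcrFn extends DoFn<FileIO.ReadableFile, String> {
      private transient Tesseract tesseract;

      @Setup
      public void setup() {
        tesseract = new Tesseract();  // expensive init, once per DoFn instance
      }

      @ProcessElement
      public void process(ProcessContext c) throws Exception {
        // Tesseract wants a local file, so stage the GCS bytes to a temp file.
        File local = File.createTempFile("page", ".tiff");
        Files.write(local.toPath(), c.element().readFullyAsBytes());
        try {
          c.output(tesseract.doOCR(local));  // extracted text
        } finally {
          local.delete();
        }
      }
    }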

1

I am streaming data from Kafka to BigQuery using Apache Beam with the Google Dataflow runner. I wanted to make use of insertId for deduplication, which I found described in the Google docs. But even though ins...
Evolve asked 25/9, 2017 at 8:53

1

Problem: When using Cloud Dataflow, we are presented with two metrics (see this page): system latency and data freshness. These are also available in Stackdriver under the following names (extract from here)...
Felly asked 7/3, 2019 at 17:36

2

Solved

We are building a pipeline using Apache Beam and DirectRunner as the runner. We are currently attempting a simple pipeline whereby we: Pull data from Google Cloud Pub/Sub (currently using the emu...
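
For what it's worth, the Java SDK can be pointed at the emulator by overriding the Pub/Sub root URL on the pipeline options (host, port, and resource names are assumptions; DirectRunner is the default runner):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class EmulatorPipeline {
      public static void main(String[] args) {
        PubsubOptions options =
            PipelineOptionsFactory.fromArgs(args).as(PubsubOptions.class);
        options.setPubsubRootUrl("http://localhost:8085");  // local emulator endpoint
        Pipeline p = Pipeline.create(options);
        p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/local-project/subscriptions/local-sub"));
        p.run().waitUntilFinish();
      }
    }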

2

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this: The first step is doing some basic process...
Mazel asked 19/2, 2019 at 10:31

1

Solved

I'm setting up a slow-changing lookup map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with acc...
Morelos asked 29/1, 2019 at 13:46
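
The usual shape for this is the "slowly updating side input" pattern: a ticking GenerateSequence source periodically re-reads the map, and a repeated trigger in the global window replaces the singleton view on each refresh. fetchLatestMap() is a hypothetical loader stub:

    import java.util.Collections;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.joda.time.Duration;

    class LookupSideInput {
      // Hypothetical loader; replace with the real lookup-source read.
      static Map<String, String> fetchLatestMap() {
        return Collections.singletonMap("key", "value");
      }

      static PCollectionView<Map<String, String>> lookupView(Pipeline p) {
        return p
            // Tick once every 5 minutes to trigger a refresh.
            .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
            .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                c.output(fetchLatestMap());
              }
            }))
            // Re-fire in the global window so the view is replaced each refresh.
            .apply(Window.<Map<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
            .apply(View.asSingleton());
      }
    }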

2

I have a directory on GCS or another supported filesystem to which new files are being written by an external process. I would like to write an Apache Beam streaming pipeline that continuously wat...
Abel asked 19/12, 2017 at 23:5
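
A sketch of the continuous-match shape in the Java SDK (the pattern and polling interval are assumptions):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Watch;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<String> watchDirectory(Pipeline p) {
      return p
          .apply(FileIO.match()
              .filepattern("gs://my-bucket/incoming/*")     // hypothetical path
              .continuously(Duration.standardMinutes(1),    // polling interval
                  Watch.Growth.never()))                    // watch forever
          .apply(FileIO.readMatches())
          .apply(TextIO.readFiles());
    }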

1

Overview: I followed the following guide to write TF Records, where I used tf.Transform to preprocess my features. Now, I would like to deploy my model, for which I need to apply this preprocessing fu...

2

Solved

BigQuery supports de-duplication for streaming inserts. How can I use this feature from Apache Beam? https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency To help ensur...
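
In the Java SDK, streaming inserts already carry a generated insertId per row that is reused on retries, so BigQuery's best-effort de-duplication applies without extra code (the id is not user-settable through BigQueryIO). A sketch; the table spec and schema are assumptions:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    static void streamToBigQuery(PCollection<TableRow> rows, TableSchema schema) {
      rows.apply(BigQueryIO.writeTableRows()
          .to("my-project:my_dataset.my_table")
          .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)  // insertIds attached by the sink
          .withSchema(schema)
          .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    }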
