apache-beam Questions

2

SDK: Apache Beam SDK for Go 0.5.0 We are running Apache Beam Go SDK jobs in Google Cloud Data Flow. They had been working fine until recently when they intermittently stopped working (no changes m...
Kelson asked 17/12, 2018 at 22:7

1

Solved

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as: A PTransform that returns a PCollection equivalent to its input but operationall...
Lungwort asked 10/1, 2019 at 3:39

1

Solved

I am using Apache-Beam to run some data transformation, which including data extraction from txt, csv, and different sources of data. One thing I noticed, is the difference of results when using be...
Hoff asked 24/12, 2018 at 11:35

1

This is most similar to this question. I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a ve...
Foxworth asked 27/11, 2017 at 19:9

1

Solved

Context I am reading a file from Google Storage in Beam using a process that looks something like this: data = pipeline | beam.Create(['gs://my/file.pkl']) | beam.ParDo(LoadFileDoFn) Where LoadFil...
Hardened asked 21/12, 2018 at 13:6

2

(I've also raised a GitHub issue for this - https://github.com/googleapis/google-cloud-java/issues/4095) I have the latest versions of the following 2 dependencies for Apache Beam: Dependency...

1

I have a use case where I initialise a HashMap that contains a set of lookup data (information about the physical location etc. of IoT devices). This lookup data serves as reference data for a 2nd ...

3

Solved

I am using Apache Beam in Python with Google Cloud Dataflow (2.3.0). When specifying the worker_machine_type parameter as e.g. n1-highmem-2 or custom-1-6656, Dataflow runs the job but always uses t...
Degraded asked 12/4, 2018 at 7:35

1

I Have a PCollection of Object that I get from pubsub, let say : PCollection<Student> pStudent ; and in student attributes, there is an attribute let say studentID; and I want to read at...
Dagon asked 8/11, 2018 at 6:22

1

Solved

Currently I am using the gcs-text-to-bigquery google provided template and feeding in a transform function to transform my jsonl file. The jsonl is pretty nested and i wanted to be able to output m...

1

Working on reading files from multiple folders and then output the file contents with the file name like (filecontents, filename) to bigquery in apache beam using the python sdk and a dataflow runn...

2

What are the differences between Apache Beam and Apache Kafka with respect to Stream processing? I am trying to grasp the technical and programmatic differences as well. Please help me understand ...

1

Solved

I'm new to Apache Beam, and I want to calculate the mean and std deviation over a large dataset. Given a .csv file of the form "A,B" where A, B are ints, this is basically what I have. import apa...
Glyceryl asked 13/8, 2018 at 21:38

2

Solved

I am trying to read a table from a Google spanner database, and write it to a text file to do a backup, using google dataflow with the python sdk. I have written the following script: from __fut...

2

I'm reading an article on exactly-once processing implemented by some Dataflow sources and sinks and I'm having troubles understanding the example on BigQuery sink. From the article Generating a...
Bamford asked 26/9, 2018 at 14:52

1

I fetch protobuf data from google pub/sub and deserialize the data to Message type object. So i get PCollection<Message> type object. Here is sample code: public class ProcessPubsubMessage e...
Whoopee asked 18/9, 2018 at 9:4

2

Solved

Input PCollection is http requests, which is a bounded dataset. I want to make async http call (Java) in a ParDo , parse response and put results into output PCollection. My code is below. Getting ...
Scraperboard asked 17/4, 2018 at 18:20

1

Solved

I'm creating a shell script to handle automation for some of our workflows, This workflow include accessing Google Buckets via Apache Beam GCP. I'm using a .json file with my service account, in w...
Worrell asked 17/9, 2018 at 22:48

1

As the documentation is only available for JAVA, I could not really understand what it means. It states - "While ParDo always produces a main output PCollection (as the return value from apply), y...
Swee asked 14/9, 2018 at 20:9

1

Solved

Given the data set as below {"slot":"reward","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42544} {"slot":"reward_dlg","result":1,"rank":1,"isLandscape"...
Ascariasis asked 11/9, 2018 at 7:21

1

Solved

I have pulled down a copy of the Pub/Sub to BigQuery Dataflow template from Google's github repository. I am running it on my local machine using the direct-runner. In testing I confirmed that the...

2

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phase. I followed this StackOverflow post and implemented the initial conve...
Institution asked 31/7, 2018 at 5:40

1

Solved

I have a .py pipeline using apache beam that import another module (.py), that is my custom module. I have a strucutre like this: ├── mymain.py └── myothermodule.py I import myothermodule.py in ...
Tavie asked 9/8, 2018 at 9:27

1

Solved

When trying to load a GCS file using a CSEK I get a dataflow error [ERROR] The target object is encrypted by a customer-supplied encryption key I was going to try to AES decrypt on the dataflow ...

2

I want to find out only female employees out of the two different JSON files and select only the fields which we are interested in and write the output into another JSON. Also I am trying to imple...
Cyruscyst asked 24/7, 2017 at 18:24

© 2022 - 2024 — McMap. All rights reserved.