apache-beam Questions

3

I want to run a continuous stream processing job using Beam on a Flink runner within Kubernetes. I've been following this tutorial (https://python.plainenglish.io/apache-beam-flink-cluste...
Twayblade asked 18/7, 2023 at 12:8
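
A minimal sketch of how such a pipeline is typically pointed at a Flink job service from Python; the endpoints below are assumptions and would be the addresses of the job-server and worker-pool services exposed in the Kubernetes cluster:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Submit to a Flink job service via the portable runner. The endpoints are
    # hypothetical; point them at the flink job-server and beam worker-pool
    # services running in Kubernetes. A real streaming job would read from an
    # unbounded source (Kafka, Pub/Sub, ...) instead of Create.
    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        "--environment_type=EXTERNAL",
        "--environment_config=localhost:50000",
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "beam on flink"])
         | "Print" >> beam.Map(print))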

2

Solved

We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows: pull data from Pub/Sub, deserialize JSON into a Java object, window events with fixed win...
Deploy asked 16/5, 2017 at 21:23
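
The original flow is Java, but here is a rough Python sketch of the same shape (the topic name and window size are assumptions), with the streaming flag set so Pub/Sub is treated as an unbounded source:

    import json
    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | "Parse" >> beam.Map(json.loads)                       # JSON -> dict per event
         | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second fixed windows
         | "Key" >> beam.Map(lambda e: (e.get("type", "unknown"), 1))
         | "CountPerWindow" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))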

2

I am trying to run an Apache Beam pipeline on the Dataflow runner; the job reads data from a BigQuery table and writes data to a database. I am running the job with the classic template option in Dataflow - ...
Concepcion asked 7/5, 2021 at 20:29

6

Is there an example of a Python Dataflow Flex Template with more than one file where the script is importing other files included in the same folder? My project structure is like this: ├── pipeline...
Myriammyriameter asked 18/11, 2020 at 14:52
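
The usual way to make local modules importable on the Dataflow workers is to ship them as a package; a minimal sketch of a setup.py at the project root, assuming the extra files live in a pipeline/ package (name hypothetical):

    # setup.py -- packages the local "pipeline" module so Dataflow workers can
    # import it when the Flex Template job runs.
    import setuptools

    setuptools.setup(
        name="pipeline",
        version="0.1.0",
        packages=setuptools.find_packages(),   # picks up pipeline/ and submodules
        install_requires=["apache-beam[gcp]"],
    )

The file is then referenced either with the --setup_file pipeline option or, when using the Google-provided Flex Template base images, with the FLEX_TEMPLATE_PYTHON_SETUP_FILE environment variable in the template Dockerfile.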

2

Solved

I have read through the Beam documentation and also looked through the Python documentation, but haven't found a good explanation of the syntax used in most of the example Apache Beam code. Can ...
Cliffhanger asked 5/5, 2017 at 3:32
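
For reference, the unfamiliar syntax is plain Python operator overloading: | applies a PTransform to a pipeline or PCollection, and >> only attaches a label to the transform. A minimal example:

    import apache_beam as beam

    # "|" applies a transform; ">>" only attaches a string label to it.
    # Both pipelines below build the same graph.
    with beam.Pipeline() as p:
        # unlabeled form
        _ = p | beam.Create(["a", "b"]) | beam.Map(str.upper) | beam.Map(print)

    with beam.Pipeline() as p:
        # labeled form: "Name" >> transform
        _ = (p
             | "Create" >> beam.Create(["a", "b"])
             | "Upper" >> beam.Map(str.upper)
             | "Print" >> beam.Map(print))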

6

Solved

I have been working on Apache Beam for a couple of days. I wanted to quickly iterate on the application I am working on and make sure the pipeline I am building is error-free. In Spark we can use sc.p...
Catamnesis asked 25/9, 2017 at 13:26
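
A quick local-iteration pattern, roughly the Beam counterpart of sc.parallelize: build a tiny input with beam.Create and check the output with assert_that on the (default) DirectRunner:

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    # A small in-memory input plus an assertion runs in seconds and catches
    # most pipeline-wiring errors before submitting to a real cluster.
    with TestPipeline() as p:
        result = (p
                  | beam.Create([1, 2, 3])
                  | beam.Map(lambda x: x * 10))
        assert_that(result, equal_to([10, 20, 30]))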

2

Solved

In GCP we can see the pipeline execution graph. Is the same possible when running locally via DirectRunner?
Roughhew asked 12/6, 2022 at 14:9

2

Solved

Both DoFn and PTransform are means of defining operations on a PCollection. How do we know which to use when?
Unmitigated asked 8/12, 2017 at 1:57
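
Roughly: a DoFn holds per-element processing logic and is handed to ParDo, while a PTransform maps whole PCollections to PCollections and is the unit pipelines are composed from; composite PTransforms often wrap one or more ParDo(DoFn) steps. A small sketch:

    import apache_beam as beam
    from apache_beam.transforms import combiners

    class ParseLine(beam.DoFn):            # per-element logic, used via ParDo
        def process(self, element):
            yield element.strip().lower()

    class CleanAndCount(beam.PTransform):  # composite step built from other transforms
        def expand(self, pcoll):
            return (pcoll
                    | "Parse" >> beam.ParDo(ParseLine())
                    | "Count" >> combiners.Count.PerElement())

    with beam.Pipeline() as p:
        (p
         | beam.Create(["  A", "a ", "B"])
         | CleanAndCount()
         | beam.Map(print))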

3

Solved

I would like to read a CSV file and write it to BigQuery using Apache Beam on Dataflow. In order to do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the ...
Glossematics asked 15/12, 2016 at 18:30
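
A minimal sketch of that shape, assuming a header of name,age,city and a table my-project:my_dataset.people (all hypothetical): each CSV line becomes a dict whose keys match the BigQuery column names.

    import csv
    import apache_beam as beam

    FIELDS = ["name", "age", "city"]

    def to_dict(line):
        # csv.reader handles quoting better than a plain split(",")
        row = dict(zip(FIELDS, next(csv.reader([line]))))
        row["age"] = int(row["age"])   # cast numeric columns explicitly
        return row

    with beam.Pipeline() as p:
        (p
         | beam.io.ReadFromText("gs://my-bucket/people.csv", skip_header_lines=1)
         | beam.Map(to_dict)
         | beam.io.WriteToBigQuery(
             "my-project:my_dataset.people",
             schema="name:STRING,age:INTEGER,city:STRING",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))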

2

I'm using Python Beam on Google Dataflow; my pipeline looks like this: Read image URLs from file >> Download images >> Process images. The problem is that I can't let the Download images step scale...
Yusuk asked 5/9, 2018 at 11:1

3

I have a bounded PCollection but I only want to keep the first X elements and discard the rest. Is there a way to do this using Dataflow 2.X / Apache Beam?
Extenuate asked 30/3, 2018 at 17:57
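
One common approach (a sketch, not necessarily the only answer): Sample.FixedSizeGlobally keeps up to X arbitrary elements of a bounded PCollection and discards the rest. Since PCollections are unordered, "the first X" usually means "any X" in practice.

    import apache_beam as beam
    from apache_beam.transforms import combiners

    X = 100

    with beam.Pipeline() as p:
        (p
         | beam.Create(range(1000))
         | combiners.Sample.FixedSizeGlobally(X)    # -> a single list of X elements
         | beam.FlatMap(lambda xs: xs)              # back to individual elements
         | beam.Map(print))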

3

Solved

I'm new to using Apache Beam in Python with the Dataflow runner. I'm interested in creating a batch pipeline that publishes to Google Cloud Pub/Sub. I had tinkered with the Beam Python APIs and fo...
Dias asked 24/4, 2019 at 5:21
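
The usual catch is that beam.io.WriteToPubSub only works in streaming pipelines; a batch pipeline typically publishes from a DoFn using the google-cloud-pubsub client instead. A sketch (the topic name is hypothetical):

    import apache_beam as beam

    class PublishToPubSub(beam.DoFn):
        def __init__(self, topic):
            self.topic = topic

        def setup(self):
            # create the client once per worker process
            from google.cloud import pubsub_v1
            self.publisher = pubsub_v1.PublisherClient()

        def process(self, element):
            # publish() returns a future; .result() waits for the server ack
            self.publisher.publish(self.topic, element.encode("utf-8")).result()

    with beam.Pipeline() as p:
        (p
         | beam.Create(["message 1", "message 2"])
         | beam.ParDo(PublishToPubSub("projects/my-project/topics/my-topic")))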

4

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, and it wasn't clear from the Google documentation. Currently, I'm using Cloud Dat...

0

I am trying to use Apache Beam with Python to fetch JSON data from an API and write it to a BigQuery table. Here is the code I am using: import argparse import json import requests import apache_be...
Gang asked 24/4, 2023 at 10:54
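
A minimal sketch of that shape, assuming the API returns a JSON array of flat objects and that the table and schema below exist (all hypothetical); the HTTP call happens inside the pipeline so it runs on the workers:

    import requests
    import apache_beam as beam

    def fetch(url):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.json()           # -> list of dicts, one per record

    with beam.Pipeline() as p:
        (p
         | beam.Create(["https://example.com/api/items"])
         | beam.FlatMap(fetch)
         | beam.io.WriteToBigQuery(
             "my-project:my_dataset.items",
             schema="id:INTEGER,name:STRING",
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))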

3

I am having an issue with one of my Dataflow jobs. From time to time I get these error messages. It seems that after these errors the job keeps running fine, but last night it actually got stuck, or it ...
Lastly asked 16/4, 2021 at 8:50

3

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean several gigabytes (so that it would be impractical to read the entire CSV into memory at once). So far, ...
Versify asked 20/7, 2018 at 9:17
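
For plain CSVs (no quoted embedded newlines), ReadFromText already streams the file line by line and can split it across workers, so the whole file is never held in memory; a sketch:

    import csv
    import apache_beam as beam

    def parse(line):
        # parse one line at a time; nothing loads the whole file into memory
        return next(csv.reader([line]))

    with beam.Pipeline() as p:
        (p
         | beam.io.ReadFromText("gs://my-bucket/big.csv", skip_header_lines=1)
         | beam.Map(parse)
         | beam.Map(print))

CSVs with quoted embedded newlines need a different approach, for example the Beam DataFrame API's read_csv.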

5

Solved

I am new to Beam and struggling to find good guides and resources for learning best practices. One thing I have noticed is that there are two ways pipelines are defined: with beam.Pipeline() as p: # ...
Unarm asked 6/7, 2019 at 12:49
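
The two idioms build the same graph; the context-manager form simply runs the pipeline and waits for it when the with block exits. A side-by-side sketch:

    import apache_beam as beam

    # 1) Context manager: run() and wait_until_finish() are called on exit.
    with beam.Pipeline() as p:
        p | "Create1" >> beam.Create([1, 2, 3]) | "Print1" >> beam.Map(print)

    # 2) Explicit run: useful when you want the PipelineResult, e.g. to poll
    #    or cancel a job that keeps running after submission.
    p = beam.Pipeline()
    p | "Create2" >> beam.Create([1, 2, 3]) | "Print2" >> beam.Map(print)
    result = p.run()
    result.wait_until_finish()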

3

Solved

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam...
Orontes asked 24/4, 2017 at 6:26

3

I was reading the docs about schemas in Apache Beam but I cannot understand what their purpose is, how and why, or in which cases I should use them. What is the difference between using schem...
Mai asked 16/6, 2020 at 16:13
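
In short, a schema gives a PCollection's elements named, typed fields so transforms can refer to columns by name instead of positional lambdas. A small sketch using beam.Row elements (a NamedTuple element type works similarly); the field names are hypothetical:

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.Create([
             beam.Row(user="alice", amount=10.0),
             beam.Row(user="alice", amount=5.0),
             beam.Row(user="bob", amount=7.5),
           ])
         # group and aggregate by field name -- no key-extraction lambdas
         | beam.GroupBy("user").aggregate_field("amount", sum, "total")
         | beam.Map(print))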

1

We're facing issues during Dataflow job deployment. We are using CustomCommands to install a private repo on the workers, but we now see an error in the worker-startup logs of our jobs: Ru...
Masonite asked 6/1, 2020 at 16:17

2

Solved

I've hit a problem with dockerized Apache Beam. When trying to run the container I get a "No id provided." message and nothing more. Here's the code and files: Dockerfile FROM apache...
Calvillo asked 15/9, 2021 at 15:14

4

While in a distributed processing environment it is common to use "part" file names such as "part-000", is it possible to write an extension of some sort to rename the individual output file names ...
Luxuriate asked 9/10, 2017 at 3:29
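
In the Python SDK, the "part"-style pieces of the name are configurable on WriteToText, and fully custom names can be produced with fileio.WriteToFiles and a file_naming callback. A minimal sketch of the former (paths hypothetical):

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.Create(["row1", "row2"])
         | beam.io.WriteToText(
             "gs://my-bucket/output/report",    # file path prefix
             file_name_suffix=".csv",
             shard_name_template="-SS-of-NN",   # or "" with num_shards=1 for one file
             num_shards=2))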

1

I am facing the following error while running a streaming pipeline (Python) in Apache Beam on the FlinkRunner. The pipeline contains a GCP Pub/Sub IO source and a Pub/Sub target. WARNING:root:Make sure t...
Nikolas asked 12/7, 2021 at 4:58

1

The goal is to store audit logs from different apps/jobs and be able to aggregate them by some IDs. We chose BigQuery for that purpose, and so we need to get structured information from...

6

I'm having trouble submitting an Apache Beam example from a local machine to our cloud platform. Using gcloud auth list I can see that the correct account is currently active. I can use gsutil and...
Bicentenary asked 25/5, 2017 at 14:32
