Dataflow/ApacheBeam Limit input to the first X amount?
Asked Answered
E

3

7

I have a bounded PCollection but i only want to get the first X amount of inputs and discard the rest. Is there a way to do this using Dataflow 2.X/ApacheBeam?

Extenuate answered 30/3, 2018 at 17:57 Comment(3)
There isn't a way to do this natively in Apache Beam. You may be able to manipulate or query the input source in a specific way to only select the first X number of elements. What input source are you reading from?Treadwell
Originally the input is the result of a query from a BigQuery Table. Then it goes through a few steps of processing and further filtration before getting to the step where I need only the first million. However, I can’t put a limit on the query.Extenuate
Maybe you could use the Top transform?Treadwell
C
2

As explained by @Andrew in his comments, maybe you can use the Top transform in Apache Beam (for Java or for Python). Specifically, the Top.of() function returns a PTransform with a PCollection, ordered by a comparator transform.

Here you can find a simple example of use:

PCollection<Student> students = ...;
PCollection<List<Student>> top10Students = students.apply(Top.of(10, new CompareStudentsByAvgGrade()));

And here another example using the Apache Beam Python SDK, which works around the fact that a single element is returned in the PCollection.

Captivate answered 9/4, 2018 at 8:40 Comment(0)
D
2

For a random sample of X elements, you can use the built-in Sample transform (for Python or Java).

Here is an example that shows how to sample 10 elements from an example input of 100 elements:

import apache_beam as beam
from apache_beam.transforms.combiners import Sample

with beam.Pipeline(runner='DirectRunner') as p:
    input = p | beam.Create(range(100))
    output = input | Sample.FixedSizeGlobally(10)
    output | beam.io.WriteToText('output')
Distant answered 25/3, 2021 at 2:2 Comment(0)
U
1

If you don't care about the order and just want a sample of N items, then I think you should be able to use beam.combiners.Sample.FixedSizeGlobally as described here: https://beam.apache.org/documentation/transforms/python/aggregation/sample/

Up answered 16/5, 2023 at 18:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.