I have a bounded PCollection but i only want to get the first X amount of inputs and discard the rest. Is there a way to do this using Dataflow 2.X/ApacheBeam?
As explained by @Andrew in his comments, maybe you can use the Top
transform in Apache Beam (for Java or for Python). Specifically, the Top.of()
function returns a PTransform with a PCollection, ordered by a comparator transform.
Here you can find a simple example of use:
PCollection<Student> students = ...;
PCollection<List<Student>> top10Students = students.apply(Top.of(10, new CompareStudentsByAvgGrade()));
And here another example using the Apache Beam Python SDK, which works around the fact that a single element is returned in the PCollection.
For a random sample of X elements, you can use the built-in Sample transform (for Python or Java).
Here is an example that shows how to sample 10 elements from an example input of 100 elements:
import apache_beam as beam
from apache_beam.transforms.combiners import Sample
with beam.Pipeline(runner='DirectRunner') as p:
input = p | beam.Create(range(100))
output = input | Sample.FixedSizeGlobally(10)
output | beam.io.WriteToText('output')
If you don't care about the order and just want a sample of N items, then I think you should be able to use beam.combiners.Sample.FixedSizeGlobally
as described here: https://beam.apache.org/documentation/transforms/python/aggregation/sample/
© 2022 - 2024 — McMap. All rights reserved.