I understand that you want to join two PCollections in a way they follow this syntax : ['element1','element2']. In order to achieve that you can use CoGroupByKey() instead of Flatten().
Considering your code snippet, the syntax would:
def run():
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)
root = p | 'Get source' >> beam.Create([
"source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
])
metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) #let's say it returns ["metic1"]
metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) #let's say it returns ["metic2"]
metric3 = (
(metric1, metric2)
| beam.CoGroupByKey()
| beam.ParDo(RunTask())
)
I would like to point out the difference between Flatten() and CoGroupByKey().
1) Flatten() receives two or more PCollections, which stores the same data type, and merge them into one logical PCollection. For example,
import apache_beam as beam
from apache_beam import Flatten, Create, ParDo, Map
p = beam.Pipeline()
adress_list = [
('leo', 'George St. 32'),
('ralph', 'Pyrmont St. 30'),
('mary', '10th Av.'),
('carly', 'Marina Bay 1'),
]
city_list = [
('leo', 'Sydney'),
('ralph', 'Sydney'),
('mary', 'NYC'),
('carly', 'Brisbane'),
]
street = p | 'CreateEmails' >> beam.Create(adress_list)
city = p | 'CreatePhones' >> beam.Create(city_list)
resul =(
(street,city)
|beam.Flatten()
|ParDo(print)
)
p.run()
And the output,
('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')
Notice that, both PCollections are in the output. However, one is appended to the other.
2) CoGroupByKey() performs a relational join between two or more key value PCollections, which have the same key type. Using this method you will perform a join by key, not appending as done in Flatten(). Below is an example,
import apache_beam as beam
from apache_beam import Flatten, Create, ParDo, Map
p = beam.Pipeline()
address_list = [
('leo', 'George St. 32'),
('ralph', 'Pyrmont St. 30'),
('mary', '10th Av.'),
('carly', 'Marina Bay 1'),
]
city_list = [
('leo', 'Sydney'),
('ralph', 'Sydney'),
('mary', 'NYC'),
('carly', 'Brisbane'),
]
street = p | 'CreateEmails' >> beam.Create(address_list)
city = p | 'CreatePhones' >> beam.Create(city_list)
results = (
(street, city)
| beam.CoGroupByKey()
|ParDo(print)
#| beam.io.WriteToText('delete.txt')
)
p.run()
And the output,
('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))
Notice that you need a primary key in order to join the results. Also, this output is what you expect in your case.