Difference between beam.ParDo and beam.Map in the output type?

class Printer(beam.DoFn):
    def process(self,data_item):
        print data_item

class DateExtractor(beam.DoFn):
    def process(self,data_item):
        return (str(data_item).split(','))[0]

data_from_source = (p
    | 'ReadMyFile 01' >> ReadFromText('./input/data.csv')
    | 'Splitter using beam.ParDo 01' >> beam.ParDo(DateExtractor())
    | 'Printer the data 01' >> beam.ParDo(Printer())
)

copy_of_the_data = (p
    | 'ReadMyFile 02' >> ReadFromText('./input/data.csv')
    | 'Splitter using beam.Map 02' >> beam.Map(lambda record: (record.split(','))[0])
    | 'Printer the data 02' >> beam.ParDo(Printer())
)

Short Answer

You need to wrap the return value of your DoFn's process method in a list.

Longer Version

A ParDo can, in general, return any number of outputs for a single input: for a single input string you can emit zero, one, or many results. For this reason the Beam SDK treats the output of a ParDo not as a single element but as a collection of elements.

In your case the ParDo emits a single string instead of a collection. The Beam Python SDK still tries to interpret that output as a collection of elements, and it does so by iterating over the string you returned, which yields its characters one by one. As a result, your ParDo effectively produces a stream of single characters, not a stream of strings.
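The effect can be reproduced without Beam. The sketch below is a toy model, not Beam's actual implementation: toy_pardo is a hypothetical helper that, like ParDo, simply iterates over whatever the process function returns, and the sample lines are made up for illustration.

```python
def toy_pardo(process, elements):
    """Toy stand-in for ParDo: iterate over whatever process() returns."""
    outputs = []
    for element in elements:
        for item in process(element):  # a bare string is iterated char by char
            outputs.append(item)
    return outputs


def extract_bare(line):
    # Returns a plain string, as in the question's DateExtractor.
    return line.split(",")[0]


def extract_wrapped(line):
    # Returns a one-element list containing the string.
    return [line.split(",")[0]]


# Hypothetical sample input standing in for the lines of data.csv.
lines = ["2017-01-01,foo,1", "2017-01-02,bar,2"]

chars = toy_pardo(extract_bare, lines)
dates = toy_pardo(extract_wrapped, lines)

print(chars[:4])  # ['2', '0', '1', '7']
print(dates)      # ['2017-01-01', '2017-01-02']
```

The only difference between the two functions is the pair of square brackets, yet the first one produces 20 single-character elements and the second produces 2 strings.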

What you need to do is wrap your return value in a list:

class DateExtractor(beam.DoFn):
    def process(self,data_item):
        return [(str(data_item).split(','))[0]]

Notice the square brackets. See the programming guide for more examples.
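A common alternative to the list is to write process as a generator, since Beam only needs something it can iterate over. The sketch below uses a plain class instead of subclassing beam.DoFn so that it runs without Beam installed; that substitution and the sample input line are assumptions made for illustration.

```python
class DateExtractor(object):  # in a real pipeline this would subclass beam.DoFn
    def process(self, data_item):
        # yield turns process() into a generator that emits one whole string,
        # so there is no bare string left for the caller to iterate over.
        yield data_item.split(",")[0]


# Consume the generator the way a caller would: by iterating over it.
result = list(DateExtractor().process("2017-01-01,foo,1"))
print(result)  # ['2017-01-01']
```

With yield you can also emit zero or several outputs per input by yielding zero or several times.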

Map, on the other hand, can be thought of as a special case of ParDo: it is expected to produce exactly one output for each input. So in this case you can just return a single value from the lambda and it works as expected.
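One way to picture the relationship is that Map does the list-wrapping for you. The following is a rough plain-Python sketch of that idea, not Beam's source code; toy_flat_map and toy_map are hypothetical names, and the records are made-up sample data.

```python
def toy_flat_map(fn, elements):
    # ParDo-like behaviour: fn returns an iterable, and every item in it is emitted.
    return [out for element in elements for out in fn(element)]


def toy_map(fn, elements):
    # Map-like behaviour: wrap each return value in a one-element list,
    # so exactly one output is emitted per input.
    return toy_flat_map(lambda element: [fn(element)], elements)


# Hypothetical sample records.
records = ["2017-01-01,foo,1", "2017-01-02,bar,2"]

first_fields = toy_map(lambda record: record.split(",")[0], records)
print(first_fields)  # ['2017-01-01', '2017-01-02']
```

Because the wrapping happens inside toy_map, the lambda can return a bare string without it being split into characters.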

Also, you probably don't need to wrap data_item in str. According to the docs, the ReadFromText transform already produces strings.
