Apache Beam : FlatMap vs Map?

I want to understand in which scenarios I should use FlatMap or Map. The documentation did not seem clear to me.

Even after reading it, I still do not understand in which scenario I should use the FlatMap transform rather than Map.

Could someone give me an example so I can understand the difference?

I understand the difference between FlatMap and Map in Spark, but I am not sure whether there is any similarity here?

Poult answered 14/8, 2017 at 9:1 Comment(0)

These transforms in Beam are exactly the same as in Spark (Scala too).

A Map transform maps a PCollection of N elements into another PCollection of N elements.

A FlatMap transform maps a PCollection of N elements into N collections of zero or more elements, which are then flattened into a single PCollection.

As a simple example, the following happens:

beam.Create([1, 2, 3]) | beam.Map(lambda x: [x, 'any'])
# The result is a collection of THREE lists: [[1, 'any'], [2, 'any'], [3, 'any']]

Whereas:

beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: [x, 'any'])
# The lists that are output by the lambda, are then flattened into a
# collection of SIX single elements: [1, 'any', 2, 'any', 3, 'any']
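
To make the comparison concrete, here is a minimal, self-contained sketch (Python SDK, default DirectRunner) that runs both transforms on the same input in one pipeline; the step labels are my own:

import apache_beam as beam

with beam.Pipeline() as pipeline:
  (pipeline
   | 'CreateForMap' >> beam.Create([1, 2, 3])
   | 'Map' >> beam.Map(lambda x: [x, 'any'])          # 3 elements in, 3 lists out
   | 'PrintMapped' >> beam.Map(print))

  (pipeline
   | 'CreateForFlatMap' >> beam.Create([1, 2, 3])
   | 'FlatMap' >> beam.FlatMap(lambda x: [x, 'any'])  # 3 elements in, 6 elements out
   | 'PrintFlattened' >> beam.Map(print))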
Akins answered 14/8, 2017 at 21:1 Comment(4)
Pablo - Got it. Thank you for your detailed explanation and examples. :)Poult
Excellent explanation +1Yapon
A minor clarification: as always with a PCollection, the order is arbitrary - so it could be [1, 2, 3, 'any', 'any', 'any']. Also, as you'd expect, FlatMap requires that the function passed to it return an iterable (such as a list); that is, beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: x) will raise an exception.Tot
The fact that this basic thing (and many more) isn't clearly documented anywhere says a lot about this project and its Python support. The documentation available is a nightmare... I have rarely come across documentation that is so bad.Jadejaded

Let me show you an example:

import apache_beam as beam

def categorize_explode(text):
  # e.g. "Vehicles:Car,Jeep" -> [('Vehicles', 'Car'), ('Vehicles', 'Jeep')]
  result = text.split(':')
  category = result[0]
  elements = result[1].split(',')
  return list(map(lambda x: (category, x), elements))

with beam.Pipeline() as pipeline:
  things = (
      pipeline
      | 'Categories and Elements' >> beam.Create(
          ["Vehicles:Car,Jeep,Truck,BUS,AIRPLANE",
           "FOOD:Burger,SANDWICH,ICECREAM,APPLE"])
      | 'Explode' >> beam.FlatMap(categorize_explode)
      | beam.Map(print)
  )

As you can see, the categorize_explode function splits each string into a category and its corresponding elements, and returns a list of tuples like [('Vehicles', 'Car'), ('Vehicles', 'Jeep'), ...].

FlatMap takes each element of this list and emits it as a separate element of the output PCollection.

So the result would be:

('Vehicles', 'Car')
('Vehicles', 'Jeep')
('Vehicles', 'Truck')
('Vehicles', 'BUS')
('Vehicles', 'AIRPLANE')
('FOOD', 'Burger')
('FOOD', 'SANDWICH')
('FOOD', 'ICECREAM')
('FOOD', 'APPLE')

Map, on the other hand, performs a one-to-one mapping: the whole list [('Vehicles', 'Car'), ('Vehicles', 'Jeep'), ...] would be returned as a single element.

So with Map the result would be:

[('Vehicles', 'Car'), ('Vehicles', 'Jeep'), ('Vehicles', 'Truck'), ('Vehicles', 'BUS'), ('Vehicles', 'AIRPLANE')]
[('FOOD', 'Burger'), ('FOOD', 'SANDWICH'), ('FOOD', 'ICECREAM'), ('FOOD', 'APPLE')]
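
For comparison, swapping FlatMap for Map is the only change needed to get that output. This sketch reuses the categorize_explode function from the pipeline above:

with beam.Pipeline() as pipeline:
  things = (
      pipeline
      | 'Categories and Elements' >> beam.Create(
          ["Vehicles:Car,Jeep,Truck,BUS,AIRPLANE",
           "FOOD:Burger,SANDWICH,ICECREAM,APPLE"])
      | 'One to one' >> beam.Map(categorize_explode)  # each string becomes ONE list
      | beam.Map(print)
  )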

The approach I have used here is similar to Spark's explode transform.

Hope this helps!!!

Lament answered 4/3, 2020 at 13:54 Comment(0)

In the simplest words:

A Map transformation is a "one to one" mapping applied to each element of a list/collection. Example:

{"Amar", "Akabar", "Anthony"} -> {"Mr.Amar", "Mr.Akabar", "Mr.Anthony"}

A FlatMap transformation is usually applied to a collection like a "list of lists": the transformation/mapping is applied to each element, and the results are flattened into a single list.

FlatMap transformation example:

{ {"Amar", "Akabar"},  "Anthony"} -> {"Mr.Amar", "Mr.Akabar", "Mr.Anthony"}

This concept remains the same across programming languages and platforms.

Hope it helps.

Armandoarmature answered 27/1, 2020 at 17:17 Comment(0)
