Join vs COGROUP in PIG - McMap


Asked 21/9, 2011 at 7:23 Answered 21/9, 2011 at 13:13

Solved hadoop apache-pig


Are there any advantages (in performance or in the number of MapReduce jobs) to using COGROUP instead of JOIN in Pig?

http://developer.yahoo.com/hadoop/tutorial/module6.html talks about the difference in the type of output they produce. But, ignoring the output schema, is there any significant difference in performance?

Fated answered 21/9, 2011 at 7:23 Comment(0)


There are no major performance differences. I say this because both end up as a single MapReduce job that sends the same data to the reducers: in both cases, every record is shuffled with the foreign key as the key. If anything, COGROUP might be slightly faster, because it does not compute the cartesian product across the matching records and instead keeps them in separate bags.
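To make the difference in output concrete, here is a minimal Python sketch of the semantics described above. It is not Pig code and not how Pig is implemented; the function names `cogroup` and `join` and the sample data are hypothetical, chosen only for illustration. Both operations group records by key in the same way, and the join then pairs up every left match with every right match.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right):
    """Group both inputs by key, keeping each input's records in its own bag."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def join(left, right):
    """Inner join modelled as a cogroup followed by a per-key cross product."""
    return [
        (key, l, r)
        for key, (left_bag, right_bag) in cogroup(left, right).items()
        for l, r in product(left_bag, right_bag)
    ]


# Hypothetical data: "abcde" appears 3 times on the left and 4 times on the right.
left = [("abcde", i) for i in range(3)] + [("xyz", 0)]
right = [("abcde", j) for j in range(4)]

grouped = cogroup(left, right)
joined = join(left, right)

print(len(grouped["abcde"][0]), len(grouped["abcde"][1]))  # 3 4
print(len(joined))  # 12
```

The cogroup result holds one entry per key with two bags (3 and 4 items for "abcde"), while the join expands that same key into 3 x 4 = 12 rows. The key "xyz" has no match on the right, so it keeps an empty right bag in the cogroup and produces no rows in the inner join.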

If one of your data sets is small, you can use a join option called "replicated join". This distributes the second (small) data set to all map tasks and loads it into main memory, so the entire join happens in the mapper and no reducer is needed. In my experience this is well worth it, because the bottleneck in joins and cogroups is shuffling the entire data set to the reducers. To my knowledge, you can't do this with COGROUP.
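In Pig Latin this option is requested by adding `USING 'replicated'` to the JOIN statement, with the small relation listed last. The idea behind it can be sketched in plain Python as follows. This is only an illustration of a map-side join under the assumption that the small input fits in memory; `replicated_join` and the sample records are hypothetical names and data, not part of Pig.

```python
from collections import defaultdict


def replicated_join(big_records, small_records):
    """Map-side inner join: index the small input in memory, then stream the big one."""
    # Each "mapper" would hold its own full copy of this index.
    small_index = defaultdict(list)
    for key, value in small_records:
        small_index[key].append(value)

    # The big input is only streamed through; it is never shuffled or sorted.
    for key, value in big_records:
        for small_value in small_index.get(key, ()):
            yield (key, value, small_value)


# Hypothetical sample data.
big = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
small = [(1, "x"), (2, "y")]

result = list(replicated_join(big, small))
print(result)
```

Because every record of the big input is matched locally against the in-memory index, no shuffle to a reducer is required, which is where the saving described above comes from.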

Cappadocia answered 21/9, 2011 at 13:13 Comment(2)

Internally, join and cogroup are the same thing in Pig (and there is no cartesian join going on in joins, not sure what you mean there). Only the format of the end result changes depending on which keyword you used. Try "explain" on the result of a join -- you'll see the COGROUP operator in the plan. – Empoison 25/9, 2011 at 9:37

What I mean by cartesian product is that if there are multiple matches on the foreign key, you get more records. For example, if "abcde" appears 3 times in one data set and 4 times in the second, the join outputs 12 records because it pairs each one up. Meanwhile, COGROUP keeps the relations separate. – Cappadocia 25/9, 2011 at 15:48
