Using Pig/Hive for data processing instead of direct Java MapReduce code?

(Even more basic than "Difference between Pig and Hive? Why have both?")

I have a data-processing pipeline written as several Java MapReduce jobs over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic operations such as join, invert, sort and group by. My code is involved and not very generic.

What are the pros and cons of continuing with this admittedly development-intensive approach versus migrating everything to Pig/Hive plus a few UDFs? Which jobs won't I be able to express? Will I suffer a performance degradation (I'm working with hundreds of TB)? Will I lose the ability to tweak and debug my code when maintaining it? Will I be able to keep part of the pipeline as Java MapReduce jobs and feed their input/output into my Pig/Hive jobs?
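
For concreteness, here is roughly what I imagine one join/group-by stage would look like in Pig Latin, with one of my existing Java routines wrapped as a UDF. The paths, schemas and the Normalize class are made-up placeholders, not my actual code:

```pig
-- Hypothetical sketch: paths, schemas and the UDF are placeholders.
REGISTER myudfs.jar;                                   -- reuse existing Java code as a UDF
DEFINE Normalize com.example.pig.Normalize();          -- hypothetical UDF class

users  = LOAD '/data/users'  USING PigStorage('\t') AS (user_id:long, country:chararray);
events = LOAD '/data/events' USING PigStorage('\t') AS (user_id:long, url:chararray, ts:long);

joined  = JOIN events BY user_id, users BY user_id;    -- reduce-side join
grouped = GROUP joined BY users::country;              -- group by
counts  = FOREACH grouped GENERATE group AS country,
                                   COUNT(joined) AS hits,
                                   Normalize(group) AS norm_country;
sorted  = ORDER counts BY hits DESC;                   -- global sort

STORE sorted INTO '/out/hits_by_country' USING PigStorage('\t');
```

Each of those operators is currently its own Mapper/Reducer pair (or a chained job) in my code.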

Maegan answered 7/11, 2011 at 14:38

Reference (Twitter): typically a Pig script is 5% of the code of native MapReduce, written in about 5% of the time. However, queries typically take 110-150% of the time that a native MapReduce job would have taken to execute. But of course, if there is a routine that is highly performance-sensitive, they still have the option to hand-code native MapReduce functions directly.

The reference above also discusses the pros and cons of Pig compared with developing applications directly in MapReduce.

As with any higher-level language or abstraction, Pig/Hive trade away some flexibility and performance in exchange for developer productivity.
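
On the last point in the question (mixing hand-coded jobs with Pig): since Pig 0.8 a script can invoke an existing native MapReduce job via the MAPREDUCE operator, so a performance-sensitive stage can stay as hand-written Java while the rest of the pipeline moves to Pig. A minimal sketch; the jar, HDFS paths and driver class below are hypothetical:

```pig
-- Hypothetical sketch: the jar, HDFS paths and driver class are placeholders.
raw = LOAD '/data/events' USING PigStorage('\t') AS (user_id:long, url:chararray, ts:long);

-- Pig stores 'raw' where the native job expects its input, runs the job,
-- then loads whatever the job writes to its output path.
sessions = MAPREDUCE 'my-mr-jobs.jar'
           STORE raw INTO '/tmp/mr_input'
           LOAD '/tmp/mr_output' AS (user_id:long, session_len:long)
           `com.example.mr.SessionizeJob /tmp/mr_input /tmp/mr_output`;

-- Continue the pipeline in Pig with the hand-coded job's output.
by_user = GROUP sessions BY user_id;
avg_len = FOREACH by_user GENERATE group AS user_id, AVG(sessions.session_len) AS avg_session;
STORE avg_len INTO '/out/avg_session_len';
```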

Brophy answered 7/11, 2011 at 16:45
(I work on Pig at Twitter.) The 110-150% number is somewhat arbitrary. Frequently, Pig will be way faster than your code because it does a lot of optimizations. Fundamentally, it translates everything to MapReduce, so it can't be faster than MapReduce itself, but straightforward beginner-to-intermediate MapReduce code will frequently lose out to Pig. – Southey
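
(To see exactly what Pig compiles a script to, the EXPLAIN command prints the logical, physical and MapReduce plans for an alias; the alias name below is a hypothetical one from your own script.)

```pig
-- Print the logical, physical and MapReduce plans Pig generated for an alias.
EXPLAIN sorted;
```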

In this paper, as of 2009, it is stated that Pig runs about 1.5 times slower than plain MapReduce. Higher-level tools built on top of Hadoop are expected to be slower than plain MapReduce; on the other hand, making MapReduce perform optimally requires an advanced user who writes a lot of boilerplate code (e.g. binary comparators).

I find it relevant to mention a newer API called Pangool (which I'm a developer of) that aims to replace the plain Hadoop MapReduce API by making a lot of things easier to code and understand (secondary sort, reduce-side joins). Pangool imposes barely any performance overhead (about 5% in its first benchmark) and retains all the flexibility of the original MapReduce API.

Shulock answered 6/3, 2012 at 10:57
