Piping data into jobs in Hadoop MR/Pig

I have three different types of jobs running on data in HDFS. In the current setup, the three jobs have to be run separately. We now want to run them together by piping the output of one job into the next without writing the intermediate data to HDFS, to improve the architecture and overall performance.

Any suggestions are welcome for this scenario.

PS: Oozie does not fit this workflow. The Cascading framework is also ruled out because of scalability issues. Thanks.

Licht answered 16/12, 2014 at 13:58 Comment(0)

Hadoop inherently writes to storage (e.g. HDFS) after each M/R step. If you want to keep intermediate data in memory, you may need to look into something like Spark.
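
For illustration, here is a minimal Spark sketch in Java showing three chained transformation stages that stay in memory between stages. This is not from the original thread; the paths and the runJobN methods are hypothetical placeholders for the actual per-record logic of the three jobs, assuming each job can be expressed that way.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ChainedJobs {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("chained-jobs");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Read the input once from HDFS (hypothetical path).
            JavaRDD<String> input = sc.textFile("hdfs:///data/input");

            // Hypothetical stages standing in for the three jobs; each map()
            // is a lazy transformation, so nothing is written to HDFS in between.
            JavaRDD<String> afterJob1 = input.map(line -> runJob1(line));
            JavaRDD<String> afterJob2 = afterJob1.map(line -> runJob2(line));
            JavaRDD<String> afterJob3 = afterJob2.map(line -> runJob3(line));

            // Only the final result is materialized to HDFS.
            afterJob3.saveAsTextFile("hdfs:///data/output");
            sc.stop();
        }

        // Placeholders for the real per-record logic of each job.
        private static String runJob1(String line) { return line; }
        private static String runJob2(String line) { return line; }
        private static String runJob3(String line) { return line; }
    }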

Smash answered 18/12, 2014 at 13:50 Comment(1)
Thanks Marc. We are looking into that option as well.Licht

Oozie helps chain multiple Hadoop jobs (MapReduce, Pig, Hive, Java, etc.) together to form a data pipeline application. The built-in support for scheduling and Hadoop-related functions makes it much easier for developers to manage complex Hadoop jobs.

However, Oozie doesn't eliminate intermediate data storage in HDFS or elsewhere, such as the local file system or a database. To do that, you would need to introduce an in-memory data store, a message-queue system, or some other system that works at the scale of data you have.

Nematic answered 17/12, 2014 at 0:33 Comment(3)
I am working on the Oozie workflow process; let's see if something useful comes out of it. Is Cascading a good approach for my situation?Licht
Cascading may help in your scenario. It's similar to Pig or Hive in the sense that they convert data transformations expressed in domain-specific languages (Pig Latin, HiveQL, or Java) into map/reduce jobs underneath. It helps performance on the assumption that a well-maintained compiler can do a better job than an individual developer. :)Nematic
Thanks Paul. As of now, Cascading is not in the picture because of scalability issues. We are looking at writing Pig scripts for our MR jobs, and one other possible solution is Crunch.Licht
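
Since Crunch came up in the comment above, here is a hedged sketch of what a Crunch pipeline might look like. Crunch's planner compiles the pipeline into as few MapReduce jobs as it can and fuses consecutive DoFns where possible, so you do not manage intermediate files by hand. The class names, paths, and the pass-through DoFn are hypothetical placeholders.

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class CrunchChain {
        public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(CrunchChain.class, new Configuration());

            // Read the input once from HDFS (hypothetical path).
            PCollection<String> input = pipeline.readTextFile("hdfs:///data/input");

            // Hypothetical stages standing in for the three jobs; the planner
            // decides how many MR jobs are actually needed and fuses stages
            // where it can.
            PCollection<String> afterJob1 = input.parallelDo(new PassThroughFn(), Writables.strings());
            PCollection<String> afterJob2 = afterJob1.parallelDo(new PassThroughFn(), Writables.strings());
            PCollection<String> afterJob3 = afterJob2.parallelDo(new PassThroughFn(), Writables.strings());

            // Only the final result is written out explicitly.
            pipeline.writeTextFile(afterJob3, "hdfs:///data/output");
            pipeline.done();
        }

        // Placeholder DoFn; replace with the real per-record logic of each job.
        static class PassThroughFn extends DoFn<String, String> {
            @Override
            public void process(String input, Emitter<String> emitter) {
                emitter.emit(input);
            }
        }
    }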

You may try using Hue. Refer to: http://blog.cloudera.com/blog/2014/10/new-in-cdh-5-2-new-security-app-and-more-in-hue/

CDH 5.2 includes important new usability functionality via Hue, the open source GUI that makes Apache Hadoop easy to use. In addition to shipping a brand-new app for managing security permissions, this release is particularly feature-packed, and is becoming a great complement to BI tools from Cloudera partners like Tableau, MicroStrategy, and Zoomdata because a more usable Hadoop translates into better BI overall across your organization!

Diplostemonous answered 16/1, 2015 at 19:43 Comment(0)