Hadoop Reducer: How can I output to multiple directories with speculative execution enabled?
I have a reducer that needs to output results to different directories so that we can later use the output as input to Hive as a partitioned table (Hive creates partitions based on the folder names). To write to these locations we are currently not using any Hadoop framework support at all; we just write out to separate locations "behind Hadoop's back", so to speak. In other words, we are not using Hadoop's API to output these files.

We had issues with mapred.reduce.tasks.speculative.execution set to true. I understand this happens because multiple task attempts for the same task end up writing to the same location.

Is there a way to correctly use Hadoop's API to output to several different folders from the same reducer such that I can also use mapred.reduce.tasks.speculative.execution=true? (I know about MultipleOutputs, but I'm not sure whether it supports speculative execution.)

If so, is there a way to do that and output to S3?

Remitter answered 12/2, 2013 at 1:51 Comment(0)

The way Hadoop typically deals with speculative execution is to create an output folder for each task attempt (in a _temporary subfolder of the actual HDFS output directory).

The OutputCommitter for the OutputFormat then simply moves the contents of the temp task folder to the actual output folder when a task succeeds, and deletes the temp folders of attempts that failed or were aborted (this is the default behavior for most FileOutputFormats).

So for your case, if you are writing to a folder outside of the job output folder, you'll need to extend / implement your own output committer. I'd follow the same principles when creating the files: include the full task ID (including the attempt ID) to avoid name collisions when speculatively executing. How you track the files created in your job and manage the deletion in the abort / fail scenarios is up to you (maybe some file globbing on the task IDs?)
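
A rough sketch of that naming idea, assuming the reducer writes its own files outside the job output directory (the external partition path and class name below are hypothetical, not from the question):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExternalPartitionReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Full task attempt ID, e.g. "attempt_201302120151_0001_r_000003_0".
        // It is unique per attempt, so speculative attempts never collide.
        String attemptId = context.getTaskAttemptID().toString();

        // Hypothetical external Hive partition folder.
        Path partitionDir = new Path("/warehouse/my_table/dt=" + key.toString());
        Path outFile = new Path(partitionDir, "data-" + attemptId);

        FileSystem fs = outFile.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(outFile);
        try {
            for (Text value : values) {
                out.writeBytes(value.toString() + "\n");
            }
        } finally {
            out.close();
        }
        // A custom OutputCommitter would still have to promote the files of the
        // winning attempt and delete those of killed or failed attempts.
    }
}
```

A matching committer could then glob for the "data-attempt_*" files and keep only those belonging to the committed attempt.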

Asiatic answered 12/2, 2013 at 2:47 Comment(4)
Does MultipleOutputs somehow automatically support speculative execution? (I'm assuming 'no'.) – Remitter
Yes, it does. Like I said, the reducer output is written to a temp subfolder, and MultipleOutputs also puts its output into this temp subfolder. – Asiatic
So can I use a custom MultipleOutputs instead of a custom OutputCommitter, or will I still need to extend OutputCommitter in addition to this? – Remitter
If you want to output everything to the same output directory (as the part-r-xxxxx files), then you just need to use MultipleOutputs. Even if you don't, you can just move them once the job has completed. – Asiatic
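
For the "move them once the job has completed" route, a minimal driver-side sketch (all paths, the bucket name, and the class name are hypothetical, and S3 credentials are assumed to be configured):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class PromotePartitions {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Path jobOutput = new Path("/jobs/myjob/output");                  // hypothetical
        Path hiveTable = new Path("s3n://my-bucket/warehouse/my_table");  // hypothetical

        FileSystem srcFs = jobOutput.getFileSystem(conf);
        FileSystem dstFs = hiveTable.getFileSystem(conf);

        // Copy each partition subfolder (e.g. dt=2013-02-12) the job produced.
        for (FileStatus status : srcFs.listStatus(jobOutput)) {
            if (status.isDir() && status.getPath().getName().contains("=")) {
                FileUtil.copy(srcFs, status.getPath(),
                              dstFs, new Path(hiveTable, status.getPath().getName()),
                              false /* keep the source */, conf);
            }
        }
    }
}
```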

You might be interested in this: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
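
For reference, a minimal sketch of how MultipleOutputs (new mapreduce API) can write one subfolder per Hive partition under the job output directory; the partition column name, key/value types, and class name below are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionedOutputReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Writes to <job output>/dt=<key>/part-r-xxxxx. The files go through the
            // normal _temporary attempt folder, so speculative execution stays safe.
            mos.write(NullWritable.get(), value, "dt=" + key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```

In the driver, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) avoids empty default part-r-xxxxx files when everything is written through MultipleOutputs.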

Nonferrous answered 8/4, 2014 at 4:49 Comment(0)
