I have a reducer that needs to write its results to different directories so that we can later use the output as input to Hive as a partitioned table (Hive creates partitions based on folder names). We are currently not using any Hadoop facility to write to these locations; we just write to the separate locations "behind Hadoop's back", so to speak. In other words, we are not using Hadoop's API to output these files.
We had issues with mapred.reduce.tasks.speculative.execution set to true. I understand this to be the case because multiple task attempts for the same task are writing to the same location.
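To make the suspected collision concrete, here is a minimal sketch in plain Java (no Hadoop classes). The class name, the method names, the `/out` root, the partition folder name, and both directory layouts are hypothetical and only for illustration; this is not our actual code. It contrasts a path built without an attempt id, where two attempts of the same task resolve to the same file, with an attempt-scoped path, where they do not.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class AttemptPaths {
    // Hypothetical direct layout: <root>/<partition>/part-<taskId>.
    // The attempt id is not part of the path.
    public static Path directPath(String root, String partition, int taskId) {
        return Paths.get(root, partition, String.format("part-%05d", taskId));
    }

    // Hypothetical attempt-scoped layout: every attempt writes under its own
    // scratch directory, so concurrent attempts never share a file.
    public static Path attemptPath(String root, String partition, int taskId, int attemptId) {
        return Paths.get(root, "_temporary", "attempt_" + taskId + "_" + attemptId,
                partition, String.format("part-%05d", taskId));
    }

    public static void main(String[] args) {
        String partition = "dt=2013-01-01"; // assumed partition folder name

        // Two speculative attempts of task 7 using the direct layout.
        Path first = directPath("/out", partition, 7);
        Path second = directPath("/out", partition, 7);
        System.out.println("direct paths collide: " + first.equals(second));

        // The same two attempts using the attempt-scoped layout.
        Path scoped0 = attemptPath("/out", partition, 7, 0);
        Path scoped1 = attemptPath("/out", partition, 7, 1);
        System.out.println("scoped paths collide: " + scoped0.equals(scoped1));
    }
}
```

With the direct layout, nothing distinguishes one attempt's file from another's, which matches the behaviour we saw. An attempt-scoped layout would still need a step that promotes exactly one attempt's files to the final partition folder, and that step is not shown here.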
Is there a way to correctly use Hadoop's API to output to several different folders from the same reducer such that I can also use mapred.reduce.tasks.speculative.execution=true? (I know about MultipleOutputs, but I'm not sure whether it supports speculative execution.)
If so, is there a way to do that and output to S3?