Editing a multi-million-row file on a Hadoop cluster
I am trying to edit large files on a Hadoop cluster to trim whitespace and special characters like ¦, *, @, " etc. from them. I don't want to copyToLocal and use sed, as I have thousands of such files to edit.

Fortunetelling answered 20/2, 2014 at 19:28 Comment(0)

MapReduce is perfect for this. Good thing you have it in HDFS!

You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=0 \
   -input MyLargeFiles \
   -output outputdir \
   -mapper "sed ..."

This will fire up a MapReduce job that applies your sed command to every line of every input file. Since there are thousands of files, you'll have several mapper tasks hitting the files at once, and the output goes right back into the cluster.

Note that I set the number of reducers to 0 here, because a reducer isn't really needed. If you want your output to be a single file, then use one reducer, but don't specify -reducer; that falls back to the identity reducer and effectively just funnels everything into one output file. The mapper-only version is definitely faster.
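If your cleanup needs several substitutions, chaining them with sed's -e option keeps the whole -mapper argument a single, cleanly quoted string; nesting multiple quoted sed commands inside the -mapper value is what typically breaks. A minimal sketch of the chaining itself (the sample characters here, pipe, double quote, and spaces, are placeholders; swap in whatever you need to strip):

```shell
# One sed invocation chaining several substitutions with -e.
# Single quotes around each expression protect characters like | and "
# from the shell; adjust the patterns to your own special characters.
printf 'foo | bar " baz\n' | sed -e 's/|//g' -e 's/"//g' -e 's/ //g'
# prints: foobarbaz
```

In the streaming job you would then pass the same command as, for example, -mapper "sed -e 's/|//g' -e 's/ //g'", escaping any double quote in a pattern as \" inside the outer double quotes.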


Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream each file through the node and push it back up without it ever hitting local disk. Here's an example:

$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt

The - in hadoop fs -put tells it to take data from stdin instead of a file.
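Since there are thousands of files, the cat | sed | put pipeline above can be wrapped in a loop over an HDFS directory listing. A sketch under assumptions: the paths and sed patterns below are placeholders, not from the original post, and the awk filter relies on `hadoop fs -ls` printing the path as the last column of its file lines.

```shell
# Keep only the path column of `hadoop fs -ls` file lines
# (the "Found N items" header has fewer fields and is dropped).
list_paths() { awk 'NF > 7 {print $NF}'; }

# Run the cat | sed | put pipeline once per file, preserving file names.
# Guarded so the sketch is a no-op on machines without the hadoop CLI.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -ls /input/dir | list_paths | while read -r f; do
        hadoop fs -cat "$f" \
            | sed -e 's/|//g' -e 's/"//g' -e 's/ //g' \
            | hadoop fs -put - "/output/dir/$(basename "$f")"
    done
fi
```

Note that this pulls every byte through the one node you run it on, which is exactly why the MapReduce approach above scales better.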

Goering answered 20/2, 2014 at 19:40 Comment(2)
Thanks Donald. It helps. :) Cheers. – Fortunetelling
Hi, the second option works fine, but not the MapReduce method. I want to apply multiple sed operations on the file, e.g. hadoop fs -cat file1 | sed '1d' | sed 's/^A//g' | sed 's/|//g' | sed 's/"//g' | sed 's/ \+//g' | hadoop fs -put - file2. If I use MapReduce, it's not working for ^A, " and spaces. Error: /bin/sed: can't read s/ \+//g: No such file or directory. I am trying: hadoop jar /path/to/hadoop/hadoop-streaming.jar -D mapred.reduce.tasks=1 -input file1 -output file2 -mapper "sed 's/|//g';'s/ \+//g';'s/"//g';'other sed operations'". I think I am wrong on the mapper part. Please correct me. – Fortunetelling

© 2022 - 2024 — McMap. All rights reserved.