Editing a multi-million-row file on a Hadoop cluster
I am trying to edit large files on a Hadoop cluster to trim whitespace and special characters like ¦, *, @, " etc. from them. I don't want to copyToLocal and use sed, as I have thousands of such files to edit.

Fortunetelling answered 20/2, 2014 at 19:28 Comment(0)

MapReduce is perfect for this. Good thing you have it in HDFS!

You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=0 \
   -input MyLargeFiles \
   -output outputdir \
   -mapper "sed ..."

This will fire up a MapReduce job that applies your sed command to every line of every input file. Since there are thousands of files, you'll have several mapper tasks hitting the files at once, and the output goes right back into the cluster.

Note that I set the number of reducers to 0 here, because a reducer isn't really needed. If you want your output to be a single file, then use one reducer, but don't specify -reducer; that falls back to the identity reducer and effectively just funnels everything into one output file. The mapper-only version is definitely faster.
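If your cleanup needs several substitutions, chaining them with sed's -e option keeps the whole -mapper argument a single, cleanly quoted string; nesting multiple quoted sed commands inside the -mapper value is what typically breaks. A minimal sketch of the chaining itself (the sample characters here, pipe, double quote, and spaces, are placeholders; swap in whatever you need to strip):

```shell
# One sed invocation chaining several substitutions with -e.
# Single quotes around each expression protect characters like | and "
# from the shell; adjust the patterns to your own special characters.
printf 'foo | bar " baz\n' | sed -e 's/|//g' -e 's/"//g' -e 's/ //g'
# prints: foobarbaz
```

In the streaming job you would then pass the same command as, for example, -mapper "sed -e 's/|//g' -e 's/ //g'", escaping any double quote in a pattern as \" inside the outer double quotes.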


Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream each file through the node and push it back up without it ever hitting local disk. Here's an example:

$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt

The - in hadoop fs -put tells it to take data from stdin instead of a file.
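Since there are thousands of files, the cat | sed | put pipeline above can be wrapped in a loop over an HDFS directory listing. A sketch under assumptions: the paths and sed patterns below are placeholders, not from the original post, and the awk filter relies on `hadoop fs -ls` printing the path as the last column of its file lines.

```shell
# Keep only the path column of `hadoop fs -ls` file lines
# (the "Found N items" header has fewer fields and is dropped).
list_paths() { awk 'NF > 7 {print $NF}'; }

# Run the cat | sed | put pipeline once per file, preserving file names.
# Guarded so the sketch is a no-op on machines without the hadoop CLI.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -ls /input/dir | list_paths | while read -r f; do
        hadoop fs -cat "$f" \
            | sed -e 's/|//g' -e 's/"//g' -e 's/ //g' \
            | hadoop fs -put - "/output/dir/$(basename "$f")"
    done
fi
```

Note that this pulls every byte through the one node you run it on, which is exactly why the MapReduce approach above scales better.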

Goering answered 20/2, 2014 at 19:40 Comment(2)
Thanks Donald. It helps. :) Cheers. – Fortunetelling
Hi, the second option works fine, but not the MapReduce method. I want to apply multiple sed operations on the file, e.g. hadoop fs -cat file1 | sed '1d' | sed 's/^A//g' | sed 's/|//g' | sed 's/"//g' | sed 's/ \+//g' | hadoop fs -put - file2. If I use MapReduce, it's not working for ^A, " and spaces. Error: /bin/sed: can't read s/ \+//g: No such file or directory. I am trying: hadoop jar /path/to/hadoop/hadoop-streaming.jar -D mapred.reduce.tasks=1 -input file1 -output file2 -mapper "sed 's/|//g';'s/ \+//g';'s/"//g';'other sed operations'". I think I am wrong on the mapper part. Please correct me. – Fortunetelling

© 2022 - 2024 — McMap. All rights reserved.