How can I concatenate two files in hadoop into one using Hadoop FS shell?
Asked Answered
P

2

7

I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)

Here is the command I'm submitting (names have been changed):

**/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/**

It returns bash: /user/username/folder/outputdirectory/: No such file or directory

I also tried creating that directory and then running it again -- I still got the 'No such file or directory' error.

I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them, but had no luck with getmerge either.

The reason for doing this in Hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of Hadoop.

Persuasion answered 15/5, 2012 at 19:43 Comment(0)
L
10

The error is because you are trying to redirect the standard output of the command back into HDFS. You can do this using the hadoop fs -put command, with the source argument being a hyphen:

bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv

-getmerge also outputs to the local file system, not HDFS.
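If a round trip through your local filesystem is acceptable, you could still use -getmerge followed by -put to finish the job. A rough sketch with placeholder paths (this assumes the source folder contains only the files you want merged and that you have enough local disk space, since the data is written to your local machine first):

hadoop fs -getmerge /user/username/folder /tmp/merged.csv
hadoop fs -put /tmp/merged.csv /user/username/folder/outputdirectory/output.csv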

Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'append' support, but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS. You can do that either with:

  • a custom MapReduce job with a single reducer and a mapper that retains the file ordering (remember each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself)
  • the FsShell commands, depending on your network topology - i.e. does your client console have a fast connection to the datanodes? This is certainly the least effort on your part, and will probably complete quicker than an MR job doing the same (everything has to go through one machine anyway, so why not your local console?) - see the rough sketch after this list
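As a rough sketch of that FsShell route when there are many part files (hypothetical paths; this assumes the file names sort lexicographically into the order you want them concatenated and contain no spaces):

hadoop fs -ls /user/username/folder | grep -v '^Found' | awk '{print $NF}' | sort | xargs hadoop fs -cat | hadoop fs -put - /user/username/folder/merged.csv

The grep drops the "Found N items" header from the listing, awk keeps only the path column, and the sorted list of paths is handed to a single hadoop fs -cat whose combined output is streamed straight back into HDFS.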
Legislature answered 15/5, 2012 at 21:6 Comment(2)
Using hadoop fs -put as you suggested did exactly what I needed -- it concatenated the two CSVs into a third file on HDFS. Thank you so much for your help Chris!Persuasion
Append support is not available in 0.20.2; it was merged into a later release.Darnel
C
6

To concatenate all files in a folder into one output file:

hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt

If you have multiple folders on HDFS and you want to concatenate the files in each of those folders, you can use a shell script to do this. (Note: this is not very efficient and can be slow.)

Syntax:

for i in `hadoop fs -ls <folder> | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/<outputfilename>; done

eg:

for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/output.csv; done

Explanation: you basically loop over all the folders and cat each folder's contents into an output file on HDFS.
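A slightly more robust spelling of the same loop (my own variant, not from the original answer: it takes the last field of the listing with awk instead of counting space-separated columns with cut, skips the "Found N items" header line, and quotes the paths; it still assumes the folder names contain no spaces):

for dir in $(hadoop fs -ls my-job-folder | grep -v '^Found' | awk '{print $NF}'); do hadoop fs -cat "$dir"/* | hadoop fs -put - "$dir"/output.csv; done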

Condensable answered 3/11, 2014 at 19:17 Comment(1)
I get a syntax error and it's not working as expected; getting the error below, appreciate it if anybody can help: [hadoop@ip-10-171-17-77 ~]$ for i in { ${header}}, ${input_location} } ; do hadoop fs -cat $i/* | hadoop fs -put - ${input_location}/test.txt ; done cat: Illegal file pattern: Unclosed group near index 1 put: /user/hadoop/wmg_monthly_plus/test.txt': File exists put: /user/hadoop/wmg_monthly_plus/test.txt': File exists cat: Unable to write to output stream.Marijn
