How does Hadoop -getmerge work?

The Hadoop getmerge description gives:

Usage: hdfs dfs -getmerge src localdst [addnl]

My question is: why does getmerge concatenate to the local destination and not to HDFS itself? I ask because of the following concerns:

  1. What if the files to be merged are larger than the free space on the local file system?
  2. Is there a specific reason for restricting the hadoop -getmerge command to concatenating only to a local destination?
Complexity answered 15/4, 2016 at 6:51 Comment(1)
I know that this is not your question, but perhaps you will find this post useful: #21776839 - Gloria

The getmerge command was created specifically for merging files from HDFS into a single file on the local file system.

It is very useful for downloading the output of a MapReduce job, which may consist of multiple part-* files, and combining them into a single local file that you can use for other operations (e.g. putting it in an Excel sheet for presentation).

Answers to your questions:

  1. If the destination file system does not have enough space, an IOException is thrown. Internally, getmerge uses the IOUtils.copyBytes() function to copy one file at a time from HDFS to the local file. This function throws an IOException whenever there is an error in the copy operation.

  2. This command is along the same lines as the hdfs dfs -get command, which copies a file from HDFS to the local file system. The only difference is that hdfs dfs -getmerge merges multiple files from HDFS into one file on the local file system.
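The copy loop described in point 1 can be sketched with plain java.io streams. This is a simplified stand-in and not Hadoop's actual source: `MergeSketch`, `mergeInto`, and the `addNewline` flag (loosely modelled on the `[addnl]` argument) are hypothetical names, and local files stand in for HDFS files. The point it illustrates is that any failed write, such as a full disk, surfaces as an IOException from the copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration of a getmerge-style loop; not Hadoop code.
final class MergeSketch {

    /** Appends each source file to dst in order, optionally adding a newline after each one. */
    public static void mergeInto(List<Path> sources, Path dst, boolean addNewline)
            throws IOException {
        // Creates dst, or truncates it if it already exists.
        try (OutputStream out = Files.newOutputStream(dst)) {
            byte[] buffer = new byte[4096];
            for (Path src : sources) {
                // One source file at a time, as in the description above.
                try (InputStream in = Files.newInputStream(src)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        // A failed write (e.g. no space left) throws IOException here.
                        out.write(buffer, 0, n);
                    }
                }
                if (addNewline) {
                    out.write('\n');
                }
            }
        }
    }
}
```

Because the exception is only raised once a write actually fails, a partially written destination file can be left behind and needs to be cleaned up by the caller.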

If you want to merge multiple files within HDFS, you can do so with the copyMerge() method of the FileUtil class.

This API copies all files in a directory into a single file (i.e. it merges all the source files).
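As far as I know, copyMerge() is no longer available in Hadoop 3.x, so an HDFS-to-HDFS merge may need to be written by hand. Below is a minimal sketch of that idea using FileSystem and IOUtils.copyBytes(); the class name `HdfsMergeSketch` and the two command-line arguments are assumptions made for illustration, and it needs the Hadoop client libraries on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper: merges every file in an HDFS directory into one HDFS file.
public class HdfsMergeSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path(args[0]);             // e.g. a job output directory
        Path dstFile = new Path(args[1]);            // should live outside srcDir

        try (OutputStream out = fs.create(dstFile)) {
            FileStatus[] parts = fs.listStatus(srcDir);
            Arrays.sort(parts);                      // FileStatus orders by path
            for (FileStatus part : parts) {
                if (!part.isFile()) {
                    continue;                        // skip sub-directories
                }
                try (InputStream in = fs.open(part.getPath())) {
                    // close=false: the streams are closed by try-with-resources.
                    IOUtils.copyBytes(in, out, conf, false);
                }
            }
        }
    }
}
```

The data still flows through the client that runs this code, but it is never written to the client's local disk, which avoids the local-space concern from the question.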

Libbylibeccio answered 16/4, 2016 at 7:32 Comment(2)
Is there an alternative to -getmerge that merges the files directly from HDFS to HDFS? - Complexity
There is no command-line functionality for that. As I mentioned in the answer, you need to use "FileUtil.copyMerge()" programmatically or use a shell trick like the one mentioned here: #3548759. For example: hadoop fs -cat [dir]/* | hadoop fs -put - [destination file] - Libbylibeccio
