How to unzip .gz files into a new directory in Hadoop?

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files into a new folder in HDFS. How should I do this?

Doublecheck answered 3/1, 2016 at 4:18 Comment(2)
Will this be of any help? – Wilburwilburn
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See "What topics can I ask about here?" in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask. – Intranuclear

I can think of three different ways to achieve it.

  1. Using Linux command line

    The following command worked for me.

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
    

    My gzipped file is Links.txt.gz and the output gets stored in /tmp/unzipped/Links.txt. (For looping over many files with this approach, see the bash sketch after this list.)

  2. Using Java program

    In the book Hadoop: The Definitive Guide, there is a section on codecs. In that section, there is a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is:

    package com.myorg.hadooptests;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }
            String outputUri =
                    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
    

    This code takes the gz file path as input.
    You can execute this as:

    FileDecompressor <gzipped file name>
    

    For example, when I executed it for my gzipped file:

    FileDecompressor /tmp/Links.txt.gz
    

    I got the unzipped file at location: /tmp/Links.txt

    It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.

  3. Using Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000
    

    The part-m-00000 contains the unzipped file.

    Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
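
For the original question of processing a whole folder of .gz files, here is a minimal bash sketch of the kind of wrapper script mentioned above; it simply loops the option 1 pipeline over every .gz file in an HDFS directory (the /data/gz and /data/unzipped paths are placeholders, not part of the original commands):

    #!/bin/bash
    # Sketch: decompress every .gz file in one HDFS folder using the option 1 pipeline.
    # /data/gz (input) and /data/unzipped (output) are example paths; adjust as needed.
    hadoop fs -mkdir -p /data/unzipped
    for f in $(hadoop fs -ls /data/gz/*.gz | awk '{print $8}'); do
        name=$(basename "$f" .gz)
        hadoop fs -cat "$f" | gzip -d | hadoop fs -put - /data/unzipped/"$name"
    done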

Braise answered 3/1, 2016 at 6:2 Comment(8)
Getting the error "No codec found for {Path}" in the Java code. Any suggestion? I checked that the file path is correct; still, null is assigned to codec. – Josuejosy
The package "org.apache.hadoop.io.compress" is part of "hadoop-common-<version>.jar". This jar is present in "$HADOOP_HOME/share/hadoop/common". Check if your classpath is set properly, e.g. check if "HADOOP_COMMON_HOME" is set to the correct path. It should work. – Braise
Same error. I added this jar as well, and HADOOP_COMMON_HOME is also correct. – Josuejosy
It is a ".zip" file. My file name is something like "positions_2012-02-14.dat.zip", and there is only one file inside the zip, "positions_2012-02-14.dat". – Josuejosy
No, I don't have any performance numbers comparing these options. – Braise
I would strongly recommend against option #1 unless you're running it on a node itself. The total network usage would be the sum of the compressed and decompressed sizes of the file. – Aletaaletha
Using the first way, can I do it for multiple files? – Unpolite
I think you will have to write a script that reads all the files and unzips them one at a time. – Braise

If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).

hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
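
If you need to do this for a whole directory of files, one option is a small loop over the listing (a sketch; /data/gz and /data/uncompressed are placeholder paths):

hadoop fs -mkdir -p /data/uncompressed
for f in $(hadoop fs -ls /data/gz/*.gz | awk '{print $8}'); do
    hadoop fs -text "$f" | hadoop fs -put - /data/uncompressed/"$(basename "$f" .gz)"
done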
Ecclesiology answered 30/4, 2017 at 8:30 Comment(0)

Bash solution

In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure that all files inside the zip archives would end up extracted on HDFS.

I created a simple bash script. The comments should give you a clue about what is going on; a short description follows below.

#!/bin/bash

workdir=/tmp/unziphdfs/
mkdir -p $workdir   # make sure the local working directory exists
cd $workdir

# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
for hdfsfile in $zips
do
    echo $hdfsfile

    # copy to temp folder to unpack
    hdfs dfs -copyToLocal $hdfsfile $workdir

    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")

    # unpack locally and remove
    unzip $zipname
    rm -rf $zipname

    # copy files back to hdfs
    files=$(ls $workdir)
    for file in $files; do
       hdfs dfs -copyFromLocal $file $hdfsdir
       rm -rf $file
    done

    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done

Description

  1. Get all the *.zip files in an hdfs dir
  2. One-by-one: copy zip to a temp dir (on filesystem)
  3. Unzip
  4. Copy all the extracted files to the dir of the zip file
  5. Cleanup

I managed to get it working with a sub-directory structure containing many zip files each, using /mypath/*/*.zip.

Good luck :)

Marguerite answered 16/6, 2017 at 15:13 Comment(0)

You can do this using Hive (assuming it is text data).

create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;

The data will be uncompressed into a new set of files.

If you do not want to change the names, and you have enough storage on the node where you are running, you can do this:

hadoop fs -get <your_source_directory> <directory_name>
It will create a directory in the location where you run the hadoop command. cd into it and gunzip all the files, then:
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
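
A concrete version of that sequence might look like this (a sketch; local_gz and the HDFS paths are only example names):

hadoop fs -get /data/gz local_gz            # copy the HDFS directory to the local disk
cd local_gz && gunzip *.gz && cd ..         # decompress everything locally
hadoop fs -moveFromLocal local_gz /data/unzipped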
Propriety answered 3/1, 2016 at 6:1 Comment(2)
I like this approach. If <directory_with_gz_files> has just one .gz file, does this approach use more than one mapper? I.e., does it gunzip in parallel, or will it be a single-threaded operation? AFAIK, gz is not splittable. Thanks. – Shortwave
If your default format is not text (e.g. ORC), remember to add STORED AS TEXTFILE to the source table definition. – Metalepsis

Providing the Scala code:

import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession

val conf = new org.apache.hadoop.conf.Configuration()

// Decompresses a single compressed file on HDFS (codec picked from the file extension)
// and writes the uncompressed output to uncompPath. Returns the output path.
def extractFile(sparkSession: SparkSession, compath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compath)

  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No codec found for $compath")
  }

  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    // Stream-decompress directly from HDFS to HDFS, no local copy needed
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  uncompPath
}
Gregorio answered 20/5, 2021 at 9:10 Comment(0)

Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.

Haimes answered 14/1, 2018 at 21:33 Comment(1)
How do I read a zip file from an HDFS location and unzip it to another HDFS location? – Sematic
