How to unzip .gz files into a new directory in Hadoop?

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files into a new folder in HDFS. How should I do this?

Doublecheck answered 3/1, 2016 at 4:18 Comment(2)
Will this be of any help? – Wilburwilburn
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See "What topics can I ask about here?" in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask. – Intranuclear

I can think of three different ways to achieve it.

  1. Using Linux command line

    The following command worked for me.

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
    

    My gzipped file is Links.txt.gz and the output gets stored in /tmp/unzipped/Links.txt. (For looping over many files with this approach, see the bash sketch after this list.)

  2. Using Java program

    In the book Hadoop: The Definitive Guide, there is a section on codecs. In that section, there is a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is:

    package com.myorg.hadooptests;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }
            String outputUri =
                    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
    

    This code takes the gz file path as input.
    You can execute this as:

    FileDecompressor <gzipped file name>
    

    For example, when I executed it for my gzipped file:

    FileDecompressor /tmp/Links.txt.gz
    

    I got the unzipped file at location: /tmp/Links.txt

    It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.

  3. Using Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000
    

    The part-m-00000 contains the unzipped file.

    Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
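
For the original question of processing a whole folder of .gz files, here is a minimal bash sketch of the kind of wrapper script mentioned above; it simply loops the option 1 pipeline over every .gz file in an HDFS directory (the /data/gz and /data/unzipped paths are placeholders, not part of the original commands):

    #!/bin/bash
    # Sketch: decompress every .gz file in one HDFS folder using the option 1 pipeline.
    # /data/gz (input) and /data/unzipped (output) are example paths; adjust as needed.
    hadoop fs -mkdir -p /data/unzipped
    for f in $(hadoop fs -ls /data/gz/*.gz | awk '{print $8}'); do
        name=$(basename "$f" .gz)
        hadoop fs -cat "$f" | gzip -d | hadoop fs -put - /data/unzipped/"$name"
    done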

Braise answered 3/1, 2016 at 6:2 Comment(8)
Getting the error "No codec found for {Path}" in the Java code. Any suggestion? I checked that the file path is correct; still, null is assigned to codec. – Josuejosy
The package "org.apache.hadoop.io.compress" is part of "hadoop-common-<version>.jar". This jar is present in "$HADOOP_HOME/share/hadoop/common". Check if your classpath is set properly, e.g. check if "HADOOP_COMMON_HOME" is set to the correct path. It should work. – Braise
Same error. I added this jar as well, and HADOOP_COMMON_HOME is also correct. – Josuejosy
It is a ".zip" file. My file name is something like "positions_2012-02-14.dat.zip", and there is only one file inside the zip, "positions_2012-02-14.dat". – Josuejosy
No, I don't have any performance numbers comparing these options. – Braise
I would strongly recommend against option #1 unless you're running it on a node itself. The total network usage would be the sum of the compressed and decompressed sizes of the file. – Aletaaletha
Using the first way, can I do it for multiple files? – Unpolite
I think you will have to write a script that reads all the files and unzips them one at a time. – Braise

If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).

hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
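
If you need to do this for a whole directory of files, one option is a small loop over the listing (a sketch; /data/gz and /data/uncompressed are placeholder paths):

hadoop fs -mkdir -p /data/uncompressed
for f in $(hadoop fs -ls /data/gz/*.gz | awk '{print $8}'); do
    hadoop fs -text "$f" | hadoop fs -put - /data/uncompressed/"$(basename "$f" .gz)"
done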
Ecclesiology answered 30/4, 2017 at 8:30 Comment(0)

Bash solution

In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure that all files inside the zip archives would end up extracted on HDFS.

I created a simple bash script. The comments should give you a clue about what is going on; a short description follows below.

#!/bin/bash

workdir=/tmp/unziphdfs/
mkdir -p $workdir   # make sure the local working directory exists
cd $workdir

# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
for hdfsfile in $zips
do
    echo $hdfsfile

    # copy to temp folder to unpack
    hdfs dfs -copyToLocal $hdfsfile $workdir

    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")

    # unpack locally and remove
    unzip $zipname
    rm -rf $zipname

    # copy files back to hdfs
    files=$(ls $workdir)
    for file in $files; do
       hdfs dfs -copyFromLocal $file $hdfsdir
       rm -rf $file
    done

    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done

Description

  1. Get all the *.zip files in an hdfs dir
  2. One-by-one: copy zip to a temp dir (on filesystem)
  3. Unzip
  4. Copy all the extracted files to the dir of the zip file
  5. Cleanup

I managed to get it working with a sub-directory structure containing many zip files each, using /mypath/*/*.zip.

Good luck :)

Marguerite answered 16/6, 2017 at 15:13 Comment(0)

You can do this using Hive (assuming it is text data).

create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;

The data will be uncompressed into a new set of files.

If you do not want to change the names, and you have enough storage on the node where you are running, you can do this:

hadoop fs -get <your_source_directory> <directory_name>
It will create a directory in the location where you run the hadoop command. cd into it and gunzip all the files, then:
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
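
A concrete version of that sequence might look like this (a sketch; local_gz and the HDFS paths are only example names):

hadoop fs -get /data/gz local_gz            # copy the HDFS directory to the local disk
cd local_gz && gunzip *.gz && cd ..         # decompress everything locally
hadoop fs -moveFromLocal local_gz /data/unzipped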
Propriety answered 3/1, 2016 at 6:1 Comment(2)
I like this approach. If <directory_with_gz_files> has just one .gz file, does this approach use more than one mapper? I.e., does it gunzip in parallel, or will it be a single-threaded operation? AFAIK, gz is not splittable. Thanks. – Shortwave
If your default format is not text (e.g. ORC), remember to add STORED AS TEXTFILE to the source table definition. – Metalepsis

Providing the Scala code:

import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession

val conf = new org.apache.hadoop.conf.Configuration()

// Decompresses a single compressed file on HDFS (codec picked from the file extension)
// and writes the uncompressed output to uncompPath. Returns the output path.
def extractFile(sparkSession: SparkSession, compath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compath)

  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No codec found for $compath")
  }

  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    // Stream-decompress directly from HDFS to HDFS, no local copy needed
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  uncompPath
}
Gregorio answered 20/5, 2021 at 9:10 Comment(0)

Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.

Haimes answered 14/1, 2018 at 21:33 Comment(1)
How do I read a zip file from an HDFS location and unzip it to another HDFS location? – Sematic
