GZIPInputStream reading line by line
A

5

96

I have a file in .gz format. The Java class for reading this file is GZIPInputStream. However, this class doesn't extend Java's BufferedReader class, so I can't read the file line by line. I need something like this:

reader  = new MyGZInputStream( some constructor of GZInputStream) 
reader.readLine()...

I thought of creating my own class that extends Java's Reader or BufferedReader class and uses a GZIPInputStream as one of its fields.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Reader;
import java.util.zip.GZIPInputStream;

public class MyGZFilReader extends Reader {

    private GZIPInputStream gzipInputStream = null;
    char[] buf = new char[1024];

    @Override
    public void close() throws IOException {
        gzipInputStream.close();
    }

    public MyGZFilReader(String filename)
               throws FileNotFoundException, IOException {
        gzipInputStream = new GZIPInputStream(new FileInputStream(filename));
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        // TODO Auto-generated method stub
        return gzipInputStream.read((byte[])buf, off, len);
    }

}

But this doesn't work when I use:

BufferedReader in = new BufferedReader(
    new MyGZFilReader("F:/gawiki-20090614-stub-meta-history.xml.gz"));
System.out.println(in.readLine());

Can someone advise how to proceed?

Alp answered 3/7, 2009 at 18:19 Comment(2)
Look at this link: https://mcmap.net/q/219276/-how-can-i-zip-and-unzip-a-string-using-gzipoutputstream-that-is-compatible-with-net/779408. Compress and decompress methods are presented there.Domain
For the love of all that is good and right in this world and for the sanity of any developers who write even remotely worthwhile code.....BE AWARE OF ENCODING AS @erickson POINTS OUT! He is the only answer that points this out, which makes me want to cry.Banausic
T
156

The basic setup of decorators is like this:

InputStream fileStream = new FileInputStream(filename);
InputStream gzipStream = new GZIPInputStream(fileStream);
Reader decoder = new InputStreamReader(gzipStream, encoding);
BufferedReader buffered = new BufferedReader(decoder);

The key issue in this snippet is the value of encoding. This is the character encoding of the text in the file. Is it "US-ASCII", "UTF-8", "SHIFT-JIS", "ISO-8859-9", …? There are hundreds of possibilities, and the correct choice usually cannot be determined from the file itself. It must be specified through some out-of-band channel.

For example, maybe it's the platform default. In a networked environment, however, this is extremely fragile. The machine that wrote the file might sit in the neighboring cubicle, but have a different default file encoding.

Most network protocols use a header or other metadata to explicitly note the character encoding.

In this case, it appears from the file extension that the content is XML. XML includes the "encoding" attribute in the XML declaration for this purpose. Furthermore, XML should really be processed with an XML parser, not as text. Reading XML line-by-line seems like a fragile, special case.
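
As a rough sketch of that suggestion, the gzip stream can be handed directly to a streaming XML parser, which reads the encoding from the XML declaration itself. The example below uses the JDK's StAX API; the class name `GzipXmlScan` and the method `countElements` are hypothetical names chosen for illustration, and counting elements stands in for whatever processing is actually needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class GzipXmlScan {
    // Hypothetical helper: counts start tags whose local name matches localName.
    static int countElements(Path path, String localName)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not resolve external entities while parsing.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        try (InputStream in = new GZIPInputStream(Files.newInputStream(path))) {
            // Passing an InputStream (not a Reader) lets the parser detect
            // the encoding from the XML declaration.
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            try {
                int count = 0;
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT
                            && localName.equals(xml.getLocalName())) {
                        count++;
                    }
                }
                return count;
            } finally {
                xml.close(); // releases parser resources; the stream is closed by try-with-resources
            }
        }
    }
}
```

Because the parser pulls events one at a time, this approach should also cope with large files without loading them into memory.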

Failing to explicitly specify the encoding is against the second commandment. Use the default encoding at your peril!
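
For text that really is line-oriented, here is a minimal sketch of the decorator chain above with the encoding passed in explicitly. The class name `GzipLines` and the method `open` are hypothetical, and the caller is assumed to know the correct charset (for example `StandardCharsets.UTF_8`).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

class GzipLines {
    // Hypothetical helper: opens a gzip-compressed text file for line-by-line
    // reading, decoding bytes with the charset supplied by the caller.
    static BufferedReader open(Path path, Charset charset) throws IOException {
        InputStream fileStream = Files.newInputStream(path);
        try {
            InputStream gzipStream = new GZIPInputStream(fileStream);
            return new BufferedReader(new InputStreamReader(gzipStream, charset));
        } catch (IOException e) {
            // The GZIPInputStream constructor reads the header and can fail;
            // close the underlying file so the handle does not leak.
            fileStream.close();
            throw e;
        }
    }
}
```

A caller would typically wrap the result in try-with-resources, e.g. `try (BufferedReader r = GzipLines.open(path, StandardCharsets.UTF_8)) { ... }`, so that closing the reader closes the whole chain.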

Toting answered 3/7, 2009 at 18:24 Comment(8)
Thanks, it worked. However, there is no need for the separate Reader step; we can also write it as GZIPInputStream gzip = new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz")); BufferedReader br = new BufferedReader(new InputStreamReader(gzip));Alp
@KapilD it makes me sad that you completely missed his point about the encoding...as shown by your comment and the example in your comment. Re-read erickson's answer....maybe 30 times over.Banausic
How does the gzip command know the encoding? I want to read a lot of files from a lot of Linux/Unix servers all over the world, so I want to make sure I do this right. The post mentions that the encoding usually cannot be determined from the file itself, but the gzip -d command seems to work on any file without separate input (it's what I use now, but I want to avoid it). So I figure that if I can work out what gzip does to know the encoding, I can do the same. Any thoughts or suggestions? Can anyone point me in the right direction?Attestation
@Attestation Your question is not clear. Do you mean how can you recognize a gzip file in the absence of some external assertion about content type? One hint is the file extension, another is the presence of the magic number 0x1F8B in the file header. However, you can't know a file is a valid gzip file until you actually process the whole thing.Toting
To be clear, I know these files are gzip files, and the gzipped files are all text-based, like CSV and pipe-delimited files. I just want to be able to read these files directly with Java, line by line. I can gzip -d them and then read them line by line with no problem. I was just confused by your comments about having to specify the encoding. I would think most of the files are ASCII, but some might have Asian characters, so maybe UTF-8? I just want to make sure I do this correctly. Is that any clearer? Thanks!Attestation
@Attestation I see. The gzip command doesn't know the text encoding; it stops after decompressing the file as a string of bytes. It's the other tools on your system that are making assumptions about the character encoding when they render the content as text. In my example, I use an "encoding" parameter that has to be specified somehow. There are some heuristics you can use to guess at the encoding based on content, but they aren't guaranteed.Toting
I see. So the encoding is not needed to unzip the file, but to read it afterwards? Currently I gzip -d the file, then open it in Java and read through it line by line with no problem, without specifying an encoding. I am using a FileReader (instead of an InputStreamReader) in my BufferedReader. Is that FileReader doing the heuristics for me, then? Can the InputStreamReader do the same heuristics? Or is there existing code I can use for the heuristics? I would prefer to use something well tested rather than write my own. Can you point me in the right direction? Thanks!Attestation
Against the Second Commandment then means .. what .. we are idolizing /making an idol of the default encoding?Iffy
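
On the magic number mentioned in the comments above: a quick check of the first two bytes can be sketched as follows. The class name `GzipMagic` and the method `looksLikeGzip` are hypothetical, and, as noted in the comments, a match is only a hint; it does not prove that the rest of the file is valid gzip data.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class GzipMagic {
    // Hypothetical helper: returns true if the file starts with the two
    // gzip magic bytes 0x1F 0x8B. This is a hint, not a validity check.
    static boolean looksLikeGzip(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();  // -1 if the file has fewer than two bytes
            return first == 0x1F && second == 0x8B;
        }
    }
}
```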
R
45
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz"));
BufferedReader br = new BufferedReader(new InputStreamReader(gzip));
br.readLine();

Rutheruthenia answered 3/7, 2009 at 18:23 Comment(1)
Your answer is great: short and concise. However, erickson's answer is more detailed.Alp
S
4
BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz"))));

String content;

while ((content = in.readLine()) != null) {
    System.out.println(content);
}
Salutation answered 22/8, 2012 at 13:6 Comment(0)
D
3

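You can put the following method in a utility class and use it whenever necessary:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Reads every line of a gzip-compressed text file into memory.
// Note: InputStreamReader without a charset uses the platform default encoding.
public static List<String> readLinesFromGZ(String filePath) {
    List<String> lines = new ArrayList<>();
    File file = new File(filePath);

    try (GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(file));
            BufferedReader br = new BufferedReader(new InputStreamReader(gzip))) {
        String line;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
    } catch (IOException e) {
        // FileNotFoundException is a subclass of IOException, so one handler covers both.
        e.printStackTrace(System.err);
    }
    return lines;
}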
Delano answered 23/7, 2018 at 12:10 Comment(0)
I
3

Here it is as a single statement:

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                new FileInputStream(
                    "F:/gawiki-20090614-stub-meta-history.xml.gz"))))) {
    br.readLine();
}
Intermigration answered 9/11, 2018 at 10:58 Comment(0)
