How to read compressed HTML page with Content-Encoding : gzip

I request a web page that sends a Content-Encoding: gzip header, but I am stuck on how to read the response.

My code:

    try {
        URLConnection connection = new URL("http://jquery.org").openConnection();
        String html = "";
        BufferedReader in = null;
        connection.setReadTimeout(10000);
        in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            html += inputLine + "\n";
        }
        in.close();
        System.out.println(html);
        System.exit(0);
    } catch (IOException ex) {
        Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
    }

The output is garbled (I was unable to paste it here; it is just a run of symbols).

I believe the content is compressed. How do I read it?

Note:
If I change jquery.org to jquery.com (which doesn't send that header), my code works fine.

Gerius answered 19/6, 2012 at 1:11 Comment(0)
G
5

There is a class for this: GZIPInputStream. It is an InputStream, so you can wrap the connection's stream in it and read from it as usual.

Geometrize answered 19/6, 2012 at 1:13 Comment(1)
To get it to work in both cases, you need to look at the "Content-Encoding" header that is returned. If its value is "gzip", wrap the stream in a GZIPInputStream; otherwise read it directly. (Suppositive)
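
The rule in the comment above can be sketched as a small helper. This is only an illustration, not code from the thread: the class name `GzipAwareReader` and the methods `wrapIfGzip`, `readAll` and `gzip` are hypothetical, UTF-8 is an assumed charset, and an in-memory byte array stands in for `connection.getInputStream()` so the sketch runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names are illustrative only.
final class GzipAwareReader {

    // Wraps the raw stream in a GZIPInputStream only when the
    // Content-Encoding value is "gzip"; a null header falls through.
    static InputStream wrapIfGzip(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the whole stream and decodes it as UTF-8 (an assumption;
    // a real page may declare a different charset).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Demo aid: gzip-compresses a string in memory to simulate a server body.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html>hello</html>";
        // Compressed body with the header set: decompressed on the fly.
        System.out.println(readAll(wrapIfGzip(new ByteArrayInputStream(gzip(html)), "gzip")));
        // Plain body with no header: read unchanged.
        byte[] plain = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(wrapIfGzip(new ByteArrayInputStream(plain), null)));
    }
}
```

With a real connection, the two arguments would be `connection.getInputStream()` and `connection.getHeaderField("Content-Encoding")`.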
G
16

Actually, this is pb2q's answer, but I am posting the full code for future readers:

    // Requires java.io.*, java.net.*, java.util.logging.* and java.util.zip.GZIPInputStream.
    try {
        URLConnection connection = new URL("http://jquery.org").openConnection();
        connection.setReadTimeout(10000);

        // The changed part: decompress only when the server says the body is gzipped.
        // "gzip".equalsIgnoreCase(...) is null-safe, so a missing header falls through.
        InputStream stream = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getHeaderField("Content-Encoding"))) {
            stream = new GZIPInputStream(stream);
        }
        // End

        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped streams) even on error.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                html.append(inputLine).append("\n");
            }
        }
        System.out.println(html);
        System.exit(0);
    } catch (IOException ex) {
        Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
    }
Gerius answered 19/6, 2012 at 1:23 Comment(1)
Worked for me. Just to add to this, the compressed form can be x-gzip as well. But thanks a lot. (Transatlantic)
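
To cover the x-gzip case mentioned in this comment, the header check could be factored into a predicate along these lines. This is a sketch under assumptions: the class `ContentEncodings` and the method `isGzip` are hypothetical names, and the check treats "x-gzip" as an alias of "gzip", compared case-insensitively.

```java
import java.util.Locale;

// Hypothetical helper; the names are illustrative only.
final class ContentEncodings {

    // Returns true when the Content-Encoding value names gzip,
    // accepting the "x-gzip" alias and ignoring case and padding.
    static boolean isGzip(String contentEncoding) {
        if (contentEncoding == null) {
            return false; // header absent: body is not gzip-encoded
        }
        String value = contentEncoding.trim().toLowerCase(Locale.ROOT);
        return value.equals("gzip") || value.equals("x-gzip");
    }

    public static void main(String[] args) {
        System.out.println(isGzip("gzip"));
        System.out.println(isGzip("X-GZIP"));
        System.out.println(isGzip("deflate"));
        System.out.println(isGzip(null));
    }
}
```

In the code above, the condition would then read `ContentEncodings.isGzip(connection.getHeaderField("Content-Encoding"))`.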
H
0

There are two cases to consider with a Content-Encoding: gzip header:

  1. If the data has already been compressed by the application and HTTP compression is also applied, the data is compressed a second time, so the client receives a double-compressed body.

  2. If the data has not been compressed by the application, HTTP compression compresses it (mostly with gzip), and it is decompressed automatically before it reaches the client. Automatic decompression is a default feature of most web browsers: the browser unzips the body when it finds a Content-Encoding: gzip header in the response.
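
Unlike a browser, the plain URLConnection code in the question does not decompress anything by itself, so a Java client has to undo each layer explicitly. For the double-compressed case, one possible approach is to keep decompressing while the body still starts with the gzip magic bytes (0x1f 0x8b). This is only a sketch: the class `GzipSniffer` and its methods are hypothetical names, the layer limit is an arbitrary safeguard, and the technique would also unwrap a payload that is legitimately a .gz file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; the names are illustrative only.
final class GzipSniffer {

    // A gzip stream always begins with the two magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    // Decompresses exactly one gzip layer.
    static byte[] gunzip(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Removes gzip layers until the body no longer looks gzipped
    // or maxLayers layers have been removed.
    static byte[] unwrapAll(byte[] body, int maxLayers) throws IOException {
        byte[] current = body;
        for (int i = 0; i < maxLayers && looksGzipped(current); i++) {
            current = gunzip(current);
        }
        return current;
    }

    // Demo aid: compresses bytes in memory.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] twice = gzip(gzip(plain)); // simulate a double-compressed body
        System.out.println(new String(unwrapAll(twice, 3), StandardCharsets.UTF_8));
    }
}
```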

Hefter answered 24/12, 2015 at 0:11 Comment(0)
