ZipInputStream(InputStream, Charset) decodes ZipEntry file name falsely

Java 7 is supposed to fix an old problem with unpacking zip archives whose file names use a character set other than UTF-8. This is done with the constructor ZipInputStream(InputStream, Charset). So far, so good. I can unpack a zip archive containing file names with umlauts in them when explicitly setting an ISO-8859-1 character set.

But here is the problem: When iterating over the stream using ZipInputStream.getNextEntry(), the entries have wrong special characters in their names. In my case the umlaut "ü" is replaced by a "?" character, which is obviously wrong. Does anybody know how to fix this? Obviously ZipEntry ignores the Charset of its underlying ZipInputStream. It looks like yet another zip-related JDK bug, but I might be doing something wrong as well.

...
zipStream = new ZipInputStream(
    new BufferedInputStream(new FileInputStream(archiveFile), BUFFER_SIZE),
    Charset.forName("ISO-8859-1")
);
while ((zipEntry = zipStream.getNextEntry()) != null) {
    // wrong name here, something like "M?nchen" instead of "München"
    System.out.println(zipEntry.getName());
    ...
}
Copyright answered 30/6, 2012 at 17:56 Comment(3)
What are best practices for Java SE6? (besides upgrading to SE7 :) - Teriann
For SE6: I tested setting the VM parameters zip.altEncoding or zip.encoding to Cp437 or ISO-8859-1; neither helped to read the names correctly. - Teriann
@basZero: Apache Commons Compress works nicely. I found no out-of-the-box solution though. - Copyright

I played around for two or so hours, but just five minutes after I finally posted the question here, I bumped into the answer: My zip file was not encoded with ISO-8859-1, but with Cp437. So the constructor call should be:

zipStream = new ZipInputStream(
    new BufferedInputStream(new FileInputStream(archiveFile), BUFFER_SIZE),
    Charset.forName("Cp437")
);

Now it works like a charm.
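
For anyone who wants to reproduce the effect without a real archive, here is a minimal sketch. The class name ZipNameCharsetDemo and its helper methods are made up for illustration and are not part of the original code. It writes a one-entry zip in memory with Cp437-encoded names, then reads the entry name back with both Cp437 and ISO-8859-1. It assumes the JDK in use ships the Cp437 (IBM437) charset, which a full JDK normally does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not taken from the question.
class ZipNameCharsetDemo {

    // Builds an in-memory zip with a single entry whose name is encoded with nameCharset.
    static byte[] writeZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes, nameCharset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("demo".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the name of the first entry, decoded with nameCharset (null if the zip is empty).
    static String firstEntryName(byte[] zip, Charset nameCharset) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip), nameCharset)) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is "ü"; the escape keeps the demo independent of the source file encoding.
        String name = "M\u00fcnchen.txt";
        byte[] zip = writeZip(name, Charset.forName("Cp437"));

        // Same charset for writing and reading: the name round-trips.
        System.out.println(name.equals(firstEntryName(zip, Charset.forName("Cp437"))));
        // Wrong charset for reading: the byte for "ü" is decoded as a different character.
        System.out.println(name.equals(firstEntryName(zip, StandardCharsets.ISO_8859_1)));
    }
}
```

The comparison is done with equals() rather than by printing the names, because the console encoding can garble the output on its own and hide what the zip code is actually doing.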

Copyright answered 30/6, 2012 at 18:11 Comment(5)
I think you can accept this answer as correct, even though you wrote it yourself, per this article: blog.stackoverflow.com/2011/07/… - Inesinescapable
I had the same problem, and it took me hours to solve. The solution was very simple: in my case, use the MS-DOS encoding cp852 instead of the Windows encoding cp1250. - Trigger
Yes, that is the very same problem and the same solution, just not for the English MS-DOS code page 437 but for the Central European code page 852. Of course the exact solution always depends on the environment and tool the ZIP archive in question was generated with. - Copyright
The Java behaviour is arguably non-conformant, as the spec seems quite clear that Cp437 is the default when the "Language encoding flag (EFS)" has not been set. "D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437.... D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT - Playpen
I upvoted your comment because the link is a very helpful resource. To be fair, Java does not claim to try and detect the encoding or even read the EFS but clearly documents that it uses UTF-8 as a default, which is understandable nowadays, especially because it is also the JAR file default. So in Java you have to know the encoding before calling the ZipInputStream constructor. Fair enough. What makes your comment so helpful is knowing that Cp437 is actually the default, so this should be one of the first encodings to try when there are any problems. - Copyright
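
As a rough illustration of the "general purpose bit 11" mentioned in the comments above, the following sketch looks at the first local file header of a zip and reports whether the EFS (UTF-8) flag is set. The class ZipEfsFlagDemo and its methods are hypothetical names for this example only. The sketch assumes the archive starts directly with a local file header (signature 0x04034b50, flags at byte offset 6, little-endian), which holds for archives written by ZipOutputStream but not necessarily for every zip file. Note that an unset flag only says "not declared as UTF-8"; it does not tell you which legacy code page the tool actually used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of the original thread.
class ZipEfsFlagDemo {

    private static final int LOC_SIGNATURE = 0x04034b50; // "PK\3\4" read as a little-endian int
    private static final int LOC_HEADER_SIZE = 30;       // fixed part of a local file header
    private static final int EFS_BIT = 1 << 11;          // general purpose bit 11

    // Returns true if the first local file header declares UTF-8 names (bit 11 set).
    static boolean firstEntryHasUtf8Flag(byte[] zip) {
        ByteBuffer header = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (zip.length < LOC_HEADER_SIZE || header.getInt(0) != LOC_SIGNATURE) {
            throw new IllegalArgumentException("data does not start with a local file header");
        }
        int flags = header.getShort(6) & 0xFFFF;
        return (flags & EFS_BIT) != 0;
    }

    // Builds a small in-memory zip whose entry name is encoded with nameCharset.
    static byte[] zipWithOneEntry(Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes, nameCharset)) {
            out.putNextEntry(new ZipEntry("M\u00fcnchen.txt")); // \u00fc is "ü"
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(firstEntryHasUtf8Flag(zipWithOneEntry(StandardCharsets.UTF_8)));
        System.out.println(firstEntryHasUtf8Flag(zipWithOneEntry(Charset.forName("Cp437"))));
    }
}
```

If the flag is unset and the names still look wrong, trying Cp437 first and then other DOS code pages such as Cp852 is the approach described in the comments above.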
