Java, unzip folder with German characters in filenames
Asked Answered
B

1

5

I'm trying to unzip folder that contains German characters in it, for example Aufhänge . I know that in Java 7, it is using utf-8 by default, and i think "ä" is one of the utf-8 characters. Here is my code snippet

public static void main(String[] args) throws IOException {
    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), StandardCharsets.UTF_8);
    ZipEntry zipEntry;
    while ((zipEntry = zipInputStream.getNextEntry()) != null) {
        System.out.println(zipEntry.getName());
    }
}

This is an error that I get: java.lang.IllegalArgumentException: MALFORMED

It works with Charset.forName("Cp437"), but it doesn't work with StandardCharsets.UTF_8

Brewis answered 28/3, 2019 at 9:18 Comment(7)
Can you please share the ZIP file if it does not contain any personal informationJacey
its very easy to reproduce, just create text file with name Aufhänge and zip itBrewis
I had tried with it but I am not getting an errorJacey
I am usign JAVA8 for compilationJacey
I'm using Java 8 too. Maybe problem can be in Operating System. I'm using windows 10. I think on linux filenames are encoded in different way.Brewis
I am also using windows 10Jacey
We have reproduced this problem with Java 8 and 7zip when there were many files with special characters in the zip, not just 1 file.Salpingotomy
D
18

You don't mention your operating system, nor how you created the zip file, but I managed to recreate your problem anyway, using 7-Zip on Windows 10:

  • Create a simple text file with some trivial content (e.g. nothing but the three characters "abc").
  • Save the file as D:\Temp\Aufhänge.txt. Note the umlaut in the file name.
  • Locate that file in Windows File Explorer.
  • Select the file and right click. From the context menu select 7-Zip > Add to "Aufhänge.zip" to create Aufhänge.zip.

Then, in NetBeans run the following code to unzip the file you just created:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class GermanZip {

    static String ZIP_PATH = "D:\\Temp\\Aufhänge.zip";

    public static void main(String[] args) throws FileNotFoundException, IOException {

        ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), Charset.forName("UTF-8"));
        ZipEntry zipEntry;
        while ((zipEntry = zipInputStream.getNextEntry()) != null) {
            System.out.println(zipEntry.getName());
        }
    }

}

As you pointed out, the code throws java.lang.IllegalArgumentException: MALFORMED when executing this statement: zipEntry = zipInputStream.getNextEntry()) != null.

The problem arises because by default 7-Zip encodes the names of the files within the zip file using Cp437, as noted in this comment from 7-Zip:

Default encoding is OEM (DOS) encoding. It's for compatibility with old zip software.

That's why the unzip works when using Charset.forName("Cp437") instead of Charset.forName("UTF-8").

If you want to unzip using Charset.forName("UTF-8") then you have to force 7-Zip to encode the filenames within the zip in UTF-8. To do this specify the cu parameter when running 7-Zip, as noted in the linked comment:

  • In Windows File Explorer select the file and right click.
  • From the context menu select 7-Zip > Add to Archive...".
  • In the Add to Archive dialog specify cu in the Parameters field:

    AddToArchive

  • Having stored the zipped filenames in UTF-8 format, you can then replace Charset.forName("Cp437") with Charset.forName("UTF-8") in your code, and no exception will be thrown when unzipping.

This answer is specific to Windows 10 and 7-Zip, but the general principle should apply in any environment: if specifying an encoding of UTF-8 for your ZipInputStream be certain that the filenames within the zip file really are encoded using UTF-8. You can easily verify this by opening the zip file in a binary editor and searching for the names of the zipped files.


Update based on OP's comment/question below:

  • Unfortunately the .ZIP File Format Specification does not currently provide a way to store the encoding used for zipped file names apart from one exception, as described in "APPENDIX D - Language Encoding (EFS)":

    D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment MUST support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).

  • So in your code, for each zipped file, first check whether bit 11 of the general purpose bit flag is set. If it is then you can be certain that the name of that zipped fie is encoded using UTF-8. Otherwise the encoding is whatever was used when the zipped file was created. That is Cp437 by default on Windows, but if you are running on Windows and processing a zip file created on Linux I don't think there is an easy way of determining the encoding(s) used.

  • Unfortunately ZipEntry does not provide a method to access the general purpose bit flag field of a zipped file, so you would need to process the zip file at the byte level to do that.
  • To add a further complication, "encoding" in this context relates to the encoding used for each zipped filename rather than for the zip file itself. One zipped file name could be encoded in UTF-8, another zipped file name could have been added using Cp437, etc.
Darnell answered 2/5, 2019 at 4:37 Comment(2)
Thank you, however, I'm writing client for general public, which means some zips will be encoded with Cp437 and some with other encodings. Should I just try one encoding, if wrong, catch exception and then try another one and so on? there is no better way?Brewis
@LukaGorgadze I've updated my answer. If the zip file was created on Windows then assume UTF-8 if bit 11 of the general purpose bit flag is set, as described in my update. Otherwise assume Cp437, but also catch any IllegalArgumentException, and research the encoding used to create that zipped file. Then enhance your code to cater for other encodings as needed. See How can I correctly decompress a ZIP archive of files with Hebrew names? for more information.Darnell

© 2022 - 2024 — McMap. All rights reserved.