Java utility library for Nested ZIP file handling

I am aware that Oracle documents the ZIP/GZIP compression and decompression APIs on their website. But I have a scenario where I need to scan an archive and find out whether any nested ZIPs/RARs are involved. For example, the following case:

-MyFiles.zip
   -MyNestedFiles.zip
        -MyMoreNestedFiles.zip
           -MoreProbably.zip
        -Other_non_zips
   -Other_non_zips
-Other_non_zips

I know that the Apache Commons Compress package and java.util.zip are the widely used packages, where Commons Compress caters for features missing in java.util.zip, e.g. some character-encoding settings when writing ZIPs. What I am not sure about is which utilities exist for recursing through nested ZIP files, and the answers provided on SO are not very good examples of doing this. I tried the following code (which I got from an Oracle blog), but as I suspected, the nested recursion fails because it simply cannot find the files:

public static void processZipFiles(String pathName) throws Exception{
        ZipInputStream zis  = null;
        InputStream  is = null;
        try {
          ZipFile zipFile = new ZipFile(new File(pathName));
          String nestPathPrefix = zipFile.getName().substring(0, zipFile.getName().length() -4);
          for(Enumeration e = zipFile.entries(); e.hasMoreElements();){
           ZipEntry ze = (ZipEntry)e.nextElement();
            if(ze.getName().contains(".zip")){
              is = zipFile.getInputStream(ze);
              zis = new ZipInputStream(is);
              ZipEntry zentry = zis.getNextEntry();

              while (zentry!=null){
                  System.out.println(zentry.getName());
                  zentry = zis.getNextEntry();
                  ZipFile nestFile = new ZipFile(nestPathPrefix+"\\"+zentry.getName());
                  if (zentry.getName().contains(".zip")) {
                      processZipFiles(nestPathPrefix+"\\"+zentry.getName());
                  }
              }
              is.close();
            }
          }
        } catch (FileNotFoundException e) {
          e.printStackTrace();
        } catch (IOException e) {
          e.printStackTrace();
        } finally{
            if(is != null)
                is.close();
            if(zis!=null)
                zis.close();
        }
    }  

Maybe I am doing something wrong, or using the wrong utilities. My objective is to identify whether any of the files, including those inside nested ZIP files, have file extensions that I do not allow. This is to make sure that I can prevent my users from uploading forbidden files even when they zip them. I also have the option to use Tika, which can do recursive parsing (using Jukka Zitting's solution), but I am not sure whether I can use the Metadata to do the detection the way I want.

Any help/suggestion is appreciated.

Conchoidal answered 11/2, 2016 at 10:34 Comment(1)
Shouldn't you be opening the nested ZIP from the input stream of the outer ZIP entry, rather than by filename (which won't work, as the file is inside the ZIP, not on the filesystem)? - Siusiubhan

Using Commons Compress would be easier, not least because it has sensible shared interfaces across the various decompressors, and it allows handling other archive formats (e.g. Tar) at the same time.

If you do want to use only the built-in Zip support, I'd suggest you do something like this:

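import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class NestedZipCheck {
   public static void main(String[] args) throws IOException {
      File file = new File("outermost.zip");
      try (FileInputStream input = new FileInputStream(file)) {
         check(input, file.toString());
      }
   }

   public static void check(InputStream compressedInput, String name) throws IOException {
      // Deliberately not closed: closing it would also close the enclosing stream
      ZipInputStream input = new ZipInputStream(compressedInput);
      ZipEntry entry;
      while ((entry = input.getNextEntry()) != null) {
         System.out.println("Found " + entry.getName() + " in " + name);
         if (entry.getName().endsWith(".zip")) { // TODO Better checking
            check(input, name + "/" + entry.getName()); // input now yields this entry's bytes
         }
      }
   }
}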

Your code fails because you're trying to open inner.zip within outer.zip as a local file, but it doesn't exist as a standalone file on the filesystem. The code above instead treats any entry whose name ends with .zip as another ZIP stream, read directly from the enclosing stream, and recurses into it.
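As a minimal sketch of how the same recursion could serve the stated goal (rejecting forbidden extensions at any nesting depth), something like the following might work. The class name `ForbiddenEntryScanner`, the `FORBIDDEN` deny-list, and the `zip` demo helper are all made up for illustration; it only uses java.util.zip, so it does not cover RAR or Tar.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ForbiddenEntryScanner {

    // Hypothetical deny-list; replace with your own upload policy.
    static final Set<String> FORBIDDEN = Set.of("exe", "bat", "sh");

    /** Returns the paths of all entries, at any nesting depth, with a forbidden extension. */
    public static List<String> scan(InputStream in, String name) throws IOException {
        List<String> hits = new ArrayList<>();
        scan(in, name, hits);
        return hits;
    }

    private static void scan(InputStream in, String name, List<String> hits) throws IOException {
        // Not closed here: closing would also close the enclosing stream.
        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = name + "/" + entry.getName();
            String lower = entry.getName().toLowerCase(Locale.ROOT);
            int dot = lower.lastIndexOf('.');
            // Only treat the dot as an extension separator if it is in the last path segment.
            String ext = dot > lower.lastIndexOf('/') ? lower.substring(dot + 1) : "";
            if (FORBIDDEN.contains(ext)) {
                hits.add(path);
            } else if (ext.equals("zip")) {
                // zis now yields the bytes of the nested archive, so recurse on it.
                scan(zis, path, hits);
            }
        }
    }

    /** Demo helper: builds an in-memory ZIP from parallel arrays of names and contents. */
    static byte[] zip(String[] names, byte[][] contents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zos.putNextEntry(new ZipEntry(names[i]));
                zos.write(contents[i]);
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zip(new String[] {"run.exe", "notes.txt"},
                           new byte[][] {{1, 2, 3}, {4, 5}});
        byte[] outer = zip(new String[] {"readme.txt", "inner.zip"},
                           new byte[][] {{6}, inner});
        // prints [outer.zip/inner.zip/run.exe]
        System.out.println(scan(new ByteArrayInputStream(outer), "outer.zip"));
    }
}
```

Collecting the hits in a list rather than printing them lets the caller reject the upload as soon as the list is non-empty. For untrusted uploads you would probably also want a cap on nesting depth and total bytes read, which this sketch leaves out.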

You probably want to use Commons Compress though, so you can handle archives with alternate filenames, other compression formats, etc.
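If you stay with the built-in classes, one way to approach the "better checking" TODO is to look at the first bytes of an entry instead of its name. This is only a sketch: `ZipSniffer` and `looksLikeZip` are names invented here, it needs Java 11+ for `readNBytes`, and it only recognises the local file header signature `PK\003\004` (an empty archive starts with a different signature and would be missed).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ZipSniffer {

    /**
     * Returns true if the stream starts with the ZIP local file header
     * signature "PK\003\004". The stream must support mark/reset and is
     * rewound to its original position before returning.
     */
    public static boolean looksLikeZip(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(4);
        byte[] header = in.readNBytes(4);
        in.reset();
        return header.length == 4
                && header[0] == 'P' && header[1] == 'K'
                && header[2] == 3 && header[3] == 4;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = {'P', 'K', 3, 4, 0, 0};
        byte[] text = {'h', 'e', 'l', 'l', 'o'};
        // prints true
        System.out.println(looksLikeZip(new BufferedInputStream(new ByteArrayInputStream(zipLike))));
        // prints false
        System.out.println(looksLikeZip(new ByteArrayInputStream(text)));
    }
}
```

ZipInputStream itself does not support mark/reset, so inside the recursion you would wrap the current entry in a BufferedInputStream, sniff it, and then hand that same buffered stream to the nested ZipInputStream.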

Siusiubhan answered 11/2, 2016 at 12:39 Comment(5)
It is a simple solution, but it doesn't recurse through .RAR files. I tried Tika, but it takes quite long to parse the metadata (possibly because it parses the whole thing). - Conchoidal
I can see that I can replace ZipInputStream with ZipArchiveInputStream, but which stream do I use for RAR/TAR? Should I be using ArchiveInputStream and ArchiveEntry all the way through? - Conchoidal
If you want to work with all formats using Commons Compress, use the general Archive classes. For a good example of doing that, see the Apache Tika packages parser source code. - Siusiubhan
@Gagravarr I think the issue is that .RAR has specific licence issues, so the JDK's built-in APIs don't support it (and for that matter, neither does Commons Compress), but Tika seems to support it through other means. It would be good to know which library it uses for RAR and whether that is part of the Apache Foundation. - Conchoidal
@ha9u63ar You can find the details in the Apache Tika Parsers pom file: it's com.github.junrar / junrar. - Siusiubhan
