Read Content from Files which are inside Zip file
V

6

94

I am trying to create a simple Java program that reads and extracts the content of the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all these files, and I am using Apache Tika for this purpose.

Can somebody help me achieve this? This is what I have tried so far, without success:

Code Snippet

public class SampleZipExtract {


    public static void main(String[] args) {

        List<String> tempString = new ArrayList<String>();
        StringBuffer sbf = new StringBuffer();

        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
        InputStream input;
        try {

          input = new FileInputStream(file);
          ZipInputStream zip = new ZipInputStream(input);
          ZipEntry entry = zip.getNextEntry();

          BodyContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();

          Parser parser = new AutoDetectParser();

          while (entry!= null){

                if(entry.getName().endsWith(".txt") || 
                           entry.getName().endsWith(".pdf")||
                           entry.getName().endsWith(".docx")){
              System.out.println("entry=" + entry.getName() + " " + entry.getSize());
                     parser.parse(input, textHandler, metadata, new ParseContext());
                     tempString.add(textHandler.toString());
                }
           }
           zip.close();
           input.close();

           for (String text : tempString) {
           System.out.println("Apache Tika - Converted input string : " + text);
           sbf.append(text);
           }
           System.out.println("Final text from all the three files " + sbf.toString());
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (SAXException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TikaException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Vincentvincenta answered 27/3, 2013 at 18:54 Comment(4)
Why not pass the zip file straight to Apache Tika? It'll then call the recursing parser you supply for each file in the zip, so you don't have to do anything special!Peggiepeggir
That's what I was wondering, but I couldn't find a good tutorial on how to do that. I am also a little worried about this - javamex.com/tutorials/compression/zip_problems.shtml, not sure if Tika addresses this issue.Vincentvincenta
Tika uses commons compress to get around a lot of those issuesPeggiepeggir
61 MB for Tika? 61 MB just to work with ZIP files, which can be done in ~10 lines?! My app with 15+ activities weighs less than 4 MB. I think it is disrespectful to users to make apps that big for trivial tasks.Strickle
Z
228

If you're wondering how to get the file content from each ZipEntry, it's actually quite simple. Here's some sample code:

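public static void main(String[] args) throws IOException {
    // try-with-resources closes the ZipFile when the block exits
    try (ZipFile zipFile = new ZipFile("C:/test.zip")) {
        Enumeration<? extends ZipEntry> entries = zipFile.entries();

        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            // each entry gets its own stream, closed after use
            try (InputStream stream = zipFile.getInputStream(entry)) {
                // read the entry's content from stream here
            }
        }
    }
}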

Once you have the InputStream you can read it however you want.
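As one illustration of "read it however you want", here is a minimal sketch that collects every entry's content as text. The class name `ZipEntryReader`, the method `readAll`, and the choice of UTF-8 are assumptions for the example, not part of the answer above. Decoding bytes as text only makes sense for plain-text entries; binary formats such as pdf or docx still need a parser (for example Tika) on top of the same stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: maps each entry name to its content decoded as UTF-8.
public class ZipEntryReader {

    public static Map<String, String> readAll(File zip) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        // try-with-resources closes the ZipFile and each entry stream
        try (ZipFile zipFile = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no content
                }
                try (InputStream stream = zipFile.getInputStream(entry)) {
                    byte[] bytes = readFully(stream);
                    contents.put(entry.getName(), new String(bytes, StandardCharsets.UTF_8));
                }
            }
        }
        return contents;
    }

    // Drains the stream into a byte array (works on Java 7+).
    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Calling `ZipEntryReader.readAll(new File("C:/test.zip"))` would return one map entry per file in the archive, in archive order.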

Zollie answered 27/3, 2013 at 19:5 Comment(5)
Don't forget to close the inputStream and the ZipFile to avoid resource leaks :).Charissecharita
zipFile.entries(); there is no entries function defined for the type zipFileMcclees
Is there a way to pass byte[] array to the constructor of ZipFile (content.getBytes())? if not how can we do this?Truesdale
@Truesdale I think the easiest way to do that is write the byte array into a new File, and give that File instance to the constructorZollie
Ultimate Solution +1Grenadine
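On the byte[] question in the comments: ZipFile has no constructor that takes a byte array, but writing a temporary file is not the only option. A sketch of an alternative, assuming the whole archive is already in memory, is to wrap the array in a ByteArrayInputStream and read it with ZipInputStream. The class and method names here are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads a zip archive directly from a byte array.
public class InMemoryZip {

    public static List<String> entryNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
                // zip.read(...) at this point would return this entry's bytes
            }
        }
        return names;
    }
}
```

The trade-off is that ZipInputStream reads entries sequentially, while ZipFile allows random access by entry name.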
E
61

As of Java 7, the NIO API provides a better and more generic way of accessing the contents of ZIP or JAR files. Actually, it is now a unified API which allows you to treat ZIP files exactly like normal files.

In order to extract all of the files contained inside of a ZIP file in this API, you'd do as shown below.

In Java 8

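private void extractAll(URI fromZip, Path toDirectory) throws IOException {
    // try-with-resources closes the zip file system and the walk stream
    try (FileSystem zipFs = FileSystems.newFileSystem(fromZip, Collections.emptyMap())) {
        for (Path root : zipFs.getRootDirectories()) {
            try (Stream<Path> paths = Files.walk(root)) {
                // a plain loop (not forEach) lets IOException propagate
                for (Path path : (Iterable<Path>) paths::iterator) {
                    // in a full implementation, you'd have to
                    // handle directories
                    if (Files.isRegularFile(path)) {
                        Files.copy(path, toDirectory.resolve(path.getFileName().toString()));
                    }
                }
            }
        }
    }
}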

In Java 7

private void extractAll(URI fromZip, final Path toDirectory) throws IOException {
    // try-with-resources closes the zip file system when done
    try (FileSystem zipFs = FileSystems.newFileSystem(fromZip, Collections.<String, Object>emptyMap())) {
        for (Path root : zipFs.getRootDirectories()) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) 
                        throws IOException {
                    // You can do anything you want with the path here
                    Files.copy(file, toDirectory.resolve(file.getFileName().toString()));
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) 
                        throws IOException {
                    // In a full implementation, you'd need to create each 
                    // sub-directory of the destination directory before 
                    // copying files into it
                    return super.preVisitDirectory(dir, attrs);
                }
            });
        }
    }
}
Erubescent answered 24/5, 2016 at 12:25 Comment(4)
This is both awesome and insane.Guilbert
FileSystem should be closed after the operation.Escolar
In the java 8 version, Files.walk(root) throws IOException which can't propagate through the lambda.Torse
Use try-with-resources!Capitulary
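Pulling the comments on this answer together, here is a sketch of a fuller version that closes the FileSystem, lets IOException propagate, and recreates sub-directories. It is an illustration, not the answer author's code: the class name `ZipFsExtractor` is made up, it takes a Path instead of a URI, and it opens the archive through the `newFileSystem(Path, ClassLoader)` overload. The check that each target stays inside the destination directory is an added precaution.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical helper: extracts a whole archive through the zip file system provider.
public class ZipFsExtractor {

    public static void extractAll(Path zipFile, Path toDirectory) throws IOException {
        Path base = toDirectory.toAbsolutePath().normalize();
        // The cast picks the (Path, ClassLoader) overload unambiguously on newer JDKs.
        try (FileSystem zipFs = FileSystems.newFileSystem(zipFile, (ClassLoader) null)) {
            for (Path root : zipFs.getRootDirectories()) {
                try (Stream<Path> paths = Files.walk(root)) {
                    // A plain loop instead of forEach, so IOException can propagate.
                    for (Path source : (Iterable<Path>) paths::iterator) {
                        // Go through a String because the two paths belong to different file systems.
                        Path target = base.resolve(root.relativize(source).toString()).normalize();
                        if (!target.startsWith(base)) {
                            throw new IOException("Entry escapes target directory: " + source);
                        }
                        if (Files.isDirectory(source)) {
                            Files.createDirectories(target);
                        } else {
                            Files.createDirectories(target.getParent());
                            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                }
            }
        }
    }
}
```

A call such as `ZipFsExtractor.extractAll(Paths.get("abc.zip"), Paths.get("out"))` would mirror the archive's directory layout under `out`.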
H
11

Because of the condition in the while loop, the loop might never end: entry is assigned once before the loop and never updated inside it.

while (entry != null) {
  // If entry never becomes null here, loop will never break.
}

Instead of the null check there, you can try this:

ZipEntry entry = null;
while ((entry = zip.getNextEntry()) != null) {
  // Rest of your code
}
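To show the corrected loop in context, here is a small self-contained sketch that reads the bytes of each matching entry. The class name `ZipStreamReader` and the method `readMatching` are assumptions for the example. It also illustrates two details worth checking in the question's code: the data is read from the ZipInputStream (the question passes the underlying FileInputStream, `input`, to the parser), and each entry gets a fresh buffer rather than sharing one across entries.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: collects the raw bytes of entries with the given extensions.
public class ZipStreamReader {

    public static Map<String, byte[]> readMatching(InputStream in, String... extensions)
            throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // Assigning inside the condition advances to the next entry on every pass.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !hasExtension(entry.getName(), extensions)) {
                    continue;
                }
                // Read from zip, not from the underlying stream; -1 marks the end of this entry.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                result.put(entry.getName(), buffer.toByteArray());
            }
        }
        return result;
    }

    private static boolean hasExtension(String name, String... extensions) {
        for (String ext : extensions) {
            if (name.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

For the question's archive, the call would look like `ZipStreamReader.readMatching(new FileInputStream(file), ".txt", ".pdf", ".docx")`.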
Hailstorm answered 27/3, 2013 at 19:0 Comment(2)
Can't we just use while (zip.getNextEntry() != null) ??Vasoinhibitor
@Vasoinhibitor hopefully you've tried this and realized that there wouldn't be a reference to the ZipEntry for use inside the while block. This would also work if you'd prefer: ZipEntry entry = zip.getNextEntry(); while (entry !=null) { /* do stuff */ entry = zip.getNextEntry(); }Erikerika
B
3

Here is sample code you can use to let Tika take care of container files for you: http://wiki.apache.org/tika/RecursiveMetadata

From what I can tell, the accepted solution will not work for cases where there are nested zip files. Tika, however, will take care of such situations as well.
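For comparison, this is roughly what handling nested zip files takes with only java.util.zip. It is not Tika's mechanism, just a sketch of the recursion under simple assumptions: nested archives are recognized by a `.zip` suffix, and nested entry names are joined with `!/`. The class name `NestedZipLister` is made up for the example, and there is no depth or size limit, which a real implementation would want.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists entries, descending into zip files stored inside the zip.
public class NestedZipLister {

    public static List<String> list(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            collect(zip, "", names);
        }
        return names;
    }

    private static void collect(ZipInputStream zip, String prefix, List<String> names)
            throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = prefix + entry.getName();
            names.add(name);
            boolean nested = !entry.isDirectory()
                    && entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
            if (nested) {
                // The inner stream reads the current entry's bytes. It is deliberately
                // not closed: closing it would close the outer archive as well.
                collect(new ZipInputStream(zip), name + "!/", names);
            }
        }
    }
}
```

With an archive containing `a.txt` and an `inner.zip` that holds `b.txt`, the sketch reports `a.txt`, `inner.zip`, and `inner.zip!/b.txt`.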

Bronny answered 24/12, 2013 at 22:38 Comment(0)
E
2

My way of achieving this is to create a class that wraps ZipInputStream and exposes only the stream of the current entry:

The wrapper class:

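public class ZippedFileInputStream extends InputStream {

    private ZipInputStream is;

    public ZippedFileInputStream(ZipInputStream is){
        this.is = is;
    }

    @Override
    public int read() throws IOException {
        return is.read();
    }

    // Delegate bulk reads too; otherwise InputStream falls back to one byte at a time
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return is.read(b, off, len);
    }

    @Override
    public void close() throws IOException {
        // closes only the current entry, not the underlying ZipInputStream
        is.closeEntry();
    }

}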

The use of it:

    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream("SomeFile.zip"));
    ZipEntry entry;

    while((entry = zipInputStream.getNextEntry())!= null) {

     ZippedFileInputStream archivedFileInputStream = new ZippedFileInputStream(zipInputStream);

     //... perform whatever logic you want here with ZippedFileInputStream 

     // note that this will only close the current entry stream and not the ZipInputStream
     archivedFileInputStream.close();

    }
    zipInputStream.close();

One advantage of this approach: InputStreams are often passed as arguments to methods that process them, and those methods tend to close the stream as soon as they are done with it. With the wrapper, that close only ends the current entry and leaves the ZipInputStream open for the next one.
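To make that advantage concrete, here is a small sketch with a consumer that closes whatever stream it receives. It uses a variant of the wrapper built on FilterInputStream (named `EntryStream` here, a made-up name) so that the block is self-contained; `firstLine` and `firstLines` are likewise hypothetical. Without the wrapper, the first call to `firstLine` would close the whole archive and the loop would fail on the next entry.

```java
import java.io.BufferedReader;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

public class EntryStreamDemo {

    // Same idea as the wrapper above: close() ends the entry, not the archive.
    static class EntryStream extends FilterInputStream {
        private final ZipInputStream zip;

        EntryStream(ZipInputStream zip) {
            super(zip);
            this.zip = zip;
        }

        @Override
        public void close() throws IOException {
            zip.closeEntry();
        }
    }

    // A typical consumer: try-with-resources closes the stream it was handed.
    static String firstLine(InputStream in) throws IOException {
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    // Returns the first line of every entry in the archive.
    public static List<String> firstLines(InputStream zipData) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            while (zip.getNextEntry() != null) {
                lines.add(firstLine(new EntryStream(zip)));
            }
        }
        return lines;
    }
}
```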

Emlin answered 19/2, 2016 at 16:46 Comment(0)
S
0

I did mine like this (JDK 15); remember to change the URL and the zip file name to your own.

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.io.*;
import java.util.*;
import java.nio.file.Paths;

class Main {
  public static void main(String[] args) throws MalformedURLException,FileNotFoundException,IOException{
    String url,kfile;
    Scanner getkw = new Scanner(System.in);
    System.out.println(" Please Paste Url ::");
    url = getkw.nextLine();
    System.out.println("Please enter name of file you want to save as :: ");
    kfile = getkw.nextLine();
    getkw.close();
    Main Dinit = new Main();
    System.out.println(Dinit.dloader(url, kfile));
    // try-with-resources closes the archive even if extraction fails
    try (ZipFile Vanilla = new ZipFile(new File("Vanilla.zip"))) {
      Enumeration<? extends ZipEntry> entries = Vanilla.entries();

      while(entries.hasMoreElements()){
        ZipEntry entry = entries.nextElement();
        if (entry.isDirectory()) continue;   // only files carry content
        File target = new File(entry.getName());
        if (target.getParentFile() != null) target.getParentFile().mkdirs();
        // copy this entry's own bytes (not the whole archive) into the target file
        try (InputStream stream = Vanilla.getInputStream(entry);
             FileOutputStream outter = new FileOutputStream(target)) {
          stream.transferTo(outter);
        }
      }
    }

  }
  private String dloader(String kurl, String fname)throws IOException{
    // stays "Good" unless one of the catch blocks below overwrites it
    String status = "Status: Good";
    // download the archive from the URL the user entered
    try (InputStream in = new URL(kurl).openStream();
         FileOutputStream out = new FileOutputStream(new File("Vanilla.zip"))) {         // Output File
      in.transferTo(out);
    } catch (MalformedURLException e) {
      status = "Status: MalformedURLException Occurred";
    }catch (IOException e) {
      status = "Status: IOException Occurred";
    }
    String path="\\tkwgter5834\\";
    extractor(fname,"tkwgter5834",path);

    return status;
  }
  private String extractor(String fname,String dir,String path){
    File folder = new File(dir);
    if(!folder.exists()){
      folder.mkdir();
    }
    return "";
  }
}
Storytelling answered 6/4, 2021 at 18:37 Comment(0)
