How to split a huge zip file into multiple volumes?
When I create a zip archive via java.util.zip.*, is there a way to split the resulting archive into multiple volumes?

Let's say my overall archive has a file size of 24 MB and I want to split it into 3 files with a limit of 10 MB per file.
Is there a zip API which has this feature? Or any other nice way to achieve this?

Thanks Thollsten

Gradey answered 28/10, 2008 at 16:42 Comment(0)
V
12

Check: http://saloon.javaranch.com/cgi-bin/ubb/ultimatebb.cgi?ubb=get_topic&f=38&t=004618

I am not aware of any public API that will help you do that. (Although if you do not want to do it programmatically, there are utilities like WinSplitter that will do it.)
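For completeness, what a generic file splitter does can be sketched in a few lines of plain Java. This is only an illustration under my own assumptions: the class name FileSplitter and the .001/.002 suffixes are made up, and the result is a set of raw byte slices that must be concatenated again before unzipping, not a multi-volume zip in the PKZIP sense.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts any file into fixed-size byte slices.
public class FileSplitter {

    /**
     * Splits source into sibling files named <name>.001, <name>.002, ...
     * each holding at most partSize bytes, and returns them in order.
     */
    public static List<Path> split(Path source, long partSize) throws IOException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[8192];
        long remaining = Files.size(source);
        try (InputStream in = Files.newInputStream(source)) {
            int partNumber = 1;
            while (remaining > 0) {
                Path part = source.resolveSibling(
                        String.format("%s.%03d", source.getFileName(), partNumber++));
                long toWrite = Math.min(partSize, remaining);
                try (OutputStream out = Files.newOutputStream(part)) {
                    while (toWrite > 0) {
                        // Never read past the end of the current slice
                        int read = in.read(buffer, 0, (int) Math.min(buffer.length, toWrite));
                        if (read == -1) {
                            throw new EOFException("Source shrank while splitting");
                        }
                        out.write(buffer, 0, read);
                        toWrite -= read;
                        remaining -= read;
                    }
                }
                parts.add(part);
            }
        }
        return parts;
    }
}
```

With the numbers from the question, FileSplitter.split(path, 10L * 1024 * 1024) on a 24 MB archive would give two 10 MB slices and one 4 MB slice.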

I have not tried it, but every ZipEntry used with ZipInputStream/ZipOutputStream has a compressed size, so you can get a rough estimate of the size of the zipped file while creating it. (Note that getCompressedSize() returns -1 when the size is not known yet.) If you need 2 MB zip files, you can stop writing to a file once the cumulative size of the entries reaches 1.9 MB, leaving 0.1 MB for the manifest file and other zip-specific elements. So, in a nutshell, you can write a wrapper over ZipOutputStream as follows:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ChunkedZippedOutputStream {

    private ZipOutputStream zipOutputStream;

    private final String path;
    private final String name;

    private long currentSize;
    private int currentChunkIndex;
    private final long MAX_FILE_SIZE = 16000000; // Whatever size you want
    private final String PART_POSTFIX = ".part.";
    private final String FILE_EXTENSION = ".zip";

    public ChunkedZippedOutputStream(String path, String name) throws FileNotFoundException {
        this.path = path;
        this.name = name;
        constructNewStream();
    }

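    public void addEntry(ZipEntry entry) throws IOException {
        // getCompressedSize() returns -1 when the size is not known; count that as 0
        long entrySize = Math.max(entry.getCompressedSize(), 0);
        // Roll over to a new part first, but never leave the current part empty
        if (currentSize > 0 && (currentSize + entrySize) > MAX_FILE_SIZE) {
            closeStream();
            constructNewStream();
        }
        // Always add the entry; the first version dropped the entry that triggered a new part
        currentSize += entrySize;
        zipOutputStream.putNextEntry(entry);
    }

    // Writes data for the entry most recently passed to addEntry
    public void write(byte[] buffer, int offset, int length) throws IOException {
        zipOutputStream.write(buffer, offset, length);
    }

    // Call once after the last entry to finish the final part
    public void closeStream() throws IOException {
        zipOutputStream.close();
    }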

    private void constructNewStream() throws FileNotFoundException {
        zipOutputStream = new ZipOutputStream(new FileOutputStream(new File(path, constructCurrentPartName())));
        currentChunkIndex++;
        currentSize = 0;
    }

    private String constructCurrentPartName() {
        // This will give names in the form of <file_name>.part.0.zip, <file_name>.part.1.zip, etc.
        return name + PART_POSTFIX + currentChunkIndex + FILE_EXTENSION;
    }
}

The above program is just a hint of the approach and not a final solution by any means.
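One variation on the same idea, offered as a rough sketch rather than a tested solution: instead of trusting the size stored in each ZipEntry, count the bytes that actually reach the output file and start a new part between entries once the count passes the limit. The names RollingZipWriter and CountingOutputStream are my own, and the limit is soft: a part can overshoot by up to one entry plus the central directory written on close.

```java
import java.io.Closeable;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical wrapper that produces several independent zip files.
public class RollingZipWriter implements Closeable {

    /** Counts the bytes that actually reach the underlying file. */
    private static final class CountingOutputStream extends FilterOutputStream {
        long count;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    private final Path directory;
    private final String baseName;
    private final long maxPartSize;
    private final List<Path> parts = new ArrayList<>();
    private CountingOutputStream counter;
    private ZipOutputStream zip;

    public RollingZipWriter(Path directory, String baseName, long maxPartSize) {
        this.directory = directory;
        this.baseName = baseName;
        this.maxPartSize = maxPartSize;
    }

    /** Adds one entry, starting a new part first if the current one is already full. */
    public void add(String entryName, byte[] data) throws IOException {
        if (zip != null && counter.count >= maxPartSize) {
            closePart();
        }
        if (zip == null) {
            openPart();
        }
        zip.putNextEntry(new ZipEntry(entryName));
        zip.write(data);
        zip.closeEntry(); // flushes the compressed bytes, so counter.count is up to date
    }

    public List<Path> getParts() {
        return parts;
    }

    private void openPart() throws IOException {
        // Same naming scheme as above: <name>.part.0.zip, <name>.part.1.zip, ...
        Path part = directory.resolve(baseName + ".part." + parts.size() + ".zip");
        counter = new CountingOutputStream(Files.newOutputStream(part));
        zip = new ZipOutputStream(counter);
        parts.add(part);
    }

    private void closePart() throws IOException {
        zip.close();
        zip = null;
    }

    @Override
    public void close() throws IOException {
        if (zip != null) {
            closePart();
        }
    }
}
```

Used inside a try-with-resources block, this closes the last part automatically. As with the class above, the result is a set of separate zip files, not one multi-volume archive.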

Virgiliovirgin answered 28/10, 2008 at 16:52 Comment(2)
This is just several separate zip files, right? It's not a single multi-volume zipfile. - Cheeseburger
We got this code working in the answers to this question: #11105389 - Hymnology
R
5

If the goal is to have the output be compatible with PKZIP and WinZip, I'm not aware of any open source libraries that do this. We had a similar requirement for one of our apps, and I wound up writing our own implementation (compatible with the zip standard). If I recall, the hardest thing for us was that we had to generate the individual files on the fly (the way that most zip utilities work is they create the big zip file, then go back and split it later, which is a lot easier to implement). It took about a day to write and 2 days to debug.

The zip standard explains what the file format has to look like. If you aren't afraid of rolling up your sleeves a bit, this is definitely doable. You do have to implement a zip file generator yourself, but you can use Java's Deflater class to generate the segment streams for the compressed data. You'll have to generate the file and section headers yourself, but they are just bytes - nothing too hard once you dive in.

Here's the zip specification - section K has the info you are looking for specifically, but you'll need to read A, B, C and F as well. If you are dealing with really big files (We were), you'll have to get into the Zip64 stuff as well - but for 24 MB, you are fine.
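To give a feel for the "they are just bytes" part, here is a minimal sketch of the two building blocks for a single entry: a raw deflate stream produced with Deflater, and the 30-byte local file header that precedes it. The class and method names are invented for this example, the timestamp is hard-coded, and the central directory, end-of-central-directory record and all split/spanning bookkeeping are left out, so this is nowhere near a complete generator.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Hypothetical helper showing the byte layout of one zip entry's header and data.
public class ZipRecordSketch {

    /** Compresses data as a raw deflate stream (no zlib wrapper), as zip entries use. */
    public static byte[] rawDeflate(byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = nowrap
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Builds the local file header for an entry whose sizes and CRC are known up front. */
    public static byte[] localFileHeader(String name, byte[] data, byte[] compressed) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data);
        ByteBuffer b = ByteBuffer.allocate(30 + nameBytes.length)
                .order(ByteOrder.LITTLE_ENDIAN);   // all zip fields are little-endian
        b.putInt(0x04034b50);                      // local file header signature ("PK\3\4")
        b.putShort((short) 20);                    // version needed to extract: 2.0
        b.putShort((short) 0x0800);                // flags: bit 11 set, name is UTF-8
        b.putShort((short) 8);                     // compression method: deflate
        b.putShort((short) 0);                     // last modified time (00:00:00)
        b.putShort((short) 0x0021);                // last modified date (1980-01-01)
        b.putInt((int) crc.getValue());            // CRC-32 of the uncompressed data
        b.putInt(compressed.length);               // compressed size
        b.putInt(data.length);                     // uncompressed size
        b.putShort((short) nameBytes.length);      // file name length
        b.putShort((short) 0);                     // extra field length
        b.put(nameBytes);
        return b.array();
    }
}
```

Writing the header followed by the compressed bytes gives one entry's worth of data; when an entry straddles two volumes, those bytes simply continue in the next file.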

If you want to dive in and try it - if you run into questions, post back and I'll see if I can provide some pointers.

Ribaudo answered 29/10, 2008 at 2:34 Comment(1)
I'm having problems with multi-volume zip files, specifically when a single file component is split across more than one disk file. In file.zx01 I have the file header and the first part of the compressed data, then in file.zx02 I have the rest of the compressed data. But I'm not able to reassemble the files for some reason, and I'm not sure why. Do you have any experience here? - Omdurman
J
1

For what it's worth, I like to use try-with-resources everywhere. If you are into that design pattern, then you will like this. It also solves the problem of empty parts when an entry is larger than the desired part size: in the worst case, you will have as many parts as entries.

In:

my-archive.zip

Out:

my-archive.part1of3.zip
my-archive.part2of3.zip
my-archive.part3of3.zip

Note: I'm using logging and Apache Commons FilenameUtils, but feel free to use what you have in your toolkit.

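import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.invoke.MethodHandles;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

import org.apache.commons.io.FilenameUtils;
// Assumption: the annotations come from JetBrains; swap in whichever @NotNull/@Nullable you use
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**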
 * Utility class to split a zip archive into parts (not volumes)
 * by attempting to fit as many entries into a single part before
 * creating a new part. If a part would otherwise be empty because
 * the next entry won't fit, it will be added anyway to avoid empty parts.
 *
 * @author Eric Draken, 2019
 */
public class Zip
{
    private static final int DEFAULT_BUFFER_SIZE = 1024 * 4;

    private static final String ZIP_PART_FORMAT = "%s.part%dof%d.zip";

    private static final String EXT = "zip";

    private static final Logger logger = LoggerFactory.getLogger( MethodHandles.lookup().lookupClass() );

    /**
     * Split a large archive into smaller parts
     *
     * @param zipFile             Source zip file to split (must end with .zip)
     * @param outZipFile          Destination zip file base path. The "part" number will be added automatically
     * @param approxPartSizeBytes Approximate part size
     * @throws IOException Exceptions on file access
     */
    public static void splitZipArchive(
        @NotNull final File zipFile,
        @NotNull final File outZipFile,
        final long approxPartSizeBytes ) throws IOException
    {
        String basename = FilenameUtils.getBaseName( outZipFile.getName() );
        Path basePath = outZipFile.getParentFile() != null ? // Check if this file has a parent folder
            outZipFile.getParentFile().toPath() :
            Paths.get( "" );
        String extension = FilenameUtils.getExtension( zipFile.getName() );
        if ( !extension.equals( EXT ) )
        {
            throw new IllegalArgumentException( "The archive to split must end with ." + EXT );
        }

        // Get a list of entries in the archive
        try ( ZipFile zf = new ZipFile( zipFile ) )
        {
            // Silliness check
            long minRequiredSize = zipFile.length() / 100;
            if ( minRequiredSize > approxPartSizeBytes )
            {
                throw new IllegalArgumentException(
                    "Please select a minimum part size over " + minRequiredSize + " bytes, " +
                        "otherwise there will be over 100 parts."
                );
            }

            // Loop over all the entries in the large archive
            // to calculate the number of parts required
            Enumeration<? extends ZipEntry> enumeration = zf.entries();
            long partSize = 0;
            long totalParts = 1;
            while ( enumeration.hasMoreElements() )
            {
                long nextSize = enumeration.nextElement().getCompressedSize();
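                // Mirror writeEntriesToPart: an oversized entry still goes into an empty part,
                // so only start a new part when the current one already holds something
                if ( partSize != 0 && partSize + nextSize > approxPartSizeBytes )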
                {
                    partSize = 0;
                    totalParts++;
                }
                partSize += nextSize;
            }

            // Silliness check: if there are more parts than there
            // are entries, then one entry will occupy one part by contract
            totalParts = Math.min( totalParts, zf.size() );

            logger.debug( "Split requires {} parts", totalParts );
            if ( totalParts == 1 )
            {
                // No splitting required. Copy file
                Path outFile = basePath.resolve(
                    String.format( ZIP_PART_FORMAT, basename, 1, 1 )
                );
                Files.copy( zipFile.toPath(), outFile );
                logger.debug( "Copied {} to {} (pass-though)", zipFile.toString(), outFile.toString() );
                return;
            }

            // Reset
            enumeration = zf.entries();

            // Split into parts
            int currPart = 1;
            ZipEntry overflowZipEntry = null;
            while ( overflowZipEntry != null || enumeration.hasMoreElements() )
            {
                Path outFilePart = basePath.resolve(
                    String.format( ZIP_PART_FORMAT, basename, currPart++, totalParts )
                );
                overflowZipEntry = writeEntriesToPart( overflowZipEntry, zf, outFilePart, enumeration, approxPartSizeBytes );
                logger.debug( "Wrote {}", outFilePart );
            }
        }
    }

    /**
     * Write entries to the outFilePart until it is full
     *
     * @param overflowZipEntry    ZipEntry that didn't fit in the last part, or null
     * @param inZipFile           The large archive to split
     * @param outFilePart         The part of the archive currently being worked on
     * @param enumeration         Enumeration of ZipEntries
     * @param approxPartSizeBytes Approximate part size
     * @return Overflow ZipEntry, or null
     * @throws IOException File access exceptions
     */
    private static ZipEntry writeEntriesToPart(
        @Nullable ZipEntry overflowZipEntry,
        @NotNull final ZipFile inZipFile,
        @NotNull final Path outFilePart,
        @NotNull final Enumeration<? extends ZipEntry> enumeration,
        final long approxPartSizeBytes
    ) throws IOException
    {
        try (
            ZipOutputStream zos =
                new ZipOutputStream( new FileOutputStream( outFilePart.toFile(), false ) )
        )
        {
            long partSize = 0;
            byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
            while ( overflowZipEntry != null || enumeration.hasMoreElements() )
            {
                ZipEntry entry = overflowZipEntry != null ? overflowZipEntry : enumeration.nextElement();
                overflowZipEntry = null;

                long entrySize = entry.getCompressedSize();
                if ( partSize + entrySize > approxPartSizeBytes )
                {
                    if ( partSize != 0 )
                    {
                        return entry;    // Finished this part, but return the dangling ZipEntry
                    }
                    // Add the entry anyway if the part would otherwise be empty
                }
                partSize += entrySize;
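                // Use a fresh entry: the source entry carries its recorded compressed size, and
                // closeEntry() can fail with a ZipException if recompression yields a different size
                ZipEntry outEntry = new ZipEntry( entry.getName() );
                if ( entry.getTime() != -1 )
                {
                    outEntry.setTime( entry.getTime() );
                }
                zos.putNextEntry( outEntry );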

                // Get the input stream for this entry and copy the entry
                try ( InputStream is = inZipFile.getInputStream( entry ) )
                {
                    int bytesRead;
                    while ( (bytesRead = is.read( buffer )) != -1 )
                    {
                        zos.write( buffer, 0, bytesRead );
                    }
                }
            }
            return null;    // Finished splitting
        }
    }
}
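The packing rule described in the Javadoc (fill a part until the next entry would not fit, but never leave a part empty) can also be checked on its own, without touching any files. The sketch below is a stand-alone illustration with invented names (PartPlanner, plan); it takes a list of compressed entry sizes and returns how they would be grouped, which is handy for predicting the "of N" count up front.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plans which entries go into which part, by size only.
public class PartPlanner {

    /**
     * Groups entry sizes into parts, in order. A part is closed when the next
     * entry would push it over the limit, unless the part is still empty.
     */
    public static List<List<Long>> plan(List<Long> entrySizes, long approxPartSizeBytes) {
        List<List<Long>> parts = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long partSize = 0;
        for (long size : entrySizes) {
            if (!current.isEmpty() && partSize + size > approxPartSizeBytes) {
                parts.add(current);
                current = new ArrayList<>();
                partSize = 0;
            }
            // An oversized entry lands here with an empty part and is added anyway
            current.add(size);
            partSize += size;
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }
}
```

For example, sizes 15, 3 and 3 with a limit of 10 are planned as two parts: the oversized entry alone, then the two small ones together.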
Jahnke answered 29/4, 2019 at 22:12 Comment(0)
P
0

The code below is my solution for splitting a zip file with a directory structure into chunks based on a desired size. I found the previous answers useful, so I wanted to contribute a similar but slightly neater approach. This code works for my specific needs, and I believe there is room for improvement.

import java.io.*;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ChunkedZip {
    private final static long MAX_FILE_SIZE = 1000 * 1000 * 1024; //  around 1GB 
    private final static String zipCopyDest = "C:\\zip2split\\copy";

    public static void splitZip(String zipFileName, String zippedPath, String coreId) throws IOException {

        System.out.println("process whole zip file..");
        FileInputStream fis = new FileInputStream(zippedPath);
        ZipInputStream zipInputStream = new ZipInputStream(fis);
        ZipEntry entry = null;
        int currentChunkIndex = 0;
        //using just to get the uncompressed size of the zipentries
        long entrySize = 0;
        ZipFile zipFile = new ZipFile(zippedPath);
        Enumeration enumeration = zipFile.entries();

        String copDest = zipCopyDest + "\\" + coreId + "_" + currentChunkIndex + ".zip";

        FileOutputStream fos = new FileOutputStream(new File(copDest));
        BufferedOutputStream bos = new BufferedOutputStream(fos);
        ZipOutputStream zos = new ZipOutputStream(bos);
        long currentSize = 0;

        try {
            while ((entry = zipInputStream.getNextEntry()) != null && enumeration.hasMoreElements()) {

                ZipEntry zipEntry = (ZipEntry) enumeration.nextElement();
                System.out.println(zipEntry.getName());
                System.out.println(zipEntry.getSize());
                entrySize = zipEntry.getSize();

                //long entrySize = entry.getCompressedSize();
                //entrySize = entry.getSize(); //gives -1

                // Start a new chunk first if this entry would push the current one over the limit
                // (but never leave a chunk empty)
                if (currentSize > 0 && (currentSize + entrySize) > MAX_FILE_SIZE) {
                    zos.close();
                    currentChunkIndex++;
                    zos = getOutputStream(currentChunkIndex, coreId);
                    currentSize = 0;
                }

                // Always copy the entry; previously the entry that triggered a new chunk was skipped
                currentSize += entrySize;
                zos.putNextEntry(new ZipEntry(entry.getName()));
                byte[] buffer = new byte[8192];
                int length;
                // Stream the data straight through instead of buffering the whole entry in memory
                while ((length = zipInputStream.read(buffer)) > 0) {
                    zos.write(buffer, 0, length);
                }
                zos.closeEntry();
                //zos.close();
            }
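        } finally {
            zos.close();
            // Also release the input side, which was left open before
            zipInputStream.close();
            zipFile.close();
        }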
    }

    public static ZipOutputStream getOutputStream(int i, String coreId) throws IOException {
        System.out.println("inside of getOutputStream()..");
        ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zipCopyDest + "\\" + coreId + "_" + i + ".zip"));
        // out.setLevel(Deflater.DEFAULT_COMPRESSION);
        return out;
    }

    public static void main(String args[]) throws IOException {
        String zipFileName = "Large_files_for_testing.zip";
        String zippedPath = "C:\\zip2split\\Large_files_for_testing.zip";
        String coreId = "Large_files_for_testing";
        splitZip(zipFileName, zippedPath, coreId);
    }
}
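The reason a second ZipFile is opened just for the sizes (and the reason for the "gives -1" comment) can be reproduced in isolation. The following sketch, with made-up class and method names, writes a small zip the way ZipOutputStream normally streams it and then asks both readers for the entry size. ZipInputStream only sees the local header, where the sizes are not filled in for streamed entries, while ZipFile reads the central directory at the end of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class comparing the two ways of reading entry metadata.
public class EntrySizeDemo {

    /** Builds a one-entry zip in memory, streaming the data without preset sizes. */
    public static byte[] buildZip(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(content);
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Size reported by ZipInputStream right after getNextEntry(), before reading data. */
    public static long sizeFromStream(byte[] zipBytes) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry = zin.getNextEntry();
            return entry.getSize();
        }
    }

    /** Size reported by ZipFile, which reads the central directory. */
    public static long sizeFromFile(byte[] zipBytes, String name) throws IOException {
        Path tmp = Files.createTempFile("entry-size-demo", ".zip");
        try {
            Files.write(tmp, zipBytes);
            try (ZipFile zf = new ZipFile(tmp.toFile())) {
                return zf.getEntry(name).getSize();
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the archive was produced by a tool that stores the sizes in the local headers, ZipInputStream can report them too, so the -1 depends on how the zip was written.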
Peccadillo answered 6/7, 2018 at 16:10 Comment(0)
U
0

Here's my solution:

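import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public abstract class ZipHelper {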

    public static NumberFormat formater = NumberFormat.getNumberInstance(new Locale("pt", "BR"));

    public static List<Path> zip(Collection<File> inputFiles, long maxSize) throws IOException {

        byte[] buffer = new byte[1024];
        int count = 0;
        long currentZipSize = maxSize;
        List<Path> response = new ArrayList<>();
        ZipOutputStream zip = null;
        for (File currentFile : inputFiles) {
            long nextFileSize = currentFile.length();
            long predictedZipSize = currentZipSize + nextFileSize;
            boolean needNewFile = predictedZipSize >= maxSize;
            System.out.println("[=] ZIP current (" + formater.format(currentZipSize) + ") + next file (" + formater.format(nextFileSize) + ") = predicted (" + formater.format(predictedZipSize) + ") > max (" + formater.format(maxSize) + ") ? " + needNewFile);
            if (needNewFile) {
                safeClose(zip);
                Path tmpFile = Files.createTempFile("teste-", (".part." + count++ + ".zip"));
                System.out.println("[#] Starting new file: " + tmpFile);
                zip = new ZipOutputStream(Files.newOutputStream(tmpFile));
                zip.setLevel(Deflater.BEST_COMPRESSION);
                response.add(tmpFile);
                currentZipSize = 0;
            }
            ZipEntry zipEntry = new ZipEntry(currentFile.getName());
            System.out.println("[<] Adding to ZIP: " + currentFile.getName());
            zip.putNextEntry(zipEntry);
            FileInputStream in = new FileInputStream(currentFile);
            zip.write(in.readAllBytes());
            zip.closeEntry();
            safeClose(in);
            long compressed = zipEntry.getCompressedSize();
            System.out.println("[=] Compressed current file: " + formater.format(compressed));
            currentZipSize += zipEntry.getCompressedSize();
        }
        safeClose(zip);
        return response;
    }

    public static void safeClose(Closeable... closeables) {
        if (closeables != null) {
            for (Closeable closeable : closeables) {
                if (closeable != null) {
                    try {
                        System.out.println("[X] Closing: (" + closeable.getClass() + ") - " + closeable);
                        closeable.close();
                    } catch (Throwable ex) {
                        System.err.println("[!] Error on close: " + closeable);
                        ex.printStackTrace();
                    }
                }
            }
        }
    }
}

And the console output:

[?] Files to process: [\data\teste\TestFile(1).pdf, \data\teste\TestFile(2).pdf, \data\teste\TestFile(3).pdf, \data\teste\TestFile(4).pdf, \data\teste\TestFile(5).pdf, \data\teste\TestFile(6).pdf, \data\teste\TestFile(7).pdf]
[=] ZIP current (3.145.728) + next file (1.014.332) = predicted (4.160.060) > max (3.145.728) ? true
[#] Starting new file: C:\Users\Cassio\AppData\Local\Temp\teste-3319961516431535912.part.0.zip
[<] Adding to ZIP: TestFile(1).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@3d99d22e
[=] Compressed current file: 940.422
[=] ZIP current (940.422) + next file (1.511.862) = predicted (2.452.284) > max (3.145.728) ? false
[<] Adding to ZIP: TestFile(2).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@49fc609f
[=] Compressed current file: 1.475.178
[=] ZIP current (2.415.600) + next file (2.439.287) = predicted (4.854.887) > max (3.145.728) ? true
[X] Closing: (class java.util.zip.ZipOutputStream) - java.util.zip.ZipOutputStream@cd2dae5
[#] Starting new file: C:\Users\Cassio\AppData\Local\Temp\teste-8849887746791381380.part.1.zip
[<] Adding to ZIP: TestFile(3).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@4973813a
[=] Compressed current file: 2.374.718
[=] ZIP current (2.374.718) + next file (2.385.447) = predicted (4.760.165) > max (3.145.728) ? true
[X] Closing: (class java.util.zip.ZipOutputStream) - java.util.zip.ZipOutputStream@6321e813
[#] Starting new file: C:\Users\Cassio\AppData\Local\Temp\teste-6305809161676875106.part.2.zip
[<] Adding to ZIP: TestFile(4).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@79be0360
[=] Compressed current file: 2.202.203
[=] ZIP current (2.202.203) + next file (292.918) = predicted (2.495.121) > max (3.145.728) ? false
[<] Adding to ZIP: TestFile(5).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@22a67b4
[=] Compressed current file: 230.491
[=] ZIP current (2.432.694) + next file (4.197.512) = predicted (6.630.206) > max (3.145.728) ? true
[X] Closing: (class java.util.zip.ZipOutputStream) - java.util.zip.ZipOutputStream@57855c9a
[#] Starting new file: C:\Users\Cassio\AppData\Local\Temp\teste-17160527941340008316.part.3.zip
[<] Adding to ZIP: TestFile(6).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@3b084709
[=] Compressed current file: 3.020.115
[=] ZIP current (3.020.115) + next file (1.556.237) = predicted (4.576.352) > max (3.145.728) ? true
[X] Closing: (class java.util.zip.ZipOutputStream) - java.util.zip.ZipOutputStream@3224f60b
[#] Starting new file: C:\Users\Cassio\AppData\Local\Temp\teste-14050058835776413808.part.4.zip
[<] Adding to ZIP: TestFile(7).pdf
[X] Closing: (class java.io.FileInputStream) - java.io.FileInputStream@63e2203c
[=] Compressed current file: 1.460.566
[X] Closing: (class java.util.zip.ZipOutputStream) - java.util.zip.ZipOutputStream@1efed156
[>] Generated ZIP files(s): [C:\Users\Cassio\AppData\Local\Temp\teste-3319961516431535912.part.0.zip, C:\Users\Cassio\AppData\Local\Temp\teste-8849887746791381380.part.1.zip, C:\Users\Cassio\AppData\Local\Temp\teste-6305809161676875106.part.2.zip, C:\Users\Cassio\AppData\Local\Temp\teste-17160527941340008316.part.3.zip, C:\Users\Cassio\AppData\Local\Temp\teste-14050058835776413808.part.4.zip]
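The "Compressed current file" lines in that output depend on one detail worth spelling out: ZipOutputStream fills in the compressed size on the ZipEntry object you handed it, but only once closeEntry() has run. A small stand-alone sketch (class and method names are mine, not from the code above) shows the value before and after:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo: when does ZipEntry.getCompressedSize() become available?
public class CompressedSizeDemo {

    /** Returns { compressed size before writing, compressed size after closeEntry() }. */
    public static long[] sizesBeforeAndAfter(byte[] content) throws IOException {
        ZipEntry entry = new ZipEntry("data.bin");
        long before = entry.getCompressedSize(); // nothing is known yet
        try (ZipOutputStream zip = new ZipOutputStream(new ByteArrayOutputStream())) {
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry(); // the stream records the final sizes on the entry here
        }
        long after = entry.getCompressedSize();
        return new long[] { before, after };
    }
}
```

This is why the size of a part can only be updated after each file has been written, and why the decision to start a new part has to rely on the uncompressed file length as an upper estimate.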
Ultramontanism answered 21/1, 2021 at 14:56 Comment(0)
