Multithreaded Unzipping In Java

So, I'm trying to do read-only access of a zip file in Java, decompressing in a multithreaded manner, because my standard simple single-threaded solution of ZipFile/ZipEntry, using the enumeration and input streams and what-not, takes about five full seconds just to decompress a 50-meg zipfile into memory, when that same file takes one second AT MOST for my disk to read without decompressing.

However, the entire Java zip library is synchronized to an incredibly obnoxious degree, no doubt because it's all abstracted for reading/writing/etc. in the same code instead of having nice, efficient, non-synchronized read-only code.

I've looked at third-party Java libraries, and all of them are either massive VFS libraries that are worse than using an elephant gun to shoot a fly, or else the only reason they have a performance benefit is they multithread to the extent that most of the threads are blocking on disk IO anyway.

All I want to do is pull a zipfile into a byte[], fork some threads, and work on it. There's no reason any synchronization would be needed in any way for anything, because I use each of the unzipped files separately in memory, with no interaction between them.

Why must this be so difficult?

Handcart answered 21/12, 2013 at 10:17 Comment(10)
Do you buffer your input and output?Cassiani
What do you mean by "pull a zipfile into a byte[]"? Do you mean you read it all into a byte array? If so, is it the compressed format or the decompressed format? And what would you do with files that are just too large to hold everything in a byte array? I think that if you don't put it all into RAM, you will be back to the problem of a single file being used by multiple threads, which will usually not give you any advantage over a single thread.Athletic
None of these files are going to be huge: They will all fit into memory with oodles of room to spare without issue. But the random-access nature of working with the zip file is killing performance, whereas reading the entire file at once and then working with it in memory would be so much better, because hard drives are amazing at optimizing reads like that.Handcart
From looking at the code in the JDK source archive, this would work perfectly if I were to simply delete the "synchronized" keyword throughout the native library. The fact that there is no non-synchronized version of this to be used in read-only mode is annoying.Handcart
Have you tried simply reading and inflating the file into RAM without using ZipFile - just use a ZipInputStream?Persevering
@BoristheSpider But that would work only for single-file ZIPs, wouldn't it?Allophane
Reading and inflating the file into RAM is only a partial improvement: It speeds up the disk IO, but it doesn't allow me to multithread the decompression. I can't multithread at all until I have pulled the files out of the zip in decompressed format.Handcart
@Handcart I still don't see much point in decompressing using multiple threads, as the major bottleneck here isn't the CPU but the disk itself. Maybe if you had a disk that is faster than the CPU it would make sense, but hardware these days does not work that way...Athletic
Decompression is CPU-intensive, and the way it works now, it has to be done completely in lockstep with the disk IO. Think of it this way: Disk IO for zipfile takes M seconds; single-threaded decompression for zipfile takes N seconds, total time M+N. If I could multithread the decompression over X multiple threads, the time becomes M+N/X. Do you see now?Handcart
@Handcart Sure. It also depends on the compression technique and how many files are inside (especially in case you actually decompress the data into files). I think that usually M is larger than N, and usually by a large factor. Anyway, I think whichever solution you find, it can't be a global solution for all file types, as each of them works in a different way, and some might not even allow such a thing due to the nature of their compression technique. It's a good thing to optimize in this direction too, of course.Athletic

The fastest way to achieve this with Java is to use NIO. You can directly map the file into memory by using a MappedByteBuffer.

// Requires imports: java.nio.MappedByteBuffer, java.nio.channels.FileChannel,
// java.nio.file.Paths, java.nio.file.StandardOpenOption.
FileChannel channel = FileChannel.open(Paths.get("/path/to/zip"),
    StandardOpenOption.READ);
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

Now buffer contains a memory mapped region of your whole file. You can do whatever you want with it, for example pass an offset and a length to a thread. I don't know which zip lib supports that, but apparently you have something like that already.
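
For example, a small adapter class (a rough sketch, not something in the JDK; the class name is made up and error handling is omitted) can expose the buffer, or a duplicate() of it per thread, to ZipInputStream:

import java.io.InputStream;
import java.nio.ByteBuffer;

// Minimal read-only InputStream view over a ByteBuffer (hypothetical helper, not part of the JDK).
class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;

    ByteBufferInputStream(ByteBuffer buf) { this.buf = buf; }

    @Override public int read() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    @Override public int read(byte[] dst, int off, int len) {
        if (len == 0) return 0;
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n);
        return n;
    }
}

// Usage: each thread gets its own independent read position via duplicate().
// ZipInputStream zin = new ZipInputStream(new ByteBufferInputStream(buffer.duplicate()));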

FYI, I tested a bit with a 50 MB single-file archive and it took less than 200 ms on average to read it with a usual ZipInputStream; I think that you are trying to optimize pretty much nothing here.

Inceptive answered 21/12, 2013 at 12:19 Comment(7)
@Handcart You can also layer a ZipInputStream over the buffer by using a bit more code that bridges the buffer to the input stream. See: https://mcmap.net/q/322376/-wrapping-a-bytebuffer-with-an-inputstreamInceptive
But what good can that bring? A memory-mapped file's content is still on disk until accessed, and the difference between a low-level API to directly access the RAM pages of the disk cache, as opposed to going through the regular file APIs to reach those same pages, is quite insignificant. Wrapping it in a sequential InputStream disposes of any advantage that may have been there with the random-access file.Allophane
As for native ZIP APIs, surely they don't work on Java arrays.Allophane
@MarkoTopolnik The main advantage is that you don't need another copy into a Java-internal byte array. Using a BufferedInputStream under the ZipInputStream takes a bit longer (200 ms vs. 270 ms; 600 ms unbuffered). For his multithreading plans, yes, obviously the sequential InputStream is useless. BTW I still don't know what kind of parallelism he wants: a thread per file?Inceptive
I'm comparing straightforward file I/O (for example, a RandomAccessFile) with a memory-mapped file. The difference there is only in skipping some system call overhead. Yes, apparently he wants to parallelize decompression. Optimally, one thread would be sequentially reading/parsing the archive file into individual compressed file chunks and submitting each chunk to an ExecutorService to exploit the concurrency.Allophane
Yes, that is the ideal split: one thread breaking the zip up by file and blocking on disk IO, then worker threads, one per file, decompressing and processing each file separately.Handcart
So you say there's no 3rd-party API which would have more granularity and expose the decompression step as a separate concern? That may in fact be an indication that this really doesn't bring any tangible benefits.Allophane

Just for posterity's sake, after some testing back and forth, the answer I finally ended up using goes as follows (each complete iteration starts from scratch, with files closed, inside a while (true) loop; a rough code sketch follows the list):

  • Use DataInputStream.readFully to pull the entire (50 meg, in this case) zip file into a byte[].

  • Spawn worker threads (one per physical CPU core, 4 in my case) which each take that byte[] and create a ZipInputStream(ByteArrayInputStream). The first worker skips 0 entries, the second skips 1, the third skips 2, etc., so they're all offset from each other by one. The worker threads do not synchronize at all, so they all have their own local copies of the zip file's metadata and what-not. This is thread-safe because the zip file is read-only and the workers are not sharing decompressed data.

  • Each worker thread reads an entry and processes it, and then skips enough entries so that they all again are offset by one. So the first thread reads entries 0,4,8..., the second reads 1,5,9..., and so forth.

  • All the workers are pulled back in with .join().
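
In rough outline, the code looks something like the following sketch. The path and the per-entry processing are placeholders, not my actual code:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ParallelUnzip {

    static final int THREADS = Runtime.getRuntime().availableProcessors();

    public static void main(String[] args) throws Exception {
        // Step 1: one sequential read pulls the whole zip file into memory.
        File file = new File("/path/to/archive.zip"); // placeholder path
        byte[] zipBytes = new byte[(int) file.length()];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(zipBytes);
        }

        // Step 2: one worker per core; worker i handles entries i, i + THREADS, i + 2*THREADS, ...
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int offset = i;
            workers[i] = new Thread(() -> {
                try (ZipInputStream zin =
                         new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
                    byte[] buf = new byte[8192];
                    ZipEntry entry;
                    int index = 0;
                    while ((entry = zin.getNextEntry()) != null) {
                        if (index++ % THREADS != offset) continue; // another worker's entry
                        int n;
                        while ((n = zin.read(buf)) != -1) {
                            // process(entry.getName(), buf, n) -- placeholder for the real per-file work
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            workers[i].start();
        }

        // Step 3: pull all the workers back in.
        for (Thread t : workers) {
            t.join();
        }
    }
}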

My times were as follows:

  • Reading the zip file into the byte[] with no unzipping at all (just the IO) gives an average of 0.1 sec for every iteration.

  • Using straight ZipFile directly on the underlying file as normal yields an initial spike of 0.5 sec, followed by an average of 0.26 sec for each iteration thereafter (starting fresh after closing the previous ZipFile).

  • Reading the zip file into a byte[] and creating a ZipInputStream(ByteArrayInputStream) from it, with no multithreading at all, results in an initial spike of 0.3 sec, followed by an average of 0.26 sec for each iteration thereafter, showing that disk caching was having an effect that made the random-access and read-everything-first approaches equivalent.

  • Reading the zip file into a byte[], spawning 4 worker threads with that byte[] as described above, and waiting for them to finish brought the time back down to an average of 0.1 sec for every iteration.

So the verdict is that, by this method, I have successfully brought the processing of a moderately-sized zipfile on a moderately-powerful computer down to the time it takes to simply physically read the file, with the additional decompression step no longer noticeable at all. Obviously this same method on a huge zip file with tens of thousands of entries would still yield a massive speedup.

Seems I wasn't trying to optimize away nothing, considering I reduced the processing time for my sample file (which is around the size of the biggest one I'll need to work with) to 38% of that of the simple single-threaded method.

Considering how incredibly well this hack-job worked, imagine the possible speedup with a native Java zip-reader class actually designed to do this without the synchronization built in.

Handcart answered 22/12, 2013 at 1:28 Comment(0)

As you noticed, all the methods in ZipFile are synchronized. But that only stops multiple threads from running at the same time within a single ZipFile instance; it does not stop threads that use different ZipFile instances, even when those instances are opened for the same exact zipfile on disk.

If you want multiple threads to read from the same zipfile in a scalable way, you must open one ZipFile instance per thread. That way, the per-instance lock in the ZipFile methods (now effectively a per-thread lock) does not block all but one thread from reading from the zipfile at a time. It also means that when each thread closes its ZipFile after it's done reading, it closes its own instance, not a shared one, so you don't get an exception on the second and subsequent close.

Protip: if you really care about speed, you can get more performance by reading all the ZipEntry objects from the first ZipFile instance and sharing them with all threads, to avoid duplicating the work of reading the ZipEntry objects for each thread separately. A ZipEntry object is not tied to a specific ZipFile instance per se; ZipEntry just records metadata that will work with any ZipFile object representing the same zipfile that the ZipEntry came from.
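
A rough sketch of that pattern follows. The file path and the per-entry processing are placeholders, and handing out entries round-robin is just one possible way to split the work:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class PerThreadZipFile {

    public static void main(String[] args) throws Exception {
        final String path = "/path/to/archive.zip"; // placeholder path

        // Read the entry list once and share it; ZipEntry objects are plain metadata.
        final List<ZipEntry> entries = new ArrayList<>();
        try (ZipFile zf = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> en = zf.entries();
            while (en.hasMoreElements()) {
                entries.add(en.nextElement());
            }
        }

        int threads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            final int step = threads;
            workers[t] = new Thread(() -> {
                // Each thread opens its own ZipFile, so its internal lock never blocks another thread.
                try (ZipFile zf = new ZipFile(path)) {
                    for (int i = offset; i < entries.size(); i += step) {
                        try (InputStream in = zf.getInputStream(entries.get(i))) {
                            byte[] buf = new byte[8192];
                            while (in.read(buf) != -1) {
                                // process the decompressed bytes (placeholder)
                            }
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}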

Grimbald answered 9/7, 2018 at 22:59 Comment(0)
