Is it safe to call ICsharpCode.SharpZipLib in parallel on multiple threads

T

3

5

We are currently using for compression the GZipOutputStream class of ICsharpCode.SharpZipLib library. We do it from a single thread.

I want to split my input data stream into chunks and compress them in parallel. I'm worried though that this library may have some statics inside which will be overwritten from multiple threads and therefore corrupt the resulting stream.

Any thoughts will be appreciated.

Tolerant answered 5/5, 2011 at 16:37 Comment(3)

Why would you want to do that ? What are you doing/planning to do with the resulting stream ? – Thuthucydides 5/5, 2011 at 19:43

I suppose he'd want to do it to improve the overall throughput of the compression, since it's a very CPU intensive operation. – Payload 6/5, 2011 at 17:58

To Martin: Each chunk with its order number will be stored on hard drive, most likely, in a database, may also be concatenated into a file with index. When we read these chunks, we decompress them individually and then concatenate them in the proper order to get the original stream. – Tolerant 7/5, 2011 at 13:21

P

0

It is standard practice to make sure all static members are thread-safe when coding classes. So I would think it is very unlikely that you would have a problem due to that issue. Of course, if you plan on using the same GZipOutputStream from different threads then that would definitely be problematic since instance members of that class are not thread-safe.

What you might be able to do is to create a thread-safe middleman Stream class (think decorator pattern) and pass that to the GZipOutputStream. This custom stream class, call it ThreadSafeStream, would itself accept a Stream instance and would use the appropriate mechanisms to synchronize access to it.

You will create one GZipOutputStream instance for each thread and they will all share the same ThreadSafeStream wrapper instance. I suspect there will probably be a lot of bottlenecking in the ThreadSafeStream methods, but you should be able to gain some parallelism from this.

Phonemic answered 5/5, 2011 at 18:42 Comment(3)

Microsoft has this standard phrase in its documentation (they put it even on the classes without publc static members): "Any public static members of this type are thread safe. Any instance members are not guaranteed to be thread safe." It beats me. It it is too ambiguous to my mind. – Tolerant 7/5, 2011 at 13:50

Brian: what do you mean under "make sure all static members are thread-safe when coding classes"? That all writes (and reads) from non-static methods to/from static fields are done under a lock(){} - it is pretty big endeavor. Most reasonably we can hope that all static members are initialized to the same value by all objects of class which we will instantiate, and never changed after that - only read. – Tolerant 7/5, 2011 at 14:4

@Denis: That is exactly what it means. No matter how difficult it may seem you pretty much have to make all static field access thread-safe. Otherwise, the entire class could not be used at all in multithreaded environments like ASP.NET for example. Normally it's not so bad because often times the static fields are initialized once and never written to again so you really don't need to lock in those cases. – Phonemic 7/5, 2011 at 20:5

P

11

This is a really interesting question. Compression is highly CPU intensive, relying on lots of searching and comparisons. So it's very appropriate to want to parallelize it, when you've got multiple CPUs with unimpeded memory access.

There is a class called ParallelDeflateOutputStream within the DotNetZip library that does what you are describing. The class is documented here.

It can be used only for compression - no decompression. Also it is strictly an output stream - you cannot read in order to compress. Considering these constraints, it is basically a DeflateOutputStream, that internally uses multiple threads.

The way it works: It breaks up the incoming stream into chunks, then drops off each chunk into a separate worker thread to be compressed individually. Then it merges all those compressed streams back into one ordered stream at the end.

Suppose the "chunk" size maintained by the stream is N bytes. As the caller invokes Write(), data is buffered into a bucket or chunk. Inside the Stream.Write() method, when the first "bucket" is full, it calls ThreadPool.QueueUserWorkItem, allocating the bucket to the workitem. Subsequent writes into the stream begin filling the next bucket, and when that is full, Stream.Write() calls QUWI again. Each worker thread compresses its bucket, using a "Flush Type" of Sync (see the deflate spec), and then marks its compressed blob ready for output. This various outputs are then re-ordered (because chunk n does not necessarily get compressed before chunk n+1), and written to the captive output stream. As each bucket is written, it is marked empty, ready to be re-filled by the next Stream.Write(). Each chunk must be compressed with the flush type of Sync in order to allow their re-combination via simple concatenation, for the combined bytestream to be a legal DEFLATE Stream. The final chunk needs Flush type = Finish.

The design of this stream means that callers don't need to write with multiple threads. Callers just create the stream as normal, like the vanilla DeflateStream used for output, and write into it. The stream object uses multiple threads, but your code doesn't interface directly with them. The code for a "user" of the ParallelDeflateOutputStream looks like this:

using (FileStream raw = new FileStream(CompressedFile, FileMode.Create))
{
    using (FileStream input = File.OpenRead(FileToCompress))
    {
        using (var compressor = new Ionic.Zlib.ParallelDeflateOutputStream(raw))
        {
            // could tweak params of parallel deflater here
            int n;
            var buffer = new byte[8192];
            while ((n = input.Read(buffer, 0, buffer.Length)) != 0)
            {
                compressor.Write(buffer, 0, n);
            }                    
        }
    }
}

It was designed for use within the DotNetZip ZipFile class, but it is quite usable as a standalone compressing output stream. The resulting stream can be de-DELFATED (inflated?) with any inflater. The result is fully compliant to the spec.

The stream is tweakable. You can set the size of the buffers it uses, and the level of parallelism. It doesn't create buckets without bound, because for large streams (gb scale and so on) that would cause out of memory condiitons. So there's a fixed limit to the number of buckets, and therefore the degree of parallelism, that can be supported.

On my dual-core machine, this stream class nearly doubled the compression speed of large (100mb and larger) files, when compared to the standard DeflateStream. I don't have any larger multi-core machines so I couldn't test it further. The tradeoff is that the parallel implementation uses more CPU and more memory, and also compresses slightly less efficiently (1% less for large files) because of the sync framing I described above. The performance advantage will vary depending on the I/O throughput on your output stream, and whether the storage can keep up with the parallel compressor threads.

Caveat:
It is a DEFLATE stream, not GZIP. For the differences, read RFC 1951 (DEFLATE) and RFC 1952 (GZIP).

But if you really NEED gzip, the source for this stream is available, so you can view it and maybe get some ideas for yourself. GZIP is really just a wrapper on top of DEFLATE, with some additional metadata (like Adler checksum, and so on - see the spec). It seems to me that it would not be very difficult to build a ParallelGzipOutputStream, but it may not be trivial, either.

The trickiest part for me was getting the semantics of Flush() and Close() to work properly.

EDIT

Just for fun, I built a ParallelGZipOutputStream, that basically does what I described above, for GZip. It uses .NET 4.0's Tasks in lieu of QUWI to handle the parallel compression. I tested it just now on a 100mb text file generated via a Markov Chain engine. I compared the results of that class against some other options. Here's what it looks like:

uncompressed: 104857600
running 2 cycles, 6 Flavors

System.IO.Compression.GZipStream:  .NET 2.0 builtin
  compressed: 47550941
  ratio     : 54.65%
  Elapsed   : 19.22s

ICSharpCode.SharpZipLib.GZip.GZipOutputStream:  0.86.0.518
  compressed: 37894303
  ratio     : 63.86%
  Elapsed   : 36.43s

Ionic.Zlib.GZipStream:  DotNetZip v1.9.1.5, CompLevel=Default
  compressed: 37896198
  ratio     : 63.86%
  Elapsed   : 39.12s

Ionic.Zlib.GZipStream:  DotNetZip v1.9.1.5, CompLevel=BestSpeed
  compressed: 47204891
  ratio     : 54.98%
  Elapsed   : 15.19s

Ionic.Exploration.ParallelGZipOutputStream: DotNetZip v1.9.1.5, CompLevel=Default
  compressed: 39524723
  ratio     : 62.31%
  Elapsed   : 20.98s

Ionic.Exploration.ParallelGZipOutputStream:DotNetZip v1.9.1.5, CompLevel=BestSpeed
  compressed: 47937903
  ratio     : 54.28%
  Elapsed   : 9.42s

Conclusions:

The GZipStream that's builtin to .NET is pretty fast. It's also not very efficient, and it is not tunable.
The "BestSpeed" on the vanilla (non-parallelized) GZipStream in DotNetZip is about 20% faster than the .NET builtin stream, and gives about the same compression.
Using multiple Tasks for compression can cut about 45% off the time required on my dual-core laptop (3gb RAM), comparing the vanilla DotNetZip GZipStream to the parallel one. I suppose the time savings would be higher for machines with more cores.
There is a cost to parallel GZIP - the framing increases the size of the compressed file by about 4%. This won't change with the number of cores used.

The resulting .gz file can be decompressed by any GZIP tool.

Payload answered 6/5, 2011 at 17:10 Comment(3)

+1 Very interesting info. I should consider using Ionic for future projects. But for now we do not want to switch libraries. – Tolerant 10/5, 2011 at 17:36

Are you thinking about pulling this into the mainline code for DotNetZip? It seems useful. – Resignation 3/1, 2013 at 16:49

@ErickT - yes, that parallel GZip thing is on the short list for inclusion into a future release! – Payload 3/1, 2013 at 19:30

C

1

My understanding is that the zip is writing to (or reading from) a single underlying stream; so my assumption would be a resounding no; this cannot be thread safe if you are talking about a single underlying stream.

However; separate instances talking to separate underlying streams should be fine; and indeed it is usually easier to run separate (unrelated) tasks in parallel than it is to parallelise a single task.

Candlemaker answered 5/5, 2011 at 16:47 Comment(1)

Yes, I was not talking about writing from several threads into one output stream. :-) – Tolerant 7/5, 2011 at 13:36

P

0