Any seekable compression library?

I'm looking for a general-purpose compression library that supports random access during decompression. I want to compress Wikipedia into a single compressed archive, and at the same time be able to decompress/extract individual articles from it.

Of course, I could compress each article individually, but that won't give a good compression ratio. I've heard that an LZO-compressed file consists of many chunks which can be decompressed separately, but I haven't found the API or documentation for that. I could also use the Z_FULL_FLUSH mode in zlib, but is there a better alternative?
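
To make the Z_FULL_FLUSH idea concrete, here is a rough sketch of what I mean, using zlib's plain C API (error handling omitted; the helper name and the offset index are made up for illustration). Each article ends at a full-flush point, which byte-aligns the output and resets the dictionary, and the compressed offset of each article is recorded so decompression can restart there:

/* Sketch: chunked compression with zlib full-flush points (raw deflate). */
#include <stdio.h>
#include <zlib.h>

/* Hypothetical helper: compress `count` articles into `out`, storing the
 * compressed offset of each article's first byte in `offsets`. */
static int compress_articles(FILE *out, const char **articles,
                             const size_t *lengths, size_t count,
                             long *offsets)
{
    z_stream zs = {0};
    unsigned char buf[65536];

    /* Raw deflate (negative windowBits): no zlib header, so a fresh raw
     * inflater can start at any full-flush boundary. */
    if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    for (size_t i = 0; i < count; i++) {
        offsets[i] = ftell(out);            /* where article i starts */
        zs.next_in  = (Bytef *)articles[i];
        zs.avail_in = (uInt)lengths[i];
        do {                                /* flush this article completely */
            zs.next_out  = buf;
            zs.avail_out = sizeof buf;
            deflate(&zs, Z_FULL_FLUSH);     /* dictionary reset at boundary */
            fwrite(buf, 1, sizeof buf - zs.avail_out, out);
        } while (zs.avail_out == 0);
    }

    /* A real implementation would finish with Z_FINISH and then write
     * the offsets[] index somewhere (trailer, side file, ...). */
    deflateEnd(&zs);
    return 0;
}

To extract article i later, you would fseek() to offsets[i], set up a matching raw inflater with inflateInit2(&zs, -15), and inflate until offsets[i+1] is reached.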

Cleanup answered 12/1, 2010 at 3:39 Comment(4)
If you want random access, you're likely going to have to chunk the input one way or another. What better way of chunking is there than by-article?Cyanohydrin
... which pretty much all compression libraries support, the article equating to a file entry.Expecting
Also... I doubt there will be much of a difference between the compression ratio for articles individually vs. the compression ratio for the whole thing, as they should have basically the same ratio of letter occurrences. Have you benchmarked that claim?Expecting
The problem with compressing articles individually is that you wind up with a whole batch of files, and that introduces inefficiencies itself. Having, say, a 1 GB file is going to be a lot more space-efficient and easier to use than having 80,000 files averaging about 10 KB each, even if the individual files achieved a greater compression ratio.Heuristic

xz-format files support an index, though by default the index is not useful: xz puts everything into a single block, so there is nothing to seek to within it. My compressor, pixz, creates files that do contain a useful index over many smaller blocks. You can use the functions in the liblzma library to find which block of xz data corresponds to which location in the uncompressed data.
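
As a sketch of the lookup step only (it assumes you have already decoded the file's index into a lzma_index, e.g. with lzma_index_decoder(); reading the stream footer is omitted, and the function name is made up), something like this finds the block that holds a given uncompressed offset:

/* Sketch: given an already-decoded lzma_index, find the xz block that
 * contains `target`, an offset into the uncompressed data. */
#include <stdio.h>
#include <lzma.h>

static void locate_block(const lzma_index *idx, lzma_vli target)
{
    lzma_index_iter iter;
    lzma_index_iter_init(&iter, idx);

    /* Returns true if target is past the end of the uncompressed data. */
    if (lzma_index_iter_locate(&iter, target)) {
        fprintf(stderr, "offset %llu is out of range\n",
                (unsigned long long)target);
        return;
    }

    printf("uncompressed offset %llu is in the block at compressed file "
           "offset %llu (block's uncompressed offset %llu, size %llu)\n",
           (unsigned long long)target,
           (unsigned long long)iter.block.compressed_file_offset,
           (unsigned long long)iter.block.uncompressed_file_offset,
           (unsigned long long)iter.block.uncompressed_size);

    /* Next step: seek to iter.block.compressed_file_offset and decode just
     * that block with lzma_block_header_decode() + lzma_block_decoder(). */
}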

Hazelwood answered 20/12, 2012 at 20:45 Comment(2)
Looks like a promising project. Thanks.Cleanup
I use pixz every day. It's awesome. Thanks!Humanist

For seekable compression built on gzip, there is dictzip from the dict server and sgzip from The Sleuth Kit.

Note that you can't write to either of these; seekable access is about reading anyway.

Feticide answered 5/8, 2010 at 16:51 Comment(0)

DotNetZip is a zip archive library for .NET.

Using DotNetZip, you can reference particular entries in the zip randomly, decompress them out of order, and get a stream that decompresses as it extracts an entry.

With the benefit of those features, DotNetZip has been used within the implementation of a Virtual Path Provider for ASP.NET that does exactly what you describe: it serves all the content for a particular website from a compressed ZIP file. You can also serve websites with dynamic (ASP.NET) pages.

ASP.NET ZIP Virtual Path Provider, based on DotNetZip

The important code looks like this:

using System;
using System.Web.Hosting;   // VirtualPathProvider, VirtualFile, VirtualDirectory

namespace Ionic.Zip.Web.VirtualPathProvider
{
    public class ZipFileVirtualPathProvider : System.Web.Hosting.VirtualPathProvider
    {
        ZipFile _zipFile;

        public ZipFileVirtualPathProvider (string zipFilename) : base () {
            _zipFile =  ZipFile.Read(zipFilename);
        }

        ~ZipFileVirtualPathProvider () { _zipFile.Dispose (); }

        public override bool FileExists (string virtualPath)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualPath, true);
            ZipEntry zipEntry = _zipFile[zipPath];

            if (zipEntry == null)
                return false;

            return !zipEntry.IsDirectory;
        }

        public override bool DirectoryExists (string virtualDir)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualDir, false);
            ZipEntry zipEntry = _zipFile[zipPath];

            if (zipEntry == null)
                return false;

            return zipEntry.IsDirectory;
        }

        public override VirtualFile GetFile (string virtualPath)
        {
            return new ZipVirtualFile (virtualPath, _zipFile);
        }

        public override VirtualDirectory GetDirectory (string virtualDir)
        {
            return new ZipVirtualDirectory (virtualDir, _zipFile);
        }

        public override string GetFileHash(string virtualPath, System.Collections.IEnumerable virtualPathDependencies)
        {
            return null;
        }

        public override System.Web.Caching.CacheDependency GetCacheDependency(String virtualPath, System.Collections.IEnumerable virtualPathDependencies, DateTime utcStart)
        {
            return null;
        }
    }
}

And VirtualFile is defined like this:

using System;
using System.Web.Hosting;   // VirtualFile

namespace Ionic.Zip.Web.VirtualPathProvider
{
    class ZipVirtualFile : VirtualFile
    {
        ZipFile _zipFile;

        public ZipVirtualFile (String virtualPath, ZipFile zipFile) : base(virtualPath) {
            _zipFile = zipFile;
        }

        public override System.IO.Stream Open () 
        {
            ZipEntry entry = _zipFile[Util.ConvertVirtualPathToZipPath(base.VirtualPath,true)];
            return entry.OpenReader();
        }
    }
}
Spur answered 5/2, 2010 at 17:50 Comment(0)

BGZF is the format used in genomics: http://biopython.org/DIST/docs/api/Bio.bgzf-module.html

It is part of the samtools C library and is really just a simple hack around gzip. You could probably rewrite it yourself if you don't want to use the samtools C implementation or the Picard Java implementation. Biopython implements a Python variant.
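
As a rough sketch of how that looks with the bgzf API from samtools/htslib (the header path and file name here are just for illustration, and error handling is omitted): you remember the "virtual offset" returned by bgzf_tell() while writing, and later bgzf_seek() straight back to it.

/* Sketch: remember a BGZF virtual offset while writing, then seek
 * straight back to it when reading. Uses the bgzf API that ships with
 * samtools/htslib (the header may be "htslib/bgzf.h" in newer versions). */
#include <stdio.h>
#include <string.h>
#include "bgzf.h"

int main(void)
{
    /* Write two "articles"; remember where the second one starts. */
    BGZF *out = bgzf_open("articles.bgz", "w");
    const char *a1 = "first article", *a2 = "second article";
    bgzf_write(out, a1, strlen(a1));
    long long mark = bgzf_tell(out);     /* virtual offset of article 2 */
    bgzf_write(out, a2, strlen(a2));
    bgzf_close(out);

    /* Later: seek directly to article 2's virtual offset and read from there. */
    BGZF *in = bgzf_open("articles.bgz", "r");
    bgzf_seek(in, mark, SEEK_SET);       /* only SEEK_SET is supported */
    char buf[64] = {0};
    bgzf_read(in, buf, sizeof buf - 1);
    printf("%s\n", buf);
    bgzf_close(in);
    return 0;
}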

Gelatin answered 13/8, 2013 at 23:35 Comment(2)
It looks like a simple wrapper library on top of zlib. Data is split into 64k blocks and compressed independently. But I guess one could get a better compression ratio.Cleanup
It is just a simple hack but it works out of the box and has a useful command line tool. You can probably get better compression but this is something that works right now.Gelatin

If individual articles are too short to get a decent compression ratio, the next-simplest approach is to tar up a batch of Wikipedia articles -- say, 12 articles at a time, or however many articles it takes to fill up a megabyte. Then compress each batch independently.

In principle, that gives better compression than compressing each article individually, but worse compression than solid compression of all the articles together. Extracting article #12 from a compressed batch requires decompressing the entire batch (and then throwing the first 11 articles away), but that's still much, much faster than decompressing half of Wikipedia.

Many compression programs break up the input stream into a sequence of "blocks", and compress each block from scratch, independently of the other blocks. You might as well pick a batch size about the size of a block -- larger batches won't get any better compression ratio, and will take longer to decompress.
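
A minimal sketch of that batch scheme, assuming zlib's one-shot compress2()/uncompress() calls; the BatchEntry index, helper names, and file layout are made up for illustration, and error handling is omitted:

/* Sketch: compress fixed-size batches of articles independently and keep
 * an index of (compressed offset, compressed size, uncompressed size)
 * per batch, so reading one article only decompresses its batch. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

typedef struct {
    long   comp_offset;   /* where the batch starts in the archive file */
    uLong  comp_size;     /* compressed size of the batch */
    uLong  orig_size;     /* uncompressed size of the batch */
} BatchEntry;

/* Append one batch (already-concatenated articles) to the archive and
 * fill in its index entry. */
static int write_batch(FILE *archive, const unsigned char *batch,
                       uLong batch_len, BatchEntry *entry)
{
    uLong bound = compressBound(batch_len);
    unsigned char *comp = malloc(bound);
    uLongf comp_len = bound;

    if (compress2(comp, &comp_len, batch, batch_len,
                  Z_BEST_COMPRESSION) != Z_OK) {
        free(comp);
        return -1;
    }
    entry->comp_offset = ftell(archive);
    entry->comp_size   = comp_len;
    entry->orig_size   = batch_len;
    fwrite(comp, 1, comp_len, archive);
    free(comp);
    return 0;
}

/* Read back one whole batch; the caller then picks out the article it
 * wants from the decompressed buffer and discards the rest. */
static unsigned char *read_batch(FILE *archive, const BatchEntry *entry)
{
    unsigned char *comp = malloc(entry->comp_size);
    unsigned char *orig = malloc(entry->orig_size);
    uLongf orig_len = entry->orig_size;

    fseek(archive, entry->comp_offset, SEEK_SET);
    fread(comp, 1, entry->comp_size, archive);
    uncompress(orig, &orig_len, comp, entry->comp_size);
    free(comp);
    return orig;                 /* caller frees; holds the whole batch */
}

The per-batch index can be written at the end of the archive or kept in a side file; looking up an article is then just a matter of knowing which batch it is in and where that batch starts.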

I have experimented with several ways to make it easier to start decoding a compressed database in the middle. Alas, so far the "clever" techniques I've applied still give a worse compression ratio and take more operations to produce a decoded section than the much simpler "batch" approach.

For more sophisticated techniques, you might look at

Juncaceous answered 12/1, 2010 at 3:39 Comment(0)

You haven't specified your OS. Would it be possible to store your file in a compressed directory managed by the OS? Then you would have the "seekable" portion as well as the compression. The CPU overhead would be handled for you, though with unpredictable access times.

Worldwide answered 12/1, 2010 at 3:48 Comment(4)
I'd prefer a library that is portable across different OSes. A compressed file system is certainly a solution, but does it perform well (in terms of speed and memory) under random access?Cleanup
You're trading space for speed. Compression costs.Worldwide
@NoRefundsNoReturns These days disk drives are so slow compared to CPUs that reading from compressed file systems is faster (at least on ZFS here), unless you already have a per-CPU load greater than 1.Mantra
Thank you. I can't wait to see what the 2030 readers will have to say. Is there a badge for trolling a decade-old comment? If not can I suggest Through the years for this achievement?Worldwide

I'm using MS Windows Vista, unfortunately, and I can send Windows Explorer into zip files as if they were normal folders. Presumably that still works on Windows 7 (which I'd like to be on). I think I've done the same with the corresponding utility on Ubuntu, too, but I'm not sure. I could also test it on Mac OS X, I suppose.

Heuristic answered 5/2, 2010 at 17:58 Comment(0)
