Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.

I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.

What I am looking for is to compress it with a good compression algorithm (like, say, gzip) together with markers that enable very occasional random access. Random access as in "read the byte at location [64-bit address] in the original (uncompressed) stream". This is a little different from the classic deflate libraries like zlib, which only let you decompress the stream sequentially. What I would like is random access at a latency of, say, at most 1 MB of decompression work per byte read.

Of course, I hope to use an existing library rather than reinvent the NIH wheel.
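
To make the contract concrete, I mean something along these lines (a purely hypothetical interface sketch, just to illustrate the requirement; none of these names exist in any library I know of):

/* Hypothetical interface sketch -- not an existing library. */
#include <stdint.h>
#include <stddef.h>

typedef struct compressed_reader compressed_reader;  /* opaque handle */

/* Open a compressed file that carries periodic seek markers. */
compressed_reader *cr_open(const char *path);

/* Read len bytes starting at offset in the *uncompressed* stream.
 * Each call should cost no more than about 1 MB of decompression work. */
size_t cr_pread(compressed_reader *r, void *buf, size_t len, uint64_t offset);

void cr_close(compressed_reader *r);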

Nursemaid answered 1/7, 2010 at 16:8 Comment(0)

If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.

Talaria answered 7/11, 2010 at 6:46 Comment(0)

Byte Pair Encoding allows random access to data.

You won't get compression that is quite as good, but you're sacrificing adaptive (variable) hash trees for a single tree, which is what makes the random access possible.

However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index entry for every 1 MB of uncompressed data. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.
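
For a sense of scale, here is a rough sketch of what a per-megabyte index could look like; the struct and field names are only illustrative and not part of any particular library:

#include <stdint.h>

/* One entry per 1 MB of uncompressed data.  For the 500 MB file in the
 * question that is only about 500 entries, so the index itself is tiny
 * and could even be gzip-compressed as suggested below. */
struct block_index_entry {
    uint64_t uncompressed_offset;  /* always a multiple of 1 MB */
    uint64_t compressed_offset;    /* where this block starts in the compressed file */
};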

One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.

If it's accessed rarely, you could compress the index with gzip and decode it when needed.

Ardatharde answered 1/7, 2010 at 16:18 Comment(1)
Can you recommend me an existing implementation?Nursemaid

If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).
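
For example, the front-end arithmetic could be as small as this sketch (the chunk size and member-naming scheme are assumptions; use whatever convention you pick when splitting):

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (1024 * 1024)  /* 1 MB of uncompressed data per chunk */

/* Map an offset in the original stream to (archive member name, offset
 * within that member). */
static void locate(uint64_t offset, char *name, size_t name_len,
                   uint64_t *offset_in_chunk)
{
    uint64_t chunk = offset / CHUNK_SIZE;
    *offset_in_chunk = offset % CHUNK_SIZE;
    snprintf(name, name_len, "chunk_%06llu", (unsigned long long)chunk);
}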

Edit: Yes, there is existing code to handle this. Recent versions of Info-ZIP's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file and the name of one of the files in that archive that you want to retrieve. You then get a buffer holding the contents of that file. For updating, you'll need the api.c from Zip 3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).

Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).
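
As a rough illustration of the background-process variant, something like this pipes unzip -p (extract one archive member to stdout) through popen(); the archive and member names are whatever your chunking scheme produced:

#include <stdio.h>

/* Read len bytes located skip bytes into one archive member, by piping
 * the output of unzip -p through popen().  Crude, but needs no library. */
static size_t read_from_member(const char *archive, const char *member,
                               long skip, void *buf, size_t len)
{
    char cmd[512];
    snprintf(cmd, sizeof cmd, "unzip -p '%s' '%s'", archive, member);

    FILE *p = popen(cmd, "r");
    if (!p)
        return 0;

    /* The pipe is not seekable, so read and discard the bytes before
     * the region we want. */
    char junk[4096];
    while (skip > 0) {
        size_t want = skip > (long)sizeof junk ? sizeof junk : (size_t)skip;
        size_t got = fread(junk, 1, want, p);
        if (got == 0)
            break;
        skip -= (long)got;
    }

    size_t n = fread(buf, 1, len, p);
    pclose(p);
    return n;
}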

Ditheism answered 1/7, 2010 at 16:28 Comment(2)
That's a good idea. Unfortunately, I don't know of any existing libraries that do this, but it would be easy to implement with a wrapper class.Ardatharde
Editing the data might be a little trickier, but it's doable.Ardatharde

Take a look at my project - csio. I think it is exactly what you are looking for: stdio-like interface and multithreaded compressor included.

It is a library, written in C, that provides a CFILE structure and the functions cfopen, cfseek, cftello, and others. You can use it with regular (uncompressed) files and with files compressed by the dzip utility. This utility is included in the project and written in C++. It produces a valid gzip archive, which can be handled by standard utilities as well as by csio. dzip can compress using many threads (see the -j option), so it can compress very big files very quickly.

Typical usage:

dzip -j4 myfile

...

CFILE* file = cfopen("myfile.dz", "r");  /* open the dzip-compressed file */
off_t some_offset = 673820;
cfseek(file, some_offset);               /* seek to an offset in the uncompressed stream */
char buf[100];
cfread(buf, 100, 1, file);               /* read 100 bytes from that position */
cfclose(file);

It is MIT licensed, so you can use it in your projects without restriction. For more information, visit the project page on GitHub: https://github.com/hoxnox/csio

Vitascope answered 18/6, 2015 at 4:42 Comment(0)

Compression algorithms usually work in blocks, I think, so you might be able to come up with something based on block size.

Hertfordshire answered 1/7, 2010 at 16:13 Comment(2)
While the raw data is processed in blocks, the compressed data blocks aren't the same size so you can't jump around in the compressed data to find a specific block.Ardatharde
If the file is divided into blocks that each represent 65,536 bytes of source data, one can spend four bytes per block on a table saying where each one starts. One could just as well use a block size of 1,048,576 bytes, but even with 64K blocks a half-gig file would only need a 32 KB table.Berky
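
A minimal sketch of the table idea from the comment above, assuming one 32-bit entry per 64 KB block that records where the block's compressed data starts (all names here are illustrative):

#include <stdint.h>

#define BLOCK_BYTES 65536u  /* uncompressed bytes represented by each block */

/* With 4 bytes per entry, a 512 MB file needs 8192 entries = 32 KB of table. */
struct block_table {
    uint32_t *start;    /* start[i] = offset of block i within the compressed file */
    uint32_t  nblocks;
};

/* Which block holds uncompressed position pos, and where does its compressed
 * data begin?  Decompressing that single block reaches the wanted byte. */
static uint32_t compressed_start(const struct block_table *t, uint64_t pos,
                                 uint32_t *offset_in_block)
{
    uint32_t block = (uint32_t)(pos / BLOCK_BYTES);
    *offset_in_block = (uint32_t)(pos % BLOCK_BYTES);
    return t->start[block];
}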

I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.

The Boost library has been accepted by the C++ standards committee as part of TR2, so it will eventually be built into most compilers (under std::tr2::sys). It is also cross-platform.

Boost Releases

Boost Getting Started Guide NOTE: Only some parts of boost::iostreams are header-only and require no separately-compiled library binaries or special treatment when linking.

Ambulator answered 1/7, 2010 at 16:14 Comment(1)
boost::iostreams is NOT a header only library; though parts of it are.Camargo
  1. Sort the big file first.
  2. Divide it into chunks of your desired size (1 MB), with a sequence number in each name (File_01, File_02, ..., File_NN).
  3. Take the first ID from each chunk, plus the chunk's file name, and put both into another file (the index file).
  4. Compress the chunks.
  5. You will be able to search the index file of IDs with whatever method you wish, perhaps a binary search, and open each chunk as you need it.

If you need deeper indexing, you could use a B-tree algorithm with the "pages" being the files. Several implementations of this exist on the web, because the code is a little tricky.
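
A rough sketch of the flat-index lookup from step 5, assuming the ID file has been loaded into an array sorted by first ID (the struct and names are made up for illustration):

#include <stdint.h>
#include <stddef.h>

/* One record per chunk: the first ID it contains and the chunk's file name. */
struct chunk_ref {
    uint64_t first_id;
    char     filename[32];   /* e.g. "File_01" */
};

/* Entries are sorted by first_id; the chunk containing id is the last
 * entry whose first_id is <= id. */
static const struct chunk_ref *
find_chunk(const struct chunk_ref *idx, size_t n, uint64_t id)
{
    if (n == 0 || id < idx[0].first_id)
        return NULL;
    size_t lo = 0, hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].first_id <= id)
            lo = mid;
        else
            hi = mid;
    }
    return &idx[lo];
}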

Forrest answered 1/7, 2010 at 16:47 Comment(2)
I don't think we'd need multiple compressed files if we had the index.Ardatharde
Gustavo, thanks, but I know how to design it myself. I need someone to point me to an existing library... I cannot be the first one confronting this..Nursemaid

You could use bzip2 and make your own API pretty easily, based on James Taylor's seek-bzip2.

Canzona answered 17/12, 2010 at 1:18 Comment(0)
