An efficient compression algorithm for short text strings [closed]

Asked 16/7, 2009 at 15:15 Answered 7/8, 2011 at 10:33

153

I'm searching for an algorithm to compress small text strings: 50-1000 bytes (e.g. URLs). Which algorithm works best for this?

Lolanthe answered 16/7, 2009 at 15:15 Comment(11)

Where do you want to use these compressed strings? – Permanganate 16/7, 2009 at 15:17

Is this going towards tinyurls or something to do with storage space? – Yumuk 16/7, 2009 at 15:21

No, tinyurls is not the answer here. – Lolanthe 16/7, 2009 at 15:23

Would you care to elaborate basilio? What do you want to get or, what is this compression targeted towards? – Yumuk 16/7, 2009 at 15:33

I'm interested in an algorithm for compressing URLs; the best compression ratio is more important than running cost. Not interested in online services like tinyurls or tr.im. I'm looking for an algorithm, not a service. Don't think any other info could be useful... – Lolanthe 16/7, 2009 at 15:40

@Gumbo: "Text compression algorithms for short strings" is enough for finding algos, why are you so interested in knowing what they are for? I'm sure the OP will be able to find the one that does what he wants. – Carnauba 16/7, 2009 at 16:51

Is this a duplicate of "Compression of ASCII strings in C"? – Antiperistalsis 29/7, 2010 at 19:40

I have a similar problem to the OP - I am storing application state information in the location hash and would like to compress it to shorten it. – Delaware 27/10, 2010 at 18:41

@Vasily, a small hint: Whenever you're asking a question on SO in the form of, "What is the best XYZ?", your question is almost bound to receive votes for closing because asking for the best might lead to unnecessary product comparisons, or in the worst case, even flame wars. (It usually takes only a very small change to avoid that: If you asked the same question like, "Please suggest a XYZ.", you wouldn't get as many closing votes, even though it's basically the same question!) – Suzetta 16/9, 2011 at 20:17

~50% compression for URLs - see blog.alivate.com.au/packed-url – Expatriate 7/6, 2018 at 23:58

15 years and still no real answer .. nice – Rocco 30/4 at 2:4

Check out Smaz:

Smaz is a simple compression library suitable for compressing very short strings.

Cartan answered 16/7, 2009 at 15:46 Comment(5)

See github.com/antirez/smaz/blob/master/smaz.c -- this is a variant of coding, not compression per se (at least not entirely). He uses a static word and letter dictionary. – Delaware 27/10, 2010 at 18:46

Note: This is antirez's project. He's one of the principal authors of Redis and has a very strong reputation of releasing high quality, production code. – Cuevas 5/3, 2014 at 23:54

The smaz algorithm is optimized for English texts, therefore does not work well for random strings. Here are some samples (string:orig_size:compr_size:space_savings): This is the very end of it.:27:13:52%, Lorem ipsum dolor sit amet:26:19:27%, Llanfairpwllgwyngyll:20:17:15%, aaaaaaaaaaaaa:13:13:0%, 2BTWm6WcK9AqTU:14:20:-43%, XXX:3:5:-67% – Psychognosis 23/3, 2014 at 11:41

Also take a look at shoco, a faster algorithm with a lower compression ratio: ed-von-schleck.github.io/shoco – Volkman 7/11, 2014 at 2:54

Add my library Unishox to the list github.com/siara-cc/unishox. It performs better than Smaz and Shoco and supports compressing UTF-8 strings. – Disjoin 15/2, 2020 at 12:29

Huffman has a static cost, the Huffman table, so I disagree it's a good choice.

There are adaptive versions which do away with this, but the compression rate may suffer. Actually, the question you should ask is "what algorithm to compress text strings with these characteristics". For instance, if long repetitions are expected, simple Run-Length Encoding might be enough. If you can guarantee that only English words, spaces, punctuation and the occasional digits will be present, then Huffman with a pre-defined Huffman table might yield good results.

Generally, algorithms of the Lempel-Ziv family have very good compression and performance, and libraries for them abound. I'd go with that.

Given that what's being compressed are URLs, I'd suggest that, before compressing (with whatever algorithm is easily available), you CODIFY them. URLs follow well-defined patterns, and some parts of them are highly predictable. By making use of this knowledge, you can codify the URLs into something smaller to begin with, and ideas behind Huffman encoding can help you here.

For example, translating the URL into a bit stream, you could replace "http" with the bit 1, and anything else with the bit "0" followed by the actual protocol (or use a table to get other common protocols, like https, ftp, file). The "://" can be dropped altogether, as long as you can mark the end of the protocol. Etc. Go read about the URL format, and think about how URLs can be codified to take less space.
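As a rough illustration of the idea, here is a minimal Python sketch that works at byte rather than bit granularity, so it is coarser than the scheme described above. The `SCHEMES` table and the function names `encode_url` / `decode_url` are made up for this example; a real codec would go on to codify hosts, common TLDs, path separators and so on.

```python
# Hypothetical table of common schemes; the position in the list (plus one)
# is the one-byte code written in place of "scheme://".
SCHEMES = ["http", "https", "ftp", "file"]


def encode_url(url: str) -> bytes:
    """Replace a known 'scheme://' prefix with a single code byte."""
    scheme, sep, rest = url.partition("://")
    if sep and scheme in SCHEMES:
        return bytes([SCHEMES.index(scheme) + 1]) + rest.encode("utf-8")
    # Unknown or missing scheme: code byte 0, then the URL unchanged.
    return b"\x00" + url.encode("utf-8")


def decode_url(data: bytes) -> str:
    """Inverse of encode_url."""
    code, body = data[0], data[1:].decode("utf-8")
    if code == 0:
        return body
    return SCHEMES[code - 1] + "://" + body


if __name__ == "__main__":
    original = "http://example.com/a"
    packed = encode_url(original)
    print(len(original), "->", len(packed))
    assert decode_url(packed) == original
```

The output of this step can still be fed to a general-purpose compressor afterwards; the point is only that the predictable parts never reach it.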

Naidanaiditch answered 16/7, 2009 at 15:27 Comment(3)

Not if the huffman table is the same for all files, which would make sense if the files are all similar to each other. – Bolivia 16/7, 2009 at 15:45

If you have many, similar, small files, you are doing it all wrong. First, concatenate them all (like tar does), and then compress that. You'll get better compression, and the problem ceases to be "50-1000 bytes". – Naidanaiditch 16/7, 2009 at 15:51

@Daniel: depends whether you want random access to the compressed data. Compressing it all together prevents that with most compression systems. – Airport 16/7, 2009 at 16:19

I don't have code to hand, but I always liked the approach of building a 2D lookup table of size 256 * 256 chars (RFC 1978, PPP Predictor Compression Protocol). To compress a string you loop over each char and use the lookup table to get the 'predicted' next char using the current and previous char as indexes into the table. If there is a match you write a single 1 bit, otherwise write a 0, the char and update the lookup table with the current char. This approach basically maintains a dynamic (and crude) lookup table of the most probable next character in the data stream.

You can start with a zeroed lookup table, but obviously it works best on very short strings if it is initialised with the most likely character for each character pair, for example, for the English language. So long as the initial lookup table is the same for compression and decompression you don't need to emit it into the compressed data.

This algorithm doesn't give a brilliant compression ratio, but it is incredibly frugal with memory and CPU resources and can also work on a continuous stream of data - the decompressor maintains its own copy of the lookup table as it decompresses, thus the lookup table adjusts to the type of data being compressed.
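A minimal Python sketch of the scheme as described in this answer (not the exact RFC 1978 wire format, which hashes the context and groups the flags into bytes). The function names and the optional `table` argument are assumptions for illustration, and the original length is passed to the decompressor separately because the final byte is padded with zero bits.

```python
def _new_table(table):
    # Copy a primed table if given, otherwise start with 256*256 zeroed slots.
    return bytearray(table) if table is not None else bytearray(256 * 256)


def predictor_compress(data: bytes, table=None) -> bytes:
    table = _new_table(table)
    prev2 = prev1 = 0
    bits = []
    for ch in data:
        idx = prev2 * 256 + prev1          # context = the two previous bytes
        if table[idx] == ch:
            bits.append(1)                 # prediction hit: a single 1 bit
        else:
            bits.append(0)                 # miss: 0 bit, then the literal byte
            bits.extend((ch >> s) & 1 for s in range(7, -1, -1))
            table[idx] = ch                # learn the new prediction
        prev2, prev1 = prev1, ch
    bits += [0] * (-len(bits) % 8)         # pad to a whole number of bytes
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
        for k in range(0, len(bits), 8)
    )


def predictor_decompress(packed: bytes, length: int, table=None) -> bytes:
    table = _new_table(table)
    bits = [(byte >> s) & 1 for byte in packed for s in range(7, -1, -1)]
    prev2 = prev1 = 0
    pos = 0
    out = bytearray()
    while len(out) < length:
        idx = prev2 * 256 + prev1
        if bits[pos] == 1:
            ch = table[idx]                # hit: take the predicted byte
            pos += 1
        else:
            ch = 0
            for bit in bits[pos + 1:pos + 9]:
                ch = (ch << 1) | bit       # miss: read the 8-bit literal
            pos += 9
            table[idx] = ch                # mirror the compressor's update
        out.append(ch)
        prev2, prev1 = prev1, ch
    return bytes(out)


if __name__ == "__main__":
    sample = b"abababababababab"
    packed = predictor_compress(sample)
    print(len(sample), "->", len(packed))
    assert predictor_decompress(packed, len(sample)) == sample
```

With a zeroed table, every miss costs 9 bits, so a short string with little repetition comes out larger than it went in; priming the table is what makes this worthwhile for short inputs.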

Sale answered 16/7, 2009 at 16:45 Comment(4)

But how would predictor behave with normal English sentence? The given example has very strong redundancy, and the gain is minimal. – Herzig 14/1, 2015 at 10:14

A 256*256 lookup table doesn't sound "incredibly frugal with memory" ...! – Entelechy 24/2, 2017 at 11:16

@Entelechy Well it's 65 kilobytes. – Sale 24/2, 2017 at 12:13

@Sale If it had been 65 bytes I might have agreed ! – Entelechy 27/2, 2017 at 11:37

Any algorithm/library that supports a preset dictionary, e.g. zlib.

This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all URLs, all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.

Every compression algorithm will save space if the same substring is repeated multiple times in one input file (e.g. "the" in English text or "int" in C code.)

But in the case of URLs, certain strings (e.g. "http://www.", ".com", ".html", ".aspx") will typically appear once in each input file. So you need to share them between files somehow rather than having one compressed occurrence per file. Placing them in a preset dictionary will achieve this.
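For instance, Python's zlib module exposes preset dictionaries through the `zdict` argument of `zlib.compressobj` and `zlib.decompressobj`. The sketch below is only an illustration: the contents of `ZDICT` and the helper names are assumptions, and a useful dictionary would be built from a sample of your real URLs.

```python
import zlib

# Hypothetical preset dictionary. The zlib documentation recommends placing
# the most commonly used strings towards the end of the dictionary.
ZDICT = b".aspx.html.comhttp://www."


def compress_url(url: str) -> bytes:
    comp = zlib.compressobj(level=9, zdict=ZDICT)
    return comp.compress(url.encode("utf-8")) + comp.flush()


def decompress_url(blob: bytes) -> str:
    # The decompressor must be given exactly the same dictionary.
    decomp = zlib.decompressobj(zdict=ZDICT)
    return (decomp.decompress(blob) + decomp.flush()).decode("utf-8")


if __name__ == "__main__":
    url = "http://www.example.com/index.html"
    with_dict = compress_url(url)
    without_dict = zlib.compress(url.encode("utf-8"), 9)
    print("original:", len(url))
    print("zlib without dictionary:", len(without_dict))
    print("zlib with dictionary:", len(with_dict))
    assert decompress_url(with_dict) == url
```

The dictionary itself is never stored in the output, so both sides have to ship the same `ZDICT` out of band.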

Bolivia answered 16/7, 2009 at 15:42 Comment(1)

Tips on using the custom dictionary: stackoverflow.com/questions/2011653 – Palaeozoology 24/7, 2015 at 23:34

Huffman coding generally works okay for this.

Garrick answered 16/7, 2009 at 15:21 Comment(2)

This is not a link-only answer; without the link, it's still a valid answer. – Alphabetical 14/1, 2015 at 12:26

..and still not a good answer. (Not enough relevant information brought in.) – Handling 3/11, 2018 at 2:33

You might want to take a look at Standard Compression Scheme for Unicode.

SQL Server 2008 R2 uses it internally and can achieve up to 50% compression.

Lackey answered 7/8, 2011 at 10:33 Comment(1)

SCSU 'compresses' non-English Unicode in UTF-16/MB encodings. If English-based Unicode / plain-old-ASCII, UTF-8 also 'compresses' 50% of UTF-16.. – Handling 3/11, 2018 at 2:23

If you are talking about actually compressing the text, not just shortening it, then Deflate, gzip (a wrapper around Deflate) and zip work well for smaller files and text. Other algorithms, such as bzip2, are highly efficient for larger files.

Wikipedia has a list of compression times. (look for comparison of efficiency)

Name       | Text         | Binaries      | Raw images
-----------+--------------+---------------+-------------
7-zip      | 19% in 18.8s | 27% in  59.6s | 50% in 36.4s
bzip2      | 20% in  4.7s | 37% in  32.8s | 51% in 20.0s
rar (2.01) | 23% in 30.0s | 36% in 275.4s | 58% in 52.7s
advzip     | 24% in 21.1s | 37% in  70.6s | 57% in 41.6s
gzip       | 25% in  4.2s | 39% in  23.1s | 60% in  5.4s
zip        | 25% in  4.3s | 39% in  23.3s | 60% in  5.7s
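These figures are for large files. For strings of 50-1000 bytes the fixed framing overhead matters much more, so it is worth measuring on your own data. A small Python sketch for doing that (the `sizes` helper is made up for this example) compares raw Deflate, zlib and gzip on the same input:

```python
import gzip
import zlib


def sizes(text: str) -> dict:
    """Return the encoded size of `text` under three Deflate framings."""
    raw = text.encode("utf-8")
    # Negative wbits asks zlib for a raw Deflate stream with no header/trailer.
    deflater = zlib.compressobj(level=9, wbits=-15)
    return {
        "original": len(raw),
        "raw deflate": len(deflater.compress(raw) + deflater.flush()),
        "zlib": len(zlib.compress(raw, 9)),   # adds a 2-byte header + 4-byte checksum
        "gzip": len(gzip.compress(raw)),      # adds a larger header and trailer
    }


if __name__ == "__main__":
    for name, size in sizes("http://www.example.com/index.html").items():
        print(f"{name:12} {size}")
```

On a single short URL, expect all three to come out no smaller than the input; the useful comparison is how much each wrapper adds on top of the raw stream.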

Verisimilar answered 16/7, 2009 at 15:24 Comment(5)

He wants to compress text and not files. – Permanganate 16/7, 2009 at 15:29

You can compress text and binaries with these algorithms. In fact we use deflate within a cms system that runs in python. – Verisimilar 16/7, 2009 at 16:8

An example in C# using gzip for strings is here: csharphelp.com/archives4/archive689.html – Verisimilar 16/7, 2009 at 16:10

zlib module in python for compressing strings: python.org/doc/2.5.2/lib/module-zlib.html – Verisimilar 16/7, 2009 at 16:11

gzip (and zlib) uses deflate and adds wrapper/framing overhead.. direct deflate/LZ77 (dictionary overhead and efficiency still depend on the implementation and its settings) can reduce the break-even overhead. This is for "short" strings in the dozens to hundreds of characters, of course (there should still be a bit to indicate "was this compressed?" to avoid enlarging data). Larger extra overhead doesn't matter.. as text increases. The numbers posted here appear to be for large text files (many seconds to run!), while OP asks for 50-1000 characters - very small in comparison. – Handling 3/11, 2018 at 2:29
