Is there a way to compress a string into a smaller string with reversibility?

H

4

7

I am trying to transmit strings over the iridium network, and the costs of sending data is pretty large. I am wondering if there is a way to compress a large string, for example: {"packet":01,"reporting time":1500, "altitude":6500,"latitude":0,"longitude": 0,"ballast":34,"parachute":0}

into a much smaller string, like: f5fk43d2 . The process must be reversible, so that the data can be decoded and read on the other end. Is this possible, if so, how would I go about doing this.

I have tried this answer by j.w.r: Shortening a string in Java , however it seems irreversible. It does convert a large string into a smaller one.

The process must result in a string smaller than the original.

Any help is appreciated!

Hag answered 26/12, 2018 at 1:1 Comment(5)

You could look at a common compression scheme like deflate (used for .zip files) which has an implementation in Java’s standard libraries (Deflater) – Sulfonmethane 26/12, 2018 at 1:8

Have you considered using something like MessagePack? Or maybe just simply compressing the text through a zip compression and converting it to base64? – Staphylo 26/12, 2018 at 1:9

CBOR could be an option. And then base64 it if you don't want to send binary – Dive 26/12, 2018 at 1:10

If you had a limited number of possible values for the parts in quotes, and also knew that the parts after each colon was a number within a certain range; then you could take advantage of that to express this in a shorter string. – Crannog 26/12, 2018 at 1:35

Something I did was convert the string to a zip stream, and convert the bytes back into base64 string. Then can convert back to bytes, unzip, and there's your string. I did that in C# though :( no java code. I also added a param "max length" so an exception is thrown if I went over the length string I wanted. – Webb 13/12, 2022 at 21:31

C

0

{"packet":01,"reporting time":1500, "altitude":6500,"latitude":0,"longitude": 0,"ballast":34,"parachute":0}

compresses to 01,1500,65000,0,34,0 effortless.

But you should not be sending text, you should be sending data. Using text "01" is at minimum 2 bytes. But a single byte can store up to the number 255. I would do something like
1 (1 byte representing the packet number 0-255)
1500 (3 bytes representing the reporting time 0-16,777,215)
...
You dont need to use whole bytes, if the data will fit in less.

Some things to consider: if you know the bounds of all the data, you dont need separators (,)
If you have a bunch of variables you might send or not, you can include a name byte or name 4 bits(packet could be variable b0000, altitude could be variable b0011) that goes before every variable, or one that goes at the front of the message telling you which type of message you are sending.

If you must send the variable names, as they are dynamically created. You can compress based on the allowed characters. For example, if you are only using lower case and space, like your example, that is 27 characters which will fit in under 5 bits per character. Basically, just treat the string as base 27 integer, and it compresses and decompresses easily.
"abcd" => 1, 2, 3, 4 => 1*27^3 + 2*27^2 + 3*27^1 + 4 => 21226 or 2 bytes instead of 4

Crossed answered 1/2 at 0:5 Comment(0)

G

5

First, hopefully it's clear that there does not exist any lossless compression algorithm that can take an arbitrary string of length n and always compress it to a unique, shorter string. That's a fact of math.

That being said, there's some popular algorithms that work pretty well:

Huffman encoding: fairly beginner-friendly and possible to implement yourself. The basic idea is to map more common characters to shorter binary strings and less common ones to longer binary strings, and then package that with a map that tells you how to decode the resultant bitstring. The downside is the extra space you need to store the decoding instructions

Lempel-Ziv: I've never implemented this myself, but it's the basis for a lot of common file formats we know today, like GIFs. There should be libraries out there for this.

Granados answered 26/12, 2018 at 1:29 Comment(3)

First, hopefully it's clear that there does not exist any lossless compression algorithm that can take an arbitrary string of length n and always compress it to a unique, shorter string. That's a fact of math. - there are tons of lossless compression methods. They are simply less effectively in the amount of characters that they could remove. – Lymphosarcoma 26/12, 2018 at 2:49

@Lymphosarcoma Of course it's a fact of math. Compression algorithms rely on certain strings being more common than others. They make the "common" strings shorter, but make the less common strings longer. There is absolutely no reversible algorithm that makes every possible string shorter. If there were, you could simply apply it over and over until the string becomes a single bit, then magically transform either 0 or 1 into the complete works of Shakespeare. – Crannog 26/12, 2018 at 3:43

@Lymphosarcoma Yes, but there are no "perfect" lossless compression methods. To see this, take a bitstring of length n. There's 2^n such possible strings. There are 2^(n-1) + 2^(n-2) + ... + 2^0 possible strings shorter than that. That sums to 2^n - 1 strings, so by the piegonhole principle, either two strings have the same compressed form (uh-oh — this is the same idea as a hash collision) — or some compressed form is longer than the original, which can happen easily in Huffman encoding when storing the map if your original string is short and made of a lot of unique characters. – Granados 26/12, 2018 at 3:47

H

4

Consider the mathematics of attempting to convert some X-character string to a Y-character string, such that X > Y (i.e. you're trying to shorten the length of the string).

Then, let's say that the string is alphanumeric; this gives us 26 possible lowercase letters, 26 possible uppercase letters, and 10 possible numbers that we can use (i.e. 62 possibilities). This means that for an X-character string, we would have 62^X possible strings, and for a Y-character string, we would have 62^Y possible strings.

Now, consider if we try to map all of our X-character strings to our Y-character strings. Let's let the function f(S) map a string S (an X-character string) to a Y-character string. Then, because X > Y, we will necessarily have to map some X-character strings to some of the same Y-character strings. Consider the following simple example:

X = 3. Y = 2. Then, we have 62^3 possible 3-character strings (238,000), and 62^2 (3800) possible Y-character strings. Then, we have 234,000 more 3-character strings than 2-character strings.

Now, imagine we tried to have some function f(S) where we tried to make every 3-character string into a 2-character string. Then, we'd naturally have an issue when we tried to convert a 2-character string back into a 3-character string, because this means that f(S) must convert some 3-character strings into the same string (so we couldn't know which one to map back to!). This is because the domain of 2-character strings is less than the domain of 3-character strings (and occurs because f(S) then cannot be injective, meaning there is no valid inverse).

Thus, there aren't enough 2-character strings to possibly map back to every 3-character string, and you'll find that this generalizes to all X > Y.

You could possibly restrict some characters from the domain of your larger strings, though exactly as you have stated the problem, this is not possible.

Edit, because I feel as though I should mention this: There are algorithms used to compress strings of lesser characters to smaller strings of more characters. With that being said, I'd recommend taking a look at this: An efficient compression algorithm for short text strings

Haggar answered 26/12, 2018 at 1:22 Comment(1)

Have you worked with Smaz, referenced in the answer you linked? – Hag 26/12, 2018 at 22:23

W

0

Let's start with your example as a characterization of your vague "much smaller". You compress 107 characters (856 bits) into eight alphanumeric characters, which appear anyway to be limited to 36 possibilities for each character. I'll be generous, and assume that capital letters are permitted as well, and maybe two punctuation characters for spice, bumping it up to 64 possible characters. So that's six bits per character times eight characters, or 48 bits. That's a factor of 18 compression. No, you're not going to get that losslessly, at least not without a huge amount of redundancy in the data not evidenced in the example. I'll be generous again and assume the messages being compressed are limited to 96 possible ASCII characters (say, removing 127 and including new line). Then the message is 705 bits, with a factor of almost 15 compression to get to 48 bits. Still not happening.

Lossless compression comes from statistical bias and redundancy. Statistical bias is the prevalence of some symbols over others, and redundancy is repeated patterns in the data, e.g. repeated substrings, like "itude" and "500," in your example. To get good compression you need those things to exploit, and you need a lot of data to be able to take advantage of them. Short strings like your example will hardly compress or often not compress at all, if taken in isolation.

What you could try would be to maintain a compression context and associated decompressed context on the other end, through which you send a series of messages in a well-defined order. I.e. they need to be decompressed in the same order that they were compressed. Then you would be able to take advantage of redundancy and bias over many messages, and possibly get some decent compression. If those same JSON properties keep reappearing, and better yet, if they often have the same values, then you could get significant compression.

The flush operation of, for example, zlib, would allow sending the data compressed so far in order to avoid latency that the compressor would otherwise introduce to build up a block. You would want to avoid flushes if possible, since they reduce compression. So you could have a time limit for how long you're willing to wait for another message to transmit before flushing out the last one to send.

Wystand answered 26/12, 2018 at 22:28 Comment(0)

C

0