Algorithm for generating a unique (constant) code for a string which should be reversible

Asked 30/3, 2012 at 8:20 Answered 12/3, 2014 at 23:38

The requirement:

We have values in DB like

Chennai
Baroda
Bangalore
New Delhi
São Paulo, Lisboa
San Jose

etc...

So I want to convert these string to a unique short string. For example

Chennai –> xy67kr

San Jose –> iuj73d

basically something similar to URL shortner.

And the algorithm to convert this should be reversible.. i.e when I pass "xy67kr" to a decode function it should give me back "Chennai".

Looking forward for help.

Bendix answered 30/3, 2012 at 8:20 Comment(10)

Do the strings need to be of a fixed length? – Centonze 30/3, 2012 at 8:21

If you have a database, then the processing of reversal should be pretty easy... – Ferrous 30/3, 2012 at 8:21

1 - Strings are not fixed length. Max length = 200 characters 2 - I want to avoid DB call. That is the reason I want to generate an algorithm. Which can be used in DB to encode the strings. Same algorithm can be used to decode and get real value in my web application – Bendix 30/3, 2012 at 8:22

@taher: There is no function that can shorten arbitrary strings in a way that can be reversed (due to the pigeonhole principle). Unless you can place heavy constraints one the values of your input strings, you will have to use some kind of lookup mechanism. – Ferrous 30/3, 2012 at 8:31

I don't get it.. you have it in DB but you don't want to use the DB? – Indue 30/3, 2012 at 8:39

what you can do is compress the strings with something like huffman coding for example - but this only makes sense if your strings are quite long. In the case of "Chennai", "Baroda", etc. it will make things worse. – Arrest 30/3, 2012 at 8:40

Agree with @Oli. There probably are constraints though (eg. characters being a-z thereby limiting range or pigeons) – Novitiate 30/3, 2012 at 8:47

@KarolyHorvath: We have two systems. One which works in backend and is responsible for encoding the string and storing in DB. We have a SOLR indexer which indexes everything and our front end website uses values from index files (also the codes). To get original value from code, I do not want to go back to DB. – Bendix 30/3, 2012 at 8:57

@Oil Charlesworth: To avoid lookup mechanism only I need this algorith :) – Bendix 30/3, 2012 at 9:38

@taher: But this algorithm doesn't exist... – Ferrous 30/3, 2012 at 10:4

As other posters have stated, you cannot have a function that shortens arbitrary strings, that's mathematically impossible. But you can create a custom function that works well with your particular set of strings.

An example approach would be to compute the character frequency in the set, then just code the characters with a prefix code such that the most frequent letters are encoded with short prefixes (i.e. Huffman coding.)

The approach above does not take advantage of the fact that in natural language the next character can be quite accurately predicted from previous ones, so you can extend the above algorithm so that instead of encoding the characters independently, it encodes the next character in an n-gram. This of course requires a larger compression table than the simple approach, since you are effectively having a separate code depending on the prefix. For example if 'e' is very frequent after 'th' then 'e' after 'th' is encoded with a very short prefix. If 'e' is very unfrequent after 'ee', then it can be encoded with a very long prefix in this case. The decoding algorithm obviously needs to look at the currently decompressed prefix to check how to decode the next character.

This general approach assumes that the frequencies do not change, or at least change slowly. If your data set changes than you might have to recompute the statistics and re-code the strings.

Spinel answered 30/3, 2012 at 13:13 Comment(5)

I doubt that this will work well for short input data. It also seems that the OP wants a fixed-length encoding, which clearly is impossible. – Ferrous 30/3, 2012 at 13:21

@OliCharlesworth On the contrary, this kind of statistical encoding works well even for single character strings, save for the fact that even if the resulting code is say, 6 bits, then you still have to send (or save) at least one byte. I agree that fixed-length encoding is impossible. – Judicial 30/3, 2012 at 13:48

Ok, in my original question I asked that my input strings can be of variable length. So, suppose that I make them of fixed length by applying padding, i.e --> New York [becomes] --> New York!@!!@! or something like that. Is it possible then to shorten them after encoding? – Bendix 2/4, 2012 at 6:10

@taher Oli was referring to the length of the string after encoding. The pigeonhole principle states that the only way to guarantee that the final string is fixed is to restrict the set of input strings (so that its size is no larger than the number of the fixed length strings.) The only practical way to do this (for an arbitrary set) is to use a DB, just like URL shorteners do. The best you can do without a DB is to use a compression algorithm tuned to your data. This can achieve very good compression - no fixed size output though. – Judicial 2/4, 2012 at 7:23

Thanks guys, I kind of got my answer. – Bendix 4/4, 2012 at 9:51

See my answer to similar question, and just rewrite it to PHP:

Encoding:

$encoded = base64_encode(gzdeflate("São Paulo, Lisboa"))

Decoding:

$decoded = gzinflate(base64_decode($encoded))

Note that gzdeflate performs better than gzcompress on short strings.

But anyway the problem with this is that for short strings it makes the string longer. This performs better on longer texts. It would be of course better to use some compression algorithm with a-priori information, like ppm or suffix method with initial suffix tree... then it would work perfectly on short strings also.

Benzel answered 30/3, 2012 at 8:39 Comment(1)

It would be of course better to use some compression algorithm with a-priori information, like ppm or suffix method with initial suffix tree... then it would work perfectly on short strings also. But then the question is if these methods are accessible within PHP. – Benzel 30/3, 2012 at 10:7

You cannot shorten arbitrary length strings to fixed length one.

What you can do is to create those short strings for the unique ID of the row of that specific string in the database. Here are some tips: How to design a sequential hash-like function.

Indue answered 30/3, 2012 at 8:47 Comment(0)

This isn't necessarily deterministic, but obviously you could use a lookup table. The service would be similar to goo.gl or imgur

Judyjudye answered 12/3, 2014 at 23:38 Comment(0)

Recommended topics

Hot tags