String encoding of primitive types preserving lexicographic order
Asked Answered
C

4

9

Does anyone know of a library for encoding a number of primitive types (like integers, floats, strings, etc) into a string but preserving the lexicographical order of the types?

Ideally, I'm looking for a C++ library, but other languages are fine too. Also, one can assume that the format does not need to be encoded in the string itself (that is, if it's int64/string/float then the encoded string does not need to encode this information, only encoding the data is enough).

Charissa answered 19/11, 2009 at 3:31 Comment(3)
Could you clarify what you want?Prorogue
What do you mean by lexicographic order with respect to integers and floats? Their lexicographic sorting depends on how you encode them, e.g. binary, octal, decimal, hex etc. (assuming leading digits removed) all will give different lexicographical sorts for a given list of numbers.Floppy
By lexicographical order I mean, the original order of the primitive types (not the string, obviously). Say, encode "(a, b, c)" into a string "s", such that "(a, b, c) < (a', b', c')" implies that "s < s'" for all a, b, c.Charissa
S
10

Take a look at this paper ("Efficient Lexicographic Encoding of Numbers") which shows how to represent any numeric type as a string such the lexicographic order of the strings is the same as the numerical order of the underlying numbers. It copes with arbitrary length numbers.

http://www.zanopha.com/docs/elen.pdf

Secretory answered 29/7, 2010 at 19:48 Comment(2)
Interesting... I'm taking a look at the paper.Charissa
Just implemented this. Works provided a minor modification. The ASCII '+' character has integer value 43, which is lower and '0' (integer value 48). This provides incorrect sorting semantics. By using a character that is higher up in the ASCII plane, like '=' (integer value 61) gives correct results even when comparing strings with a different number of prefix characters.Indemnity
C
3

I had the problem of converting integers and longs to strings which preserve ordering. And since I was working in Java, I only had signed types.

My algorithm was very simple:

  1. Flip the sign bit (toEncode ^ Long.MAX_VALUE for longs) otherwise negative numbers are greater than positive numbers.
  2. Do a modified base64 encoding of the bytes. Unfortunately, the normal base64 encoding doesn't preserve ordering; the special characters (+ and /) are after the numbers which are after the characters. This is completely backwards from ASCII. My modified encoding simply uses the ASCII ordering. (To make it clear it wasn't normal base64, I changed the special chars to - and _ with ~ as the padding. These are still useable within an URL, which was another a constraint I had.)
Congruity answered 7/7, 2012 at 23:44 Comment(3)
Is your modified Base 64 encoder part of any open-source Java library?Temperature
@Temperature Looks like codeproject.com/Articles/5165340/Sortable-Base64-EncodingEyeleteer
Modifying the character mapping such that it follows ASCII ordering for docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/… did not work for me but doing it for guava.dev/releases/17.0/api/docs/com/google/common/io/… worked and preserved the order.Endocarditis
P
2

BTW ... In Amazon Web Service's SimpleDB, all data are stored as strings. Its select comparators use lexicographic ordering. AWS provides utility functions to encode various types. For example, integers are encoded knowing the range of the integers apriori and adjusting via zero-padding and offsets (e.g. for negative integers). You could of course give it the worst possible range.

See "Query 201: Tips and Tricks for Amazon SimpleDB Query" - http://aws.amazon.com/articles/1232

http://typica.s3.amazonaws.com/com/xerox/amazonws/sdb/DataUtils.html

Pernick answered 19/2, 2013 at 3:33 Comment(0)
D
-1

Just write numeric values in a fixed column width with leading zeros, and strings as normal. So like this:

0.1 -> 0000000.1000000
123 -> 0000123.0000000
foo -> foo
X   -> X

Then you can sort as text (e.g. Unix sort without -n). How about that?

Daddylonglegs answered 19/11, 2009 at 13:2 Comment(2)
I would like to avoid encoding numbers in fixed width. Also, encoding strings as themselves won't work give the right sorting order if the string has the same character that you're using as a separator.Charissa
Then write your own sort routine.Daddylonglegs

© 2022 - 2024 — McMap. All rights reserved.