Return a new string that sorts between two given strings
Asked Answered
A

14

46

Given two strings a and b, where a is lexicographically < b, I'd like to return a string c such that a < c < b. The use case is inserting a node in a database sorted by such keys. You can specify the format for a, b, and c if you like, as long as it is possible to generate initial values as well as new values on insert.

Is there a practical algorithm for this?

Androecium answered 12/8, 2016 at 17:20 Comment(6)
you might want to define "lexicographically <" first, this question really hinges on that definition!Kaykaya
for example, if a < ax, ax < b, then appending a single char would be a trivial solutionKaykaya
Thanks Marcus. Then How would I insert a new node between a and ax? I'm looking for something that would keep working on future insertions as well.Androecium
I mean the common meaning of lexicograhically <. Trivial solutions welcome!Androecium
Do you want to limit the length of the strings (I would think so in practice)? Then you can just enumerate them all, so using strings is no different than using integers as keys. If you already use 10 and 20 as keys, there are only 9 options in between. If you keep inserting new keys between two values, you will at some point run out of keys, unless you allow infinite length keys.Depressomotor
Infinite length keys are allowed.Androecium
C
83

Minimising string length

If you want to keep the string lengths to a minimum, you could create a string that is lexicographically halfway between the left and right strings, so that there is room to insert additional strings, and only create a longer string if absolutely necessary.

I will assume an alphabet [a-z], and a lexicographical ordering where an empty space comes before 'a', so that e.g. "ab" comes before "abc".

Basic case

You start by copying the characters from the beginning of the strings, until you encounter the first difference, which could be either two different characters, or the end of the left string:

abcde ~ abchi  ->  abc  +  d ~ h  
abc   ~ abchi  ->  abc  +  _ ~ h  

The new string is then created by appending the character that is halfway in the alphabet between the left character (or the beginning of the alphabet) and the right character:

abcde ~ abchi  ->  abc  +  d ~ h  ->  abcf  
abc   ~ abchi  ->  abc  +  _ ~ h  ->  abcd  

Consecutive characters

If the two different characters are lexicographically consecutive, first copy the left character, and then append the character halfway between the next character from the left string and the end of the alphabet:

abhs ~ abit  ->  ab  +  h ~ i  ->  abh  +  s ~ _  ->  abhw
abh  ~ abit  ->  ab  +  h ~ i  ->  abh  +  _ ~ _  ->  abhn

If the next character(s) in the left string are one or more z's, then copy them and append the character halfway between the first non-z character and the end of the alphabet:

abhz   ~ abit  ->  ab  +  h ~ i  ->  abh  +  z ~ _  ->  abhz  +  _ ~ _  ->  abhzn  
abhzs  ~ abit  ->  ab  +  h ~ i  ->  abh  +  z ~ _  ->  abhz  +  s ~ _  ->  abhzw  
abhzz  ~ abit  ->  ab  +  h ~ i  ->  abh  +  z ~ _  ->  ... ->  abhzz  +  _ ~ _  ->  abhzzn

Right character is a or b

You should never create a string by appending an 'a' to the left string, because that would create two lexicographically consecutive strings, inbetween which no further strings could be added. The solution is to always append an additional character, halfway inbetween the beginning of the alphabet and the next character from the right string:

abc  ~ abcah   ->  abc  +  _ ~ a  ->  abca  +  _ ~ h  ->  abcad  
abc  ~ abcab   ->  abc  +  _ ~ a  ->  abca  +  _ ~ b  ->  abcaa  +  _ ~ _  ->  abcaan  
abc  ~ abcaah  ->  abc  +  _ ~ a  ->  abca  +  _ ~ a  ->  abcaa  +  _ ~ h  ->  abcaad  
abc  ~ abcb    ->  abc  +  _ ~ b  ->  abca  +  _ ~ _  ->  abcan

Code examples

Below is a code snippet which demonstrates the method. It's a bit fiddly because JavaScript, but not actually complicated. To generate a first string, call the function with two empty strings; this will generate the string "n". To insert a string before the leftmost or after the rightmost string, call the function with that string and an empty string.

function midString(prev, next) {
    var p, n, pos, str;
    for (pos = 0; p == n; pos++) {               // find leftmost non-matching character
        p = pos < prev.length ? prev.charCodeAt(pos) : 96;
        n = pos < next.length ? next.charCodeAt(pos) : 123;
    }
    str = prev.slice(0, pos - 1);                // copy identical part of string
    if (p == 96) {                               // prev string equals beginning of next
        while (n == 97) {                        // next character is 'a'
            n = pos < next.length ? next.charCodeAt(pos++) : 123;  // get char from next
            str += 'a';                          // insert an 'a' to match the 'a'
        }
        if (n == 98) {                           // next character is 'b'
            str += 'a';                          // insert an 'a' to match the 'b'
            n = 123;                             // set to end of alphabet
        }
    }
    else if (p + 1 == n) {                       // found consecutive characters
        str += String.fromCharCode(p);           // insert character from prev
        n = 123;                                 // set to end of alphabet
        while ((p = pos < prev.length ? prev.charCodeAt(pos++) : 96) == 122) {  // p='z'
            str += 'z';                          // insert 'z' to match 'z'
        }
    }
    return str + String.fromCharCode(Math.ceil((p + n) / 2)); // append middle character
}

var strings = ["", ""];
while (strings.length < 100) {
    var rnd = Math.floor(Math.random() * (strings.length - 1));
    strings.splice(rnd + 1, 0, midString(strings[rnd], strings[rnd + 1]));
    document.write(strings + "<br>");
}

Below is a straightforward translation into C. Call the function with empty null-terminated strings to generate the first string, or insert before the leftmost or after the rightmost string. The string buffer buf should be large enough to accomodate one extra character.

int midstring(const char *prev, const char *next, char *buf) {
    char p = 0, n = 0;
    int len = 0;
    while (p == n) {                                           // copy identical part
        p = prev[len] ? prev[len] : 'a' - 1;
        n = next[len] ? next[len] : 'z' + 1;
        if (p == n) buf[len++] = p;
    }
    if (p == 'a' - 1) {                                        // end of left string
        while (n == 'a') {                                     // handle a's
            buf[len++] = 'a';
            n = next[len] ? next[len] : 'z' + 1;
        }
        if (n == 'b') {                                        // handle b
            buf[len++] = 'a';
            n = 'z' + 1;
        }
    }
    else if (p + 1 == n) {                                     // consecutive characters
        n = 'z' + 1;
        buf[len++] = p;
        while ((p = prev[len] ? prev[len] : 'a' - 1) == 'z') { // handle z's
            buf[len++] = 'z';
        }
    }
    buf[len++] = n - (n - p) / 2;                              // append middle character
    buf[len] = '\0';
    return len;
}

Average string length

The best case is when the elements are inserted in random order. In practice, when generating 65,536 strings in pseudo-random order, the average string length is around 4.74 characters (the theoretical minimum, using every combination before moving to longer strings, would be 3.71).

The worst case is when inserting the elements in order, and always generating a new rightmost or leftmost string; this will lead to a recurring pattern:

n, u, x, z, zn, zu, zx, zz, zzn, zzu, zzx, zzz, zzzn, zzzu, zzzx, zzzz...  
n, g, d, b, an, ag, ad, ab, aan, aag, aad, aab, aaan, aaag, aaad, aaab...  

with an extra character being added after every fourth string.


If you have an existing ordered list for which you want to generate keys, generate lexicographically equally-spaced keys with an algorithm like the one below, and then use the algorithm described above to generate a new key when inserting a new element.

The code checks how many charactes are needed, how many different characters are needed for the least significant digit, and then switches between two selections from the alphabet to get the right number of keys. E.g. keys with two character can have 676 different values, so if you ask for 1600 keys, that is 1.37 extra keys per two-character combination, so after each two-character key an additional one ('n') or two ('j','r') characters are appended, i.e.: aan ab abj abr ac acn ad adn ae aej aer af afn ... (skipping the initial 'aa').

function seqString(num) {
    var chars = Math.floor(Math.log(num) / Math.log(26)) + 1;
    var prev = Math.pow(26, chars - 1);
    var ratio = chars > 1 ? (num + 1 - prev) / prev : num;
    var part = Math.floor(ratio);
    var alpha = [partialAlphabet(part), partialAlphabet(part + 1)];
    var leap_step = ratio % 1, leap_total = 0.5;
    var first = true;
    var strings = [];
    generateStrings(chars - 1, "");
    return strings;

    function generateStrings(full, str) {
        if (full) {
            for (var i = 0; i < 26; i++) {
                generateStrings(full - 1, str + String.fromCharCode(97 + i));
            }
        }
        else {
            if (!first) strings.push(stripTrailingAs(str));
            else first = false;
            var leap = Math.floor(leap_total += leap_step);
            leap_total %= 1;
            for (var i = 0; i < part + leap; i++) {
                strings.push(str + alpha[leap][i]);
            }
        }
    }
    function stripTrailingAs(str) {
        var last = str.length - 1;
        while (str.charAt(last) == 'a') --last;
        return str.slice(0, last + 1);
    }
    function partialAlphabet(num) {
        var magic = [0, 4096, 65792, 528416, 1081872, 2167048, 2376776, 4756004,
                     4794660, 5411476, 9775442, 11097386, 11184810, 22369621];
        var bits = num < 13 ? magic[num] : 33554431 - magic[25 - num];
        var chars = [];
        for (var i = 1; i < 26; i++, bits >>= 1) {
            if (bits & 1) chars.push(String.fromCharCode(97 + i));
        }
        return chars;
    }

}
document.write(seqString(1600).join(' '));
Cuspidation answered 12/8, 2016 at 22:15 Comment(22)
It's quite hard to follow all the rules here.. can you write code that implements what you're suggesting, so it's clearer if there are missing cases? Some cases where I couldn't tell what happens: ("a", "b"), ("a", "ab"), ("a", "aaa"), ("a", "aaaaa").Sabinesabino
@PaulHankin I'll try to clarify the answer. In the meantime: "a"+"b"="am", "a"+"ab"="aam". A case like "a"+"aaa", where the second string equals the first plus one or more a's added, should be avoided.Patrilineage
Nice answer! This is more or less what I had in mind as a generalization of my ab-only answer but I was afraid that would become a lengthy story. Seems that gut feeling was right :)Depressomotor
@VincentvanderWeele Thanks. I'm looking into distilling all the cases into a few rules, and it's coming along nicely. I may add a code example tomorrow.Patrilineage
This is so cool! Generating 10k strings in random order, the average length is 3.87 characters & max is 7. In the pathological case, inserting 10k strings always at the beginning, the max string length is 2500. Can one express the idea where you know you’re going to generate N strings sequentially (one after the other), and after those N, further insertions can be assumed to be in a random position? That way, if N=5000 (you first generate 5k strings "sequentially") and then generate another 5000 strings, now each randomly between the original 5k, you still get a max length of ~7?Paleography
In response to my previous comment, I guess you could run the algorithm with random order 5000 times and use the resulting list as your “sequential” list. You could maybe do better (in terms of shorter keys and better potential for shorter keys down the road) using something more sophisticated than this but this would be a great start.Paleography
@AhmedFasih Yes, if you're converting an already ordered list, it's probably best to generate evenly spread-out keys for the data you already have, and then use the method above when inserting additional keys. (Note: if you generate keys with another method first, it's important to make sure no keys end in 'a'; otherwise, you might create a situation where no - or only a limited number of - new keys can be generated to fit into a certain gap.)Patrilineage
😍! Is there a way, for a given N, to space out N “equidistant” strings such that subsequent random insertions will minimize the max key length? I mean, any way better than randomly calling your function N times like you do in your example snippet—or generating N UUIDs?Paleography
@AhmedFasih You could treat the keys as numbers in base-26; e.g. if you needed between 676 and 17576 keys, use 3 characters, and convert the numbers 1, 2, 3 ... x (17576 / N) to base-26. (Again, avoiding 'a' at the end of keys.)Patrilineage
@AhmedFasih I added a javaScript code example for an equidistant string generator, which is optimised to work with the midString() algorithm.Patrilineage
I made a library that uses your basic idea (alternative non-base-10 number systems) and wrote a library around it: github.com/fasiha/mudderjs it was a really fun project!Paleography
@AhmedFasih Wow, that's quite a project! The documentation alone is like a novel :-)Patrilineage
@m69''snarkyandunwelcoming'' I am thinking of using midString() js in a trello style card and list app like the following: The first card gets index "c". The second card gets index midstring("c", ""). When a user reorders a card, the card's index changes to midString(aboveCard.index, belowCard.index). Is there any way, that after a million reorder's, (each index string is very long) now, that hypothetically a midString insert could cause a card to be inserted in the wrong position, there is no lexicographically middle string for two specific card indexes?Stichomythia
@Redline The algorithm is specifically designed to always make further insertions possible, so you should never run into the problem you describe. There is however the issue of the strings potentially becoming very long after a while, especially if the user repeats the same reordering (e.g. moving cards to the first or last position) many times, or if you always add new cards in last position and the user usually keeps them there. When you notice the strings becoming to long, you could generate new strings for all the cards using something like the last code example in my answer.Patrilineage
I've made a simple C# port of m69's algorithm: nuget.org/packages/StringBetweenCiro
I suspect the "right character is a or b" case can be removed if the "basic case" zero-extends (or, in this case, 'a'-extends) the left string.Doggett
Additionally, in the "consecutive characters" case, if the right string does not terminate at the character, just using that character will probably be simpler (i.e. use the fact that 0.xxx < 1 < 1.yyy). Assumes the right string doesn't end in 0s, but if it does, things go wrong eventually anyway.Doggett
I believe I have found a bug in midString, in the process of porting it to Python and running tests with the Hypothesis test framework. If you give in the inputs "a" and "aaa", it returns "aaan", which sorts AFTER the second argument instead of before it. I verified that this happens in the Javascript implementation given here as well. It should be able to return a result like "an" but I haven't yet figured out the fix.Janae
@ChristopherArmstrong As the answer explains, you should never use a string that ends with an 'a'. Both midString() and seqString() never return a string ending with an 'a'. So the situation of the input strings being 'a' and 'aaa' should never arise. (When you use strings generated by another method, you should make sure they follow this rule.)Patrilineage
An equally simple one-to-one port to Golang: go.dev/play/p/9fu3o11MbeVUropygium
@m69''snarkyandunwelcoming'' how did you even think of this, likely even long before the jira lexorank version, and what is your approach to problem solving and algorithms more generally? Did you do something like first write down the problem constraints, then create a basic working version through trial and error, then consider edge cases like adding an a, then slowly updating your code to handle them? What was the process here? And also, that is the most advanced string traversal I've ever seen. it's been awhile since I saw a while loop and this code took more than a few hours to understandLaidlaw
What does abcde ~ abchi -> abc + d ~ h mean in the first example?Twospot
D
5

This is a very simple way to achieve this and probably far from optimal (depending on what you call optimal of course).

I use only a and b. I suppose you could generalise this to use more letters.

Two simple observations:

  1. Creating a new string that comes after another string is easy: just append one or more letters. E.g., abba < abbab.
  2. Creating a new string that comes before another string x is only always guaranteed to be possible if x ends with b. Now, replace that b by an a and append one or more letters. E.g., abbab > abbaab.

The algorithm is now very simple. Start with a and b as sentinels. Inserting a new key between two existing keys x and y:

  • If x is a prefix of y: the new key is y with the ending b replaced by ab.
  • If x is not a prefix of y: the new key is x with a b appended.

Example run:

a, b
a, ab*, b
a, aab*, ab, b
a, aab, ab, abb*, b
a, aab, ab, abab*, abb, b
a, aaab*, aab, ab, abab, abb, b
Depressomotor answered 12/8, 2016 at 18:11 Comment(4)
"aa" is between "a" and "aaa", but your answer suggests this is impossible.Sabinesabino
@PaulHankin Using "aa" or "aaa" means painting yourself into a corner, because "a", "aa", "aaa" ... are lexicographically consecutive, so you can't insert anything between them later.Patrilineage
@PaulHankin the question is not: please tell me where to insert aa, the question is to generate a new key in between two existing keys. Every key this algorithm generates starts with an a and ends with a b, for the reason mentioned by @m69Depressomotor
This is the most general solution because any string can be converted theoretically into a binary number.Codel
N
3

Here is an equivalent function to m69's answer implemented directly in my PostgreSQL database, with PL/pgSQL:

create or replace function app_public.mid_string(prev text, next text) returns text as $$
declare
  v_p int;
  v_n int;
  v_pos int := 0;
  v_str text;
begin
  LOOP -- find leftmost non-matching character
    v_p := CASE WHEN v_pos < char_length(prev) THEN ascii(substring(prev from v_pos + 1)) ELSE 96 END;
    v_n := CASE WHEN v_pos < char_length(next) THEN ascii(substring(next from v_pos + 1)) ELSE 123 END;
    v_pos := v_pos + 1;
    EXIT WHEN NOT (v_p = v_n);
  END LOOP;
  v_str := left(prev, v_pos-1);   -- copy identical part of string
  IF v_p = 96 THEN                -- prev string equals beginning of next
    WHILE v_n = 97 LOOP           -- next character is 'a'
      -- get char from next
      v_n = CASE WHEN v_pos < char_length(next) THEN ascii(substring(next from v_pos + 1)) ELSE 123 END;
      v_str := v_str || 'a';      -- insert an 'a' to match the 'a'
      v_pos := v_pos + 1;
    END LOOP;
    IF v_n = 98 THEN              -- next character is 'b'
      v_str := v_str || 'a';      -- insert an 'a' to match the 'b'
      v_n := 123;                 -- set to end of alphabet
    END IF;
  ELSIF (v_p + 1) = v_n THEN    -- found consecutive characters
    v_str := v_str || chr(v_p); -- insert character from prev
    v_n = 123;                  -- set to end of alphabet
    v_p := CASE WHEN v_pos < char_length(prev) THEN ascii(substring(prev from v_pos + 1)) ELSE 96 END;
    WHILE v_p = 122 LOOP
      v_pos := v_pos + 1;
      v_str := v_str || 'z';    -- insert 'z' to match 'z'
      v_p := CASE WHEN v_pos < char_length(prev) THEN ascii(substring(prev from v_pos + 1)) ELSE 96 END;
    END LOOP;
  END IF;
  return v_str || chr(ceil((v_p + v_n) / 2.0)::int);
end;
$$ language plpgsql strict volatile;

Tested with this function:

create or replace function app_public.test() returns text[] as $$
declare
  v_strings text[];
  v_rnd int;
begin
  v_strings := array_append(v_strings, app_public.mid_string('', ''));

  FOR counter IN 1..100 LOOP
    v_strings := v_strings || app_public.mid_string(v_strings[counter], '');
  END LOOP;
  return v_strings;
end;
$$ language plpgsql strict volatile;

Which results in:

"strings": [
  "n",
  "u",
  "x",
  "z",
  "zn",
  "zu",
  "zx",
  "zz",
  "zzn",
  "zzu",
  "zzx",
  "zzz",
  "zzzn",
  "zzzu",
  "zzzx",
  "zzzz",
  "...etc...",
  "zzzzzzzzzzzzzzzzzzzzzzzzn",
  "zzzzzzzzzzzzzzzzzzzzzzzzu",
  "zzzzzzzzzzzzzzzzzzzzzzzzx",
  "zzzzzzzzzzzzzzzzzzzzzzzzz",
  "zzzzzzzzzzzzzzzzzzzzzzzzzn"
]
Nematode answered 23/6, 2020 at 19:25 Comment(0)
L
2

Just in case someone need it. Here is the same algorithm in Kotlin. It works but can probably be written in a better way.

    fun midString(prev: String?, next: String?): String {
        val localPrev = prev ?: ""
        val localNext = next ?: ""
    
        var p: Int
        var n: Int
        var str: String
    
        // Find leftmost non-matching character
        var pos = 0
        do {
            p = if (pos < localPrev.length) localPrev[pos].toInt() else 96
            n = if (pos < localNext.length) localNext[pos].toInt() else 123
            pos++
        } while (p == n)
    
        str = localPrev.substring(0, pos - 1)           // Copy identical part of string
        if (p == 96) {                                  // Prev string equals beginning of next
            while (n == 97) {                           // Next character is 'a'
                n = if (pos < localNext.length) localNext[pos++].toInt() else 123 // Get char from next
                str += 'a'                              // Insert an 'a' to match the 'a'
            }
            if (n == 98) {                              // Next character is 'b'
                str += 'a'                              // Insert an 'a' to match the 'b'
                n = 123                                 // Set to end of alphabet
            }
        }
        else if (p + 1 == n) {                          // Found consecutive characters
            str += p.toChar()                           // Insert character from prev
            n = 123                                     // Set to end of alphabet
    
            p = if (pos < localPrev.length) localPrev[pos++].toInt() else 96
            while (p == 122) { // p='z'
                str += 'z'                              // Insert 'z' to match 'z'
                p = if (pos < localPrev.length) localPrev[pos++].toInt() else 96
            }
        }
        return str + ceil((p + n) / 2.0).toChar()   // Append middle character
    }
Landseer answered 4/8, 2021 at 13:57 Comment(0)
A
2

same algorithm in python:


def between(prev: str, next: str):
    """
    """
    asize = len(prev)
    bsize = len(next)
    buf = []
    p = 0
    n = 0
    idx = 0
    while (p == n):
        p = ord(prev[idx]) if idx < asize else 96  # 96 = 'a' - 1
        n = ord(next[idx]) if idx < bsize else 123  # 123 = 'z' + 1
        if (p == n):
            buf.append(chr(p))
            idx += 1

    if p == 96:
        while n == 97:
            buf.append('a')
            idx += 1
            n = ord(next[idx]) if idx < bsize else 123  # 123 = 'z' + 1
        if n == 98:
            buf.append('a')
            idx += 1
            n = 123
    elif (p + 1) == n:
        n = 123
        buf.append(chr(p))
        idx += 1
        p = ord(prev[idx]) if idx < asize else 96
        while p == 122:
            buf.append('z')
            idx += 1
            p = ord(prev[idx]) if idx < asize else 96
    buf.append(chr(n - (n - p) // 2))
    return ''.join(buf)


if __name__ == '__main__':
    lis = []
    # print(between('n', ''))
    for i in range(50000):
        if len(lis) == 0:
            lis.append(between('', ''))
            continue
        nextid = random.randint(0, len(lis))
        if nextid == 0:
            lis.insert(nextid, between('', lis[nextid]))
        elif nextid == len(lis):
            # print(nextid, len(lis), lis)
            lis.insert(nextid, between(lis[nextid - 1], ''))
        else:
            lis.insert(nextid, between(lis[nextid - 1], lis[nextid]))

Alfred answered 14/5, 2022 at 1:25 Comment(0)
C
2

Same algorithm in Dart. Also I replaced all the numbers by constants.

const startAlphabet = 97; // = 'a'.codeUnits[0];
const beforeAlphabet = startAlphabet - 1;
const endAlphabet = 122; // = 'z'.codeUnits[0];
const behindAlphabet = endAlphabet + 1;

String midString(String? prevStr, String? nextStr) {
  final prev = (prevStr ?? '').codeUnits;
  final next = (nextStr ?? '').codeUnits;

  var pos = 0, p = 0, n = 0;
  
  for (pos = 0; p == n; pos++) {
    // find leftmost non-matching character
    p = pos < prev.length ? prev[pos] : beforeAlphabet;
    n = pos < next.length ? next[pos] : behindAlphabet;
  }
  final str = prev.take(pos - 1).toList(); // copy identical part of string
  if (p == beforeAlphabet) {
    // prev string equals beginning of next
    while (n == startAlphabet) {
      // next character is 'a'
      n = pos < next.length
          ? next[pos++]
          : behindAlphabet; // get char from next
      str.add(startAlphabet); // insert an 'a' to match the 'a'
    }
    if (n == startAlphabet + 1) {
      // next character is 'b'
      str.add(startAlphabet); // insert an 'a' to match the 'b'
      n = behindAlphabet;
    }
  } else if (p + 1 == n) {
    // found consecutive characters
    str.add(p); // insert character from prev
    n = behindAlphabet;
    // while p='z'
    while (
        (p = pos < prev.length ? prev[pos++] : beforeAlphabet) == endAlphabet) {
      str.add(endAlphabet); // insert 'z' to match 'z'
    }
  }
  str.add(((p + n) / 2).ceil()); // append middle character
  return String.fromCharCodes(str);
}

And here is the test:

import 'dart:math';

void main() {
  final rand = Random();
  final strings = <String?>[null, null];
  while (strings.length < 100) {
      final rnd = rand.nextInt(strings.length - 1);
      strings.insert(rnd + 1, midString(strings[rnd], strings[rnd + 1]));
      print(strings);
  }
}
Cyler answered 12/10, 2022 at 9:14 Comment(0)
D
1

A simplification/modification of @m69's answer.

Assuming the strings don't end with "zero" (which is necessary in general because there are only a finite number of strings between s and some zero-extension of s), the strings are isomorphic to numbers in [0, 1). So I'll talk in terms of decimal, but the same principles apply for an arbitrary alphabet.

We can zero-extend the left string (0.123 = 0.123000...) and 9-extend the right string (0.123 = 0.122999...), which leads naturally to

// Copyright 2021 Google LLC.
// SPDX-License-Identifier: Apache-2.0
template <typename Str, typename Digit>
Str midpoint(const Str left, const Str right, Digit zero, Digit nine) {
  Str mid;
  for (auto i = left.size() - left.size();; ++i) {
    Digit l = i < left.size() ? left[i] : zero;
    Digit r = i < right.size() ? right[i] : nine;
    if (i == right.size() - 1) --r;
    // This is mid += (l + r + 1)/2
    // without needing Digit to be wider than nine.
    r -= l;
    mid += l + r/2 + (r&1);
    if (mid.back() != l) break;
  }
  return mid;
}
Doggett answered 15/6, 2021 at 8:33 Comment(0)
I
1

F# implementation of algorithm provided by m69 ''snarky and unwelcoming'':

/// Returns a string that sorts 'midway' between the provided strings
/// to allow ordering of a list of items.
/// Pass None for one or both of the strings, as the case may be, to
/// sort before or after a single item, or if it is the first item in the list.
let midString (s1O : Option<string>) (s2O : Option<string>) =
  let firstSymbol = 'a' |> int
  let lastSymbol = 'z' |> int
  let middleSymbol = (firstSymbol + lastSymbol + 1) / 2

  let halfwayToFirstFrom c = (firstSymbol + c) / 2
  let halfwayToLastFrom c = (c + lastSymbol + 1) / 2
  let halfwayBetween c1 c2 = (c1 + c2 + 1) / 2

  let stringToIntList = Seq.toList >> List.map int
  let reverseAndMakeString = List.map char >> Seq.rev >> System.String.Concat

  let rec inner acc l1 l2 =
    match l1, l2 with
    | head1::tail1, head2::tail2 ->
        if head1 = head2 then inner (head1::acc) tail1 tail2          // keep looking for first difference
        elif head2 - head1 = 1 then inner (head1::acc) tail1 []       // tail2 no longer relevant, already sorting before it
        elif head2 - head1 > 1 then (halfwayBetween head1 head2)::acc // done
        else failwith "unreachable"
    | head1::tail1, [] ->                   // find the correct string to sort after s1 (already sorting before s2)
        if head1 = lastSymbol then
          inner (head1::acc) tail1 []       // already on last character in alphabet at this position, move to next position
        else (halfwayToLastFrom head1)::acc // suitable character is available - done.
    | [], head2::tail2 ->                              // strings were identical for whole of first string
        if halfwayToFirstFrom head2 = firstSymbol then
          inner (firstSymbol::acc) [] tail2            // no space in alphabet, move to next position
        else (halfwayToFirstFrom head2)::acc           // done.
    | [], [] -> middleSymbol::acc

  match s1O, s2O with
    | None, None -> [middleSymbol]
    | Some s1, Some s2 ->
        if s1 < s2 then inner [] (stringToIntList s1) (stringToIntList s2)
        else failwith "Invalid input - s1 must sort before s2"
    | Some s1, None -> inner [] (stringToIntList s1) (stringToIntList "")
    | None, Some s2 -> inner [] (stringToIntList "") (stringToIntList s2)
  |> reverseAndMakeString




/// Tests of examples provided above, and some extras.
let testsData = [
    (Some "abcde", "abcf"  , Some "abchi" )
    (Some "abc"  , "abcd"  , Some "abchi" )
    (Some "abhs" , "abhw"  , Some "abit"  )
    (Some "abh"  , "abhn"  , Some "abit"  )
    (Some "abhz" , "abhzn" , Some "abit"  )
    (Some "abhzs", "abhzw" , Some "abit"  )
    (Some "abhzz", "abhzzn", Some "abit"  )
    (Some "abc"  , "abcad" , Some "abcah" )
    (Some "abc"  , "abcaan", Some "abcab" )
    (Some "abc"  , "abcaad", Some "abcaah")
    (Some "abc"  , "abcan" , Some "abcb"  )
    (Some "abc"  , "n"     , None         )
    (Some "n"    , "t"     , None         )
    (Some "t"    , "w"     , None         )
    (Some "w"    , "y"     , None         )
    (Some "y"    , "z"     , None         )
    (Some "z"    , "zn"    , None         )
    (None        , "g"     , Some "n"     )
    (None        , "d"     , Some "g"     )
    (None        , "b"     , Some "d"     )
    (None        , "an"    , Some "b"     )
    (None        , "ag"    , Some "an"    )
    (None        , "ad"    , Some "ag"    )
    (None        , "ab"    , Some "ad"    )
    (None        , "aan"   , Some "ab"    )
     ]

testsData
|> List.map (fun (before, expected, after) ->
     let actual = midString before after
     printfn $"Before, after, expected, actual, pass:  {(before, after, expected, actual, actual = expected)}"
     actual = expected )


Immunize answered 8/10, 2021 at 10:12 Comment(0)
P
1

As I understand it, the format of the strings can be set arbitrarily. I would rather rely on the fact that, for decimal fractions (i.e. decimal numbers < 1), a < b applies equivalently for the numerical order and the lexicographic order. There is an order-preserving bijection between the strings and the numbers.

0
0.2
0.225
0.3
0.45
0.7
0.75
...

To insert a string between two existing strings while preserving the lexicographic order, we can just:

  1. Convert the strings to floating point numbers
  2. Add half the difference between the two numbers (or between the number and 1 or 0 if we want to append at the end or at the beginning, respectively)
  3. Convert the resulting floating point number to string

In Javascript:

function getLexicographicInsert(a, b) {
    const x = a ? parseFloat(a) : 0;
    const y = b ? parseFloat(b) : 1;
    return `${x + (y - x) / 2}`;
}
Paleozoic answered 6/12, 2021 at 15:1 Comment(1)
The "problem" is that this does not offer arbitrary/infinite precision -- if you keep inserting items, eventually you may reach a point where "finding the number in-between X and Y" is impossible due to the precision of Javascript's floating-point numbers. Strings do not have this problem (since they can be arbitrarily long).Selfmortification
M
1

Here is Java adaptation of algorithm above

/**
     * Generates string lexicographically ordered between two given strings.
     *
     * @param left - left string in lexicographical order
     * @param right - right string in lexicographical order
     * @return string of minimal length that is between left and right lexicographically
     */
    public String getBetween(String left, String right) {
        final char ALPHABET_LEFT = 'a' - 1;
        final char ALPHABET_RIGHT = 'z' + 1;
        char l = 'a';
        char r = 'a';
        int pos;
        StringBuilder str;
        for (pos = 0; l == r; pos++) {                                  // find leftmost non-matching character
            l = pos < left.length() ? left.charAt(pos) : ALPHABET_LEFT;
            r = pos < right.length() ? right.charAt(pos) : ALPHABET_RIGHT;
        }
        str = new StringBuilder(left.substring(0, pos - 1));            // copy identical part of string
        if (l == ALPHABET_LEFT) {                                       // prev string equals beginning of next
            while (r == 'a') {                                          // next character is 'a'
                r = pos < right.length() ? right.charAt(pos++) : ALPHABET_RIGHT;   // get char from next
                str.append('a');                                        // insert an 'a' to match the 'a'
            }
            if (r == 'b') {                                             // next character is 'b'
                str.append('a');                                        // insert an 'a' to match the 'b'
                r = ALPHABET_RIGHT;                                     // set to end of alphabet
            }
        }
        else if (l + 1 == r) {                                          // found consecutive characters
            str.append(l);                                              // insert character from prev
            r = ALPHABET_RIGHT;                                         // set to end of alphabet
            while ((l = pos < left.length() ? left.charAt(pos++) : ALPHABET_LEFT) == 'z') {  // p='z'
                str.append('z');                                        // insert 'z' to match 'z'
            }
        }
        return str.toString() + (char)(Math.ceil((l + r) / 2.0));       // append middle character
    }
Monteith answered 14/6, 2022 at 13:5 Comment(0)
U
1

In case anyone needs it, here's a working implementation of m69's algorithm in Go (playground link):

package main

import (
    "fmt"
    "math/rand"
    "strings"
)

func main() {
    generatedStrings := []string{"", ""}

    i := 0
    for {
        if i >= 100 {
            break
        }
        rnd := rand.Intn(len(generatedStrings) - 1)

        generatedStrings = append(
            generatedStrings[:(rnd+1)],
            append(
                []string{midString(generatedStrings[rnd], generatedStrings[rnd+1])},
                generatedStrings[(rnd+1):]...,
            )...,
        )

        fmt.Printf("[\"%s\"]\n", strings.Join(generatedStrings, "\", \""))

        i++
    }
}

func midString(previousString, nextString string) string {
    p := rune(0)
    n := rune(0)

    prev := []rune(previousString)
    next := []rune(nextString)

    var res []rune

    preA := 'a' - 1
    postZ := 'z' + 1

    for {
        if p != n {
            break
        }

        p = preA
        if len(prev) > 0 && len(res) < len(prev) && prev[len(res)] > 0 {
            p = prev[len(res)]
        }

        n = postZ
        if len(next) > 0 && len(res) < len(next) && next[len(res)] > 0 {
            n = next[len(res)]
        }

        if p == n {
            res = append(res, p)
        }
    }
    if p == preA {
        for {
            if n != 'a' {
                break
            }
            res = append(res, 'a')

            n = postZ
            if len(next) > 0 && len(res) < len(next) && next[len(res)] > 0 {
                n = next[len(res)]
            }
        }
        if n == 'b' {
            res = append(res, 'a')
            n = postZ
        }
    } else if n == p+1 {
        n = postZ
        res = append(res, p)
        for {
            p = preA
            if len(prev) > 0 && len(res) < len(prev) {
                p = prev[len(res)]
            }
            if p != 'z' {
                break
            }
            res = append(res, 'z')
        }
    }

    res = append(res, n-(n-p)/2)

    return string(res)
}
Uropygium answered 11/7, 2022 at 0:42 Comment(0)
R
1

Same algorithm in C#:

public string MidString(string prev, string next)
{
    int p = 0;
    int n = 0;
    int pos = 0;
    string str;

    // find leftmost non-matching character
    for (pos = 0; p == n; pos++)
    {
        p = pos < prev.Length ? prev[pos] : 96;
        n = pos < next.Length ? next[pos] : 123;
    }
    // copy identical part of string
    str = prev.Substring(0, pos - 1);
    // prev string equals beginning of next
    if (p == 96)
    {
        // next character is 'a'                           
        while (n == 97)
        {
            // get char from next
            n = pos < next.Length ? next[pos++] : 123;
            // insert an 'a' to match the 'a'
            str += 'a';
        }
        // next character is 'b'
        if (n == 98)
        {
            // insert an 'a' to match the 'b'
            str += 'a';
            // set to end of alphabet
            n = 123;
        }
    }
    // found consecutive characters
    else if (p + 1 == n)
    {
        // insert character from prev
        str += ((char)p);
        // set to end of alphabet
        n = 123;
        while ((p = pos < prev.Length ? prev[pos++] : 96) == 122)
        {
            // insert 'z' to match 'z'
            str += 'z';
        }
    }
    // append middle character
    return str + (char)(Math.Ceiling((p + n) / 2D));
}
Rework answered 2/11, 2022 at 9:52 Comment(0)
J
0

I write a PHP sample that works fine:

function charCodeAt($str, $index): int
{
    $utf16 = mb_convert_encoding($str, 'UTF-16LE', 'UTF-8');
    return ord($utf16[$index * 2]) + (ord($utf16[$index * 2 + 1]) << 8);
}

function midString($prev, $next): string
{
    $p = 0;
    $n = 0;
    // find leftmost non-matching character
    for ($pos = 0; $p == $n; $pos++) {
        $p = $pos < strlen($prev) ? charCodeAt($prev, $pos) : 96;
        $n = $pos < strlen($next) ? charCodeAt($next, $pos) : 123;
    }
    
    // copy identical part of string
    $str = substr($prev, 0, $pos - 1);

    // prev string equals beginning of next
    if ($p == 96) {
        // next character is 'a'
        while ($n == 97) {
            // get char from next
            $n = $pos < strlen($next) ? charCodeAt($next, $pos++) : 123;
            // insert an 'a' to match the 'a'
            $str .= 'a';
        }
        // next character is 'b'
        if ($n == 98) {
            // insert an 'a' to match the 'b'
            $str .= 'a';
            // set to end of alphabet
            $n = 123;
        }
    } else {
        // found consecutive characters
        if ($p + 1 == $n) {
            // insert character from prev
            $str .= chr($p);
            // set to end of alphabet
            $n = 123;
            while (($p = $pos < strlen($prev) ? charCodeAt($prev, $pos++) : 96) == 122) {
                // insert 'z' to match 'z'
                $str .= 'z';
            }
        }
    }
    return $str . chr(ceil(($p + $n) / 2));
}
midString('aaa', 'aaab'); // returns 'aaaan'
Jehias answered 24/5, 2023 at 10:46 Comment(0)
K
0

The rust crate: midstring

Just in case anyone needs the rust version, I have created a crate (based on the m69's answer) for easy use: https://crates.io/crates/midstring

Example:

cargo add midstring
// main.rs file

use midstring::mid_string;

fn main() {
    println!("_, i => {}", mid_string("", "i")); // -> e
    println!("_, b => {}", mid_string("", "b")); // -> an
    println!("_, _ => {}", mid_string("", "")); // -> n

    assert_eq!(mid_string("aaa", "aaz"), "aan");

    let left = String::from("abc");
    let right = "abcab".to_string();
    let should_be = String::from("abcaan");
    assert_eq!(mid_string(&left, &right), should_be);
}

The rust code

If you want to use the function directly in your code:

pub fn mid_string(prev: &str, next: &str) -> String {
    let prev_bytes = prev.to_string().as_bytes().to_vec();
    let next_bytes = next.to_string().as_bytes().to_vec();
    let buf_bytes = the_original_algorith_with_ascii_digits(prev_bytes, next_bytes);
    String::from_utf8(buf_bytes).unwrap()
}

/// Converted from the original code provided by "m69 snarky and unwelcoming"
fn the_original_algorith_with_ascii_digits(prev: Vec<u8>, next: Vec<u8>) -> Vec<u8> {
    // first add the null pointer at the end
    let mut prev = prev.clone();
    let mut next = next.clone();
    prev.push(0);
    next.push(0);
    let mut p: u8 = 0;
    let mut n: u8 = 0;
    let mut len: usize = 0;
    let mut buf: Vec<u8> = Vec::new();

    while p == n {
        // copy identical part
        p = if prev[len] != 0 { prev[len] } else { A - 1 };
        n = if next[len] != 0 { next[len] } else { Z + 1 };
        if p == n {
            buf.push(p);
            len += 1;
        }
    }

    if p == (A - 1) {
        // end of left string
        while n == A {
            // handle a's
            buf.push(A);
            len += 1;
            n = if next[len] != 0 { next[len] } else { Z + 1 };
        }
        if n == B {
            // handle b
            buf.push(A);
            len += 1;
            n = Z + 1;
        }
    } else if (p + 1) == n {
        // consecutive characters
        n = Z + 1;
        buf.push(p);
        len += 1;
        p = if prev[len] != 0 { prev[len] } else { A - 1 };
        let mut check: bool = p == Z;
        while check {
            // handle z's
            buf.push(Z);
            len += 1;
            p = if prev[len] != 0 { prev[len] } else { A - 1 };
            check = p == Z;
        }
    }
    let middle_char = n - (n - p) / 2; // append middle character
    buf.push(middle_char);
    // buf.push(0);
    buf
}

Kirstiekirstin answered 22/7 at 9:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.