Efficient algorithm to find all "character-equal" strings?
Asked Answered
H

1

5

How can we write an efficient function that outputs "homoglyph equivalents" of an input string?

Example 1 (pseudo-code):

homoglyphs_list = [
                     ["o", "0"], // "o" and "0" are homoglyphs
                     ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs
                  ]

input_string    = "someinput"
output          = [
                   "someinput", "s0meinput", "somelnput",
                   "s0melnput", "some1nput", "s0me1nput"
                  ]

Example 2:

homoglyphs_list = [
                     ["rn", "m", "nn"],
                  ]

input_string    = "rnn"
output          = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"]

Example 3:

homoglyphs_list = [
                     ["d", "ci", "a"], // "o" and "0" are homoglyphs
                     ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs
                  ]
                  /*
                     notice that with the two rules above,
                     we can infer "d" = "ci" = "a" = "cl" = "c1"
                  */

input_string    = "pacerier"
output          = [
                   "pacerier",  "pacerler",  "pacer1er",  "pcicerier",
                   "pcicerler", "pcicer1er", "pclcerier", "pc1cerier",
                   "pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er",
                   "pdcerier",  "pdcerler",  "pdcer1er"
                  ]

Note: The order of the members in the output array isn't important, and we can assume that the given homoglyph mappings are assumed to be proper (inputs wouldn't give us an "infinite loop").

My current algorithm works, but it's using raw bruteforcing and performance is awful. E.g. an input of "mmmmm" with homoglyphs ["rn", "m", "nn"] takes 38 seconds to run:

// This is php code (non-pseudo just so we could test the running time), 
// but the question remains language-agnostic

public function Func($in, Array $mappings){
    $out_table = array();
    $out_table[$in] = null;
    while(true){
        $number_of_entries_so_far = count($out_table);
        foreach(array_keys($out_table) as $key){
            foreach($mappings as $mapping){
                foreach($mapping as $value){
                    for($a=0, $aa=strlen($key); $a<$aa; ++$a){
                        $pos = strpos($key, $value, $a);
                        if($pos === false){
                            continue;
                        }
                        foreach($mapping as $equal_value){
                            if($value === $equal_value){
                                continue;
                            }
                            $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null;
                        }
                    }

                }
            }
        }
        if($number_of_entries_so_far === count($out_table)){
            // it means we have tried bruteforcing but added no additional entries,
            // so we can quit now
            break;
        }
    }
    return array_keys($out_table);
}

How can we implement an efficient (fast) homoglyph expansion algorithm?

Hang answered 5/8, 2013 at 13:57 Comment(4)
And what it will do if I write like $homoglyph_mappings[0] = array("n", "nn" , "nnn"); ??Bardwell
@Rajan, As stated, we can assume the inputs are proper (i.e. upon improper inputs, we get undefined behavior). Your example is a case of improper input because it would give us an infinite loop. If n = nn, then nn = nnnn, then nnnn = nnnnnnnn, then nnnnnnnn = nnnnnnnnnnnnnnnn, etc, etc, infinitely...Hang
Okay so It is an exceptional case.Bardwell
@Rajan, Yep, it's easy to spot infringers though, a candidate "x" cannot be declared equivalent to another candidate which contains "x".Hang
L
1

You should gain some performance out of a recursive implementation, I haven't put much thought into actual performance though. This would keep from looping though the output string multiple times and counting the output every iteration. Also I've added a little "cache" to prevent from calculating the same homoglyph twice, for simplicity it does not check against the mappings, but you could implement it to, or simply clear the cache before every call where mappings change(but that's ugly and easily introduces bugs).

Code looks like:

cache = {}
def homoglyph(input_string, mappings):
    output = []
    character = input_string[0]
    if input_string in cache:
        return cache[input_string]
    elif not input_string:
        return []
    for charset in mappings:
        if character in charset:
            temp = input_string
            sub_homoglyphs = homoglyph(input_string[1:], mappings)
            for char in charset:
                temp[0] = char
                output.append(temp)
                #adding the homoglyph character to the rest of the string
                for subhomoglyph in sub_homoglyphs:
                    output.append(char+subhomoglyph)
    cache[input_string] = output
    return output

(This is written with python as I'm not well versed in php syntax, I haven't actually run this so there may be errors but the gist of the logic is there)

Lippi answered 20/8, 2013 at 20:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.