How to match and highlight all terms in any order from an array of strings?

The requirements are:

  • Find strings from an array (from here on called options) that include ALL the terms in ANY order
  • Correctly highlight the matching terms - ie. insert a string before and after each matching term - I'm using <u> and </u> here
  • Both the search query and options can be "anything"

For the sake of simplicity answers can concentrate on case-insensitive search through a list including only ASCII characters and assume that the term delimiter is a plain space - ie. a query entered as "Foo bar baz" means the search terms are ['foo', 'bar', 'baz'].

To clarify:

  • All terms must exist separately from each other in matched options - ie. shorter terms should not exist only as substrings of longer terms and no two terms should overlap
  • Repeated search terms must exist in the option at least as many times as in the query

The final application is (perhaps not surprisingly) an autocomplete of sorts.
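
To make the matching rules concrete, here is roughly the behaviour I'm after (searchAndHighlight is a hypothetical function name, not any of the implementations below):

searchAndHighlight('foo oo', ['food', 'foo zoo']);
// => ['<u>foo</u> z<u>oo</u>']
// 'food' is NOT a match: its only 'oo' sits inside 'foo', so the terms would overlap

searchAndHighlight('e e', ['Peru', 'Greece']);
// => one result: 'Greece' with two of its three e's underlined (exactly which two may vary)
// 'Peru' is NOT a match: the repeated term 'e' occurs only once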

TL;DR

Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)

jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlight

At this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.

There is still a clear difference in performance between the algorithms in each search, but none of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer", though, it does consistently outperform the others in every realistic scenario I have tried.

Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.

Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.

Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.


Viral Permutations

In theory this can be solved by "manually" constructing a RegExp that contains every possible permutation of the search terms, with a capturing group between each pair of terms to catch anything in between - a search for "foo bar" results in (foo)(.*?)(bar)|(bar)(.*?)(foo).

The highlighting is then done in one pass with string.replace(). If there is any change in the string we have a match.
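
As a minimal sketch (my own illustration, stripped of the DOM plumbing in the demo below), the regex construction boils down to this, reusing the permutate_array helper from the demo:

function build_permutation_regex(terms_esc) {
    //Wrap each (already escaped) term in a capturing group, permute the groups,
    //join each permutation with lazy capturing groups and OR the permutations together
    var groups = permutate_array(terms_esc.map(function(t) { return '(' + t + ')'; }))
        .map(function(p) { return p.join('(.*?)'); });
    return new RegExp(groups.join('|'), 'i');
}

build_permutation_regex(['foo', 'bar']);
// => /(foo)(.*?)(bar)|(bar)(.*?)(foo)/i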

Demo:

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 
'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
    var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
    meta_elm = document.getElementById('viral_permutations_meta');
    res_elm = document.getElementById('viral_permutations_result');
    res_elm.innerHTML = meta_elm.innerHTML = '';
    
    t0 = performance.now();
    
    //Split query in terms at delimiter and lowercase them
    terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
        return n;
    }).map(function(term){
        return term.toLowerCase();
    });
    meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
        
    //Escape terms
    terms_esc = terms.map(function(term) {
        return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
    });
    
    //Wrap each escaped term in an individual capturing group
    terms_esc = terms_esc.map(function(term) {
        return '(' + term + ')';
    });
    
    //Find all permutations
    permuted = permutate_array(terms_esc);
    
    //Construct a group for each permutation
    match_groups = [];
    for (i = 0; i < permuted.length; i++) {
        match_groups.push(permuted[i].join('(.*?)'));
    }
    
    try {
        //Construct the final regex
        regex_string = match_groups.join('|');
        //Display it
        document.getElementById('viral_permutations_regex').innerHTML = regex_string;
        meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
        
        regex = new RegExp(regex_string, 'i');
    

        //Loop through each option
        for (i = 0; i < options.length; i++) {
            option = options[i];

            //Replace the terms with highlighted terms
            highlighted = option.replace(regex, viral_permutations_replace);

            //If anything was changed (or the query is empty) we have a match
            if (highlighted !== option || terms.length === 0) {
                //Append it to the result list
                li = document.createElement('li');
                li.innerHTML = highlighted;
                res_elm.appendChild(li);
            }
        }

        //Print some meta
        t1 = performance.now();
        meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
    } catch(e) {
        meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
    }
}

//The replacement function
function viral_permutations_replace() {
    var i, m, j, retval, m_cased, unmatched;
    retval = '';
    
    //Make a clone of the terms array (that we can modify without destroying the original)
    unmatched = terms.slice(0);
    
    //Loop the group arguments, skipping the first (the full match) and the last
    //(the full option); the numeric offset argument is skipped by the typeof check below
    for (i = 1; i < arguments.length - 1; i++) {
        m = arguments[i];
        
        //Check that we have a string - most of the arguments will be undefined
        if (typeof m !== 'string') continue;
        
        //Lowercase the match
        m_cased = m.toLowerCase();
        
        //Append it to the return value - highlighted if it is among our terms
        j = unmatched.indexOf(m_cased);
        if (j >= 0) {
            //Remove it from the unmatched terms array
            unmatched.splice(j, 1);
            retval += '<u>' + m + '</u>';
        } else {
            retval += m;
        }
    }
    return retval;
}

//Helper function to return all possible permutations of an array
function permutate_array(arr) {
    var perm, perm_intern;
    perm_intern = function(perm, pre, post, n) {
        var elem, i, rest;
        if (n > 0) {
            for (i = 0; i < post.length; i++) {
                rest = post.slice(0);
                elem = rest.splice(i, 1);
                perm_intern(perm, pre.concat(elem), rest, n - 1);
            }
        } else {
            perm.push(pre);
        }
    };
    perm = [];
    perm_intern(perm, [], arr, arr.length);
    return perm;
}
viral_permutations();
<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>

Thanks to trincot for pointing out that my original version would occasionally highlight a recurring term twice - which is fixed in this snippet.

Fails because:

  • The regex grows factorially as terms are added. At 7 terms (even single letters) it is past 250kb and my browser gives up with Error: regexp too big... (see the estimate below)
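
A rough back-of-the-envelope calculation (my own numbers): n terms give n! permutations, each repeating all n terms plus n - 1 connecting (.*?) groups. For 7 single-letter terms:

var permutations = 5040;             // 7! orderings
var perPermutation = 7 * 3 + 6 * 5;  // 7 groups like "(a)" plus 6 "(.*?)" connectors = 51 chars
permutations * (perPermutation + 1) - 1;
// => 262079 characters (~256 KB) of regex - right where the browser gives up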

A few other noteworthy strategies that didn't work:

Capturing groups with all terms in each group:

(foo|bar)(.*)(foo|bar)

Fails because:

  • Will match options that include repeated terms - e.g. The food in the food court would match (as the snippet below shows) though it obviously shouldn't.
  • If we "double-check" that all terms were, in fact, found it will fail to find valid matches - e.g. The food in the food bar would find foo twice and never get to bar.
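
A quick demonstration of the first failure (my own snippet, not from the original attempt):

var rx = /(foo|bar)(.*)(foo|bar)/i;
rx.test('The food in the food court');
// => true - both alternation groups happily match "foo"; "bar" is never required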

Negative lookaheads and backreferences:

(foo|bar|baz)(.*?)((?!\1)(?:foo|bar|baz))(.*?)((?!\1|\3)(?:foo|bar|baz))

Fails because:

  • Whenever terms are repeated in the query it will reach impossible conditions like "Find me a foo, bar or bar which is not a foo nor a bar".
  • I'm fairly certain it has other issues, but I stopped pursuing it when I realized it was logically flawed.

Positive lookaheads

(?=.*foo)(?=.*bar)(?=.*baz)

Fails because:

  • It is very difficult (if not impossible) to reliably highlight the found matches.
  • I haven't found any way to ensure all terms actually exist independently - i.e. they may all be individually present in the option, but shorter terms may only exist as substrings of longer terms, or terms may overlap (see the example below).
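
A quick demonstration of the substring problem (my own snippet):

var rx = /^(?=.*foo)(?=.*oo)/i;
rx.test('food');
// => true - yet 'oo' only exists inside 'foo', so by the rules above this should NOT match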
Hogan answered 22/9, 2017 at 12:50 Comment(2)
Note a minor issue in the "viral permutations" version: because you also test the (.*?) group matches, you might sometimes highlight a recurring term twice. Example: e nland will underline both e in Greenland.Circlet
@trincot: Thanks for pointing it out, I've made a tweak, though I doubt that strategy will make it very far ;)Hogan

Divide and Conquer

Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.

Each time a match is found it divides that "bite" into up to three new "bites" (fewer when the match is at the beginning or end), marks the matched middle "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".

When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").

If it comes back to the longest term and cannot find another position to match it at, it will return false.

This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.

Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.
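
A quick walk-through (my own illustration) of the query "is land" against the option "Aland Islands":

terms sorted, longest last:   ['is', 'land']
start:                        ['Aland Islands']
match 'land' (longest term):  ['A', ['<u>land</u>'], ' Islands']    //the middle bite is consumed
match 'is' in ' Islands':     ['A', ['<u>land</u>'], ' ', ['<u>Is</u>'], 'lands']
no terms left, join:          'A<u>land</u> <u>Is</u>lands'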

Demo:

Updated for improved performance: The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice() that could be avoided completely.

function divide_and_conquer_replace(query, options, separator) {
    var terms, terms_esc;

    //The inner replacement function
    function divide_and_conquer_inner(bites, depth) {
        var this_term, i, bite, match, new_bites, found_all_others;

        depth = depth ? depth : 1;

        //Get the longest remaining term
        this_term = terms_esc[terms_esc.length - depth];

        //Loop all the bites
        for (i = 0; i < bites.length; i++) {
            bite = bites[i];

            //Reset the lastIndex since we're reusing the RegExp objects
            this_term.lastIndex = 0;

            //Check that we have a string (ie. do not attempt to match bites
            //that are already consumed)
            if (typeof bite === 'string') {
                //Find the next matching position (if any)
                while (match = this_term.exec(bite)) {
                    new_bites = (i > 0) ? bites.slice(0, i) : [];
                    if (match.index > 0) {
                        new_bites.push(bite.slice(0, match.index));
                    }
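                    //Wrapping the matched text in an array marks this bite as consumed -
                    //the typeof check above skips non-strings on deeper recursion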
                    new_bites.push(['<u>' + match[0] + '</u>']);
                    if (this_term.lastIndex < bite.length) {
                        new_bites.push(bite.slice(this_term.lastIndex));
                    }
                    if (i < bites.length - 1) {
                        new_bites = new_bites.concat(bites.slice(i + 1));
                    }

                    if (terms_esc.length > depth) {
                        //Attempt to find all other terms
                        found_all_others = divide_and_conquer_inner(new_bites, depth + 1);

                        //If we found all terms we'll pass the modified string all the
                        //way up to the original callee
                        if (found_all_others) {
                            return found_all_others;
                        }
                        //Otherwise try to match current term somewhere else
                        this_term.lastIndex = match.index + 1;
                    } else {
                        //If no terms remain we have a match
                        return new_bites.join('');
                    }
                }
            }
        }
        //If we reach this point at least one term was not found
        return null;
    };

    // Split query in terms at delimiter
    terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;


    //Sort terms according to length - longest term last
    terms.sort(function(a, b) {
        return a.length - b.length;
    });

    //Escape terms
    //And store RegExp's instead of strings
    terms_esc = terms.map(function (term) {
        return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
    }).map(function (term) {
        return new RegExp(term, 'gi');
    });

    //Loop through each option
    return options.map(function(option){
        return divide_and_conquer_inner([option]);
    }).filter(Boolean);
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 
'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';

function divide_and_conquer(){
    var query = document.getElementById('divide_and_conquer').value;
    var res_elm = document.getElementById('divide_and_conquer_result');
    var t0 = performance.now();
    var results = divide_and_conquer_replace(query, options, separator);
    var t1 = performance.now();
    document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
    res_elm.innerHTML = '';
    for (var result of results) {
        res_elm.innerHTML += '<li>' + result + '</li>';
    }
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>

This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all, or mostly, present in many of the options - such as a a a a a a a a...

In realistic scenarios it is currently outperforming the other proposed algorithms - see the link to jsperf added to the question.

Hogan answered 23/9, 2017 at 4:49 Comment(4)
This fails if an option is abbbba , and the input terms are ab bb ba. The problem is that the "divide" never happens at the middle two bb in this algorithm. You could fix this by adding this_term.lastIndex = match.index + 1 at the end of the while (match = ...) loop.Circlet
Thank you @Circlet - you're a wizard :) Took me a while to realize why this happens, and why the fix will work in similar cases, but of course you are completely right!Hogan
I have tried several other solutions still, with Maps, divide-and-conquer with only passing indexes, heaps, partially indexing/hashing the options, ...etc, ...etc. Nothing beats your divide and conquer solution for searching the 8000 option list. Bravo!Circlet
@Circlet Haha! Ironic considering that I thought I was miles away from a usable solution when I first started writing the question. I've changed very few functional lines of the algorithm since the original version. Thanks for the feedback! And all your help in figuring out it wasn't useless in the first place ;)Hogan

I would suggest a slight variant on the divide and conquer idea: instead of splitting the string into chunks (bites), you could "wipe out" the characters that have been matched and perform further searches on that one string. The character to wipe out with would be the separator, as it is guaranteed not to occur in any of the terms.
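
For example (to illustrate the idea), with the query "land is" against "Aland Islands" and a space separator:

'Aland Islands'    -> 'land' (the longest term) is found at index 1
'A     Islands'    -> the matched characters are wiped with the separator
                   -> 'is' is now searched in the wiped string, so it can only match
                      the 'Is' of 'Islands', never inside the already-consumed 'land'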

Here it is:

function trincotWipeSearch(query, options, separator) {
    // Split query in terms at delimiter
    const terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;
    // Sort terms by descending size
    terms.sort( (a,b) => b.length - a.length );

    // Escape terms, and enrich with size of original term
    // and a string of the same length filled with the separator char
    const items = terms.map(term => ({
        size: term.length,
        wipe: separator.repeat(term.length), 
        regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
    }));

    function getOffsets(termIndex, text) {
        // All terms found?
        if (termIndex >= terms.length) return [];
        let match;
        const { regex, size, wipe } = items[termIndex];
        regex.lastIndex = 0;
        while (match = regex.exec(text)) {
            let index = match.index;
            // Wipe characters and recurse to find other terms
            let offsets = getOffsets(termIndex+1, 
                text.substr(0, index) + wipe + text.substr(index + size));
            if (offsets !== undefined) { 
                // Solution found, backtrack all the way
                return offsets.concat([index, index + size]);
            }
            regex.lastIndex = match.index + 1;
        }
    }

    // Loop through each option
    return options.map( option => {
        // Get the offsets of the matches
        let offsets = getOffsets(0, option);
        if (offsets) {
            // Apply the offsets to add the markup
            offsets
                .sort( (a,b) => b - a )
                .map((index, i) => {
                    option = option.substr(0, index) 
                        + (i%2 ? "<u>" : "</u>")
                        + option.substr(index);
                });
            return option;
        }
    }).filter(Boolean); // get only the non-empty results
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 
'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];

/*
 * I/O and performance measurements 
 */

function processInput() {
    var query = this.value.toLowerCase();
    const t0 = performance.now();
    const matches = trincotWipeSearch(query, options, ' ');
    const spentTime = performance.now() - t0;
    // Output the time spent
    time.textContent = spentTime.toFixed(2);
    // Output the matches
    result.innerHTML = '';
    for (var match of matches) {
        // Append it to the result list
        var li = document.createElement('li');
        li.innerHTML = match;
        result.appendChild(li);
    }
}

findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul { 
    height:300px;
    font-size: smaller;
    overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>

<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>

I have excluded DOM I/O from the time measurement.

Here is a jsfiddle comparing the two algorithms side by side. It should not be difficult to extend it with further algorithms for comparison.

When the separator could be any regular expression

...then the above function cannot be used. One way to overcome this is to introduce a "shadow" string, just as long as the option string, but with only 2 different possible characters in it (e.g. . and x):

  • One of the two would indicate that the corresponding character (i.e. at the same position) in the option string has been matched with a term, and therefore is no longer available for another term's match.

  • The other character would indicate that the corresponding character in the option string is still available for being included in a term's match.

Obviously this makes the function a tad slower, as there might be matches that need to be rejected after checking against this shadow string:

function trincotShadowMarks (query, options, separator) {
    // Split query in terms at delimiter
    const terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;
    // Sort terms by descending size
    terms.sort( (a,b) => b.length - a.length );

    // Escape terms, and enrich with the size of the original term
    // plus 'used' and 'free' marker strings of the same length
    const items = terms.map(term => ({
        size: term.length,
        used: 'x'.repeat(term.length),
        free: '.'.repeat(term.length),
        regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
    }));

    function getOffsets(termIndex, text, shadow) {
        // All terms found?
        if (termIndex >= terms.length) return [];
        let match;
        const { regex, size, used, free } = items[termIndex];
        regex.lastIndex = 0;
        while (regex.lastIndex > -1 && (match = regex.exec(text))) {
            let index = match.index;
            // Is this match not overlapping with another match?
            if (!shadow.substr(index, size).includes('x')) {
                // Mark position as used and recurse to find other terms
                let offsets = getOffsets(termIndex+1, text,
                    shadow.substr(0, index) + used + shadow.substr(index + size));
                if (offsets !== undefined) { 
                    // Solution found, backtrack all the way
                    return offsets.concat([index, index + size]);
                }
            }
            regex.lastIndex = shadow.indexOf(free, match.index + 1);
        }
    }

    // Loop through each option
    return options.map( option => {
        // Get the offsets of the matches
        let offsets = getOffsets(0, option, '.'.repeat(option.length));
        if (offsets) {
            // Apply the offsets to add the markup
            offsets
                .sort( (a,b) => b - a )
                .map((index, i) => {
                    option = option.substr(0, index) 
                        + (i%2 ? "<u>" : "</u>")
                        + option.substr(index);
                });
            return option;
        }
    }).filter(Boolean); // get only the non-empty results
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 
'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];

/*
 * I/O and performance measurements 
 */

function processInput() {
    var query = this.value.toLowerCase();
    const t0 = performance.now();
    const matches = trincotShadowMarks(query, options, ' ');
    const spentTime = performance.now() - t0;
    // Output the time spent
    time.textContent = spentTime.toFixed(2);
    // Output the matches
    result.innerHTML = '';
    for (var match of matches) {
        // Append it to the result list
        var li = document.createElement('li');
        li.innerHTML = match;
        result.appendChild(li);
    }
}

findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul { 
    height:300px;
    font-size: smaller;
    overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>

<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
Circlet answered 23/9, 2017 at 15:39 Comment(11)
Excellent! This was actually the (basic) idea that led me to the divide and conquer strategy - it is certainly more optimal than my version, and it does solve the "shaved down" version of the question. Trouble is (in my case) that I need to support an (up front) unknown regex as the delimiter (fx someone might use /\s|\*|,/ or something like that). Given that, my original conclusion was that I would have to mark "bites" as consumed some other way than actually manipulating the string, since no character (or even sequence) is truly "safe"... Any thoughts on how to solve that?Hogan
One way would be to use a second "shadow" string, just as long as the option string, but populated with just 2 characters: one for indicating that position has been occupied by a previous match, and another for indicating that position is still available for matching. Obviously this slows down the original idea, and depending on the input it may run slower or faster than the divide and conquer algorithm. I added it here: jsfiddle.net/dqjybz6oCirclet
I think I understand what you mean, though the fiddle you link to is the same as in your answer (forgot to hit update perhaps?). Anyway, as mentioned, I'm pretty sure I understand, so I've added it myself along with skirtles method (which turns out not to perform very well in this comparison - I'm unsure if it's because I screwed something up when adapting it). The shadow string method is slightly slower than your original, but works perfectly with the regex-separator as far as I can tell. Still looks consistently faster than mine. New fiddle: jsfiddle.net/Mikk3lRo/dqjybz6o/1Hogan
Ah yes forgot to save, but yes you got the idea. I just had a some minor (maybe insignificant) improvement to update lastIndex at the end of the loop. See what I intended to refer to: jsfiddle.net/dqjybz6o/2, but your version of it is fine.Circlet
Tigregalis introduced the idea of a pre-run optimization. Since options are unlikely to change between keystrokes in a real implementation this seems reasonable. I've added a "prep" function to each method in the fiddle linked from the question. If you're still interested you can consider if it might benefit your algorithm(s) - I didn't come up with anything obvious, but I'm wiped for tonight ;)Hogan
I removed the first solution ("wipe") to allow me enough space for the one I have added now: Generalized Suffix Tree (preprocessing).Circlet
I'm kinda sorry to see you do that - I think you should roll back that change and post your new solution in a separate answer as it is an entirely new way to approach the problem. Your first solution was beautiful in it's simplicity, and has very good performance. I decided to use my own algorithm for my current project days ago, as it is by far the easiest (for me) to adapt to all the other requirements. But for the benefit of others I really think you should put the new solution in a separate answer.Hogan
OK, I will do as you suggest. I will also use the fiddle in your question to add my new solution to it -- unless you prefer doing it yourself ;-)Circlet
Your code hygiene is at least on par with mine, so you're welcome to update as you see fit ;) I've selected my own answer since it covers my situation best. I've rewarded the original bounty to skirtle as he's made a huge effort. Your solution and comments have been most helpful to me though, so I've started a new bounty that will be yours when I'm allowed to award it.Hogan
Thank you for the bounty! I plan to still post some alternative if it works out. After testing more, I could not make the suffixtree preprocessing to run in an acceptable time for the 8000 phrases (cf. my other answer), so it is probably not a good candidate for your use case. But just to let you know I am still looking into this question ;-)Circlet
For my current project I've already settled on the optimized Divide and Conquer. I started working on it about a month ago, and didn't expect to spend more than an hour or two on it... until I realized that it wasn't actually simple to solve. This SO question has been very educational for me... and the problem itself has turned out to be a lot more interesting than I first anticipated. Even though we have very good algorithms already and my current project is "closed", I'm sure more (and better) strategies exist... so by all means do post when you have something to add. I'm still interested ;)Hogan

I gave it a go but I'm not sure how much help this will be. My approach is similar to your Divide and Conquer technique.

Instead of biting off bits of the string I search for each term ahead of time and store up a collection of all the matches, recording the start and finish positions. If there aren't enough matches for a particular search term the algorithm immediately bails for that 'option'.

Once it has gathered together all of the possible matches it recursively attempts to find a combination that doesn't overlap. There's a lot of copying of data structures going on in that recursion and I suspect it could be optimized a lot better than I have it here. I can also only apologize for some of the variable names; I was struggling to come up with names that made any sense at all.

For certain test searches, such as a n a n a n a n ... it seems to cope better than the original Divide and Conquer technique but I suspect that may be just because of the early bail-out that is performed when insufficient matches are found for a particular search term. Without a large quantity of real data it's difficult to know where the really valuable optimizations would be.
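
Roughly, for the option 'ababababa' and the query ab ba the collected structure looks like this (an illustration of the intermediate data, not code from the snippet):

matches = {
    ab: [{start: 0, end: 2}, {start: 2, end: 4}, {start: 4, end: 6}, {start: 6, end: 8}],
    ba: [{start: 1, end: 3}, {start: 3, end: 5}, {start: 5, end: 7}, {start: 7, end: 9}]
};
// The recursion then looks for one entry per search term such that no two picked
// ranges overlap - e.g. ab at 0-2 and ba at 3-5.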

function search() {
    var options = [
        'ababababa',
        'United States',
        'United Kingdom',
        'Afghanistan',
        'Aland Islands',
        'Albania',
        'Algeria',
        'American Samoa',
        'Andorra',
        'Angola',
        'Anguilla',
        'Antarctica',
        'Antigua and Barbuda',
        'Argentina',
        'Armenia',
        'Aruba',
        'Australia',
        'Austria',
        'Azerbaijan',
        'Bahamas',
        'Bahrain',
        'Bangladesh',
        'Barbados',
        'Belarus',
        'Belgium',
        'Belize',
        'Benin',
        'Bermuda',
        'Bhutan',
        'Bolivia, Plurinational State of',
        'Bonaire, Sint Eustatius and Saba',
        'Bosnia and Herzegovina',
        'Botswana',
        'Bouvet Island',
        'Brazil',
        'British Indian Ocean Territory',
        'Brunei Darussalam',
        'Bulgaria',
        'Burkina Faso',
        'Burundi',
        'Cambodia',
        'Cameroon',
        'Canada',
        'Cape Verde',
        'Cayman Islands',
        'Central African Republic',
        'Chad',
        'Chile',
        'China',
        'Christmas Island',
        'Cocos (Keeling) Islands',
        'Colombia',
        'Comoros',
        'Congo',
        'Congo, The Democratic Republic of The',
        'Cook Islands',
        'Costa Rica',
        'Cote D\'ivoire',
        'Croatia',
        'Cuba',
        'Curacao',
        'Cyprus',
        'Czech Republic',
        'Denmark',
        'Djibouti',
        'Dominica',
        'Dominican Republic',
        'Ecuador',
        'Egypt',
        'El Salvador',
        'Equatorial Guinea',
        'Eritrea',
        'Estonia',
        'Ethiopia',
        'Falkland Islands (Malvinas)',
        'Faroe Islands',
        'Fiji',
        'Finland',
        'France',
        'French Guiana',
        'French Polynesia',
        'French Southern Territories',
        'Gabon',
        'Gambia',
        'Georgia',
        'Germany',
        'Ghana',
        'Gibraltar',
        'Greece',
        'Greenland',
        'Grenada',
        'Guadeloupe',
        'Guam',
        'Guatemala',
        'Guernsey',
        'Guinea',
        'Guinea-bissau',
        'Guyana',
        'Haiti',
        'Heard Island and Mcdonald Islands',
        'Holy See (Vatican City State)',
        'Honduras',
        'Hong Kong',
        'Hungary',
        'Iceland',
        'India',
        'Indonesia',
        'Iran, Islamic Republic of',
        'Iraq',
        'Ireland',
        'Isle of Man',
        'Israel',
        'Italy',
        'Jamaica',
        'Japan',
        'Jersey',
        'Jordan',
        'Kazakhstan',
        'Kenya',
        'Kiribati',
        'Korea, Democratic People\'s Republic of',
        'Korea, Republic of',
        'Kuwait',
        'Kyrgyzstan',
        'Lao People\'s Democratic Republic',
        'Latvia',
        'Lebanon',
        'Lesotho',
        'Liberia',
        'Libya',
        'Liechtenstein',
        'Lithuania',
        'Luxembourg',
        'Macao',
        'Macedonia, The Former Yugoslav Republic of',
        'Madagascar',
        'Malawi',
        'Malaysia',
        'Maldives',
        'Mali',
        'Malta',
        'Marshall Islands',
        'Martinique',
        'Mauritania',
        'Mauritius',
        'Mayotte',
        'Mexico',
        'Micronesia, Federated States of',
        'Moldova, Republic of',
        'Monaco',
        'Mongolia',
        'Montenegro',
        'Montserrat',
        'Morocco',
        'Mozambique',
        'Myanmar',
        'Namibia',
        'Nauru',
        'Nepal',
        'Netherlands',
        'New Caledonia',
        'New Zealand',
        'Nicaragua',
        'Niger',
        'Nigeria',
        'Niue',
        'Norfolk Island',
        'Northern Mariana Islands',
        'Norway',
        'Oman',
        'Pakistan',
        'Palau',
        'Palestinian Territory, Occupied',
        'Panama',
        'Papua New Guinea',
        'Paraguay',
        'Peru',
        'Philippines',
        'Pitcairn',
        'Poland',
        'Portugal',
        'Puerto Rico',
        'Qatar',
        'Reunion',
        'Romania',
        'Russian Federation',
        'Rwanda',
        'Saint Barthelemy',
        'Saint Helena, Ascension and Tristan da Cunha',
        'Saint Kitts and Nevis',
        'Saint Lucia',
        'Saint Martin (French part)',
        'Saint Pierre and Miquelon',
        'Saint Vincent and The Grenadines',
        'Samoa',
        'San Marino',
        'Sao Tome and Principe',
        'Saudi Arabia',
        'Senegal',
        'Serbia',
        'Seychelles',
        'Sierra Leone',
        'Singapore',
        'Sint Maarten (Dutch part)',
        'Slovakia',
        'Slovenia',
        'Solomon Islands',
        'Somalia',
        'South Africa',
        'South Georgia and The South Sandwich Islands',
        'South Sudan',
        'Spain',
        'Sri Lanka',
        'Sudan',
        'Suriname',
        'Svalbard and Jan Mayen',
        'Swaziland',
        'Sweden',
        'Switzerland',
        'Syrian Arab Republic',
        'Taiwan, Province of China',
        'Tajikistan',
        'Tanzania, United Republic of',
        'Thailand',
        'Timor-leste',
        'Togo',
        'Tokelau',
        'Tonga',
        'Trinidad and Tobago',
        'Tunisia',
        'Turkey',
        'Turkmenistan',
        'Turks and Caicos Islands',
        'Tuvalu',
        'Uganda',
        'Ukraine',
        'United Arab Emirates',
        'United Kingdom',
        'United States',
        'United States Minor Outlying Islands',
        'Uruguay',
        'Uzbekistan',
        'Vanuatu',
        'Venezuela, Bolivarian Republic of',
        'Viet Nam',
        'Virgin Islands, British',
        'Virgin Islands, U.S.',
        'Wallis and Futuna',
        'Western Sahara',
        'Yemen',
        'Zambia',
        'Zimbabwe'
    ];

    var terms = document.getElementById('search').value.trim().toLowerCase().split(/\s+/);

    if (!terms[0]) {
        terms = [];
    }

    document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);

    var startTime = performance.now();

    // Term counts is a map storing how many times each search term appears in the query
    var termCounts = {};

    terms.forEach(function(term) {
        termCounts[term] = (termCounts[term] || 0) + 1;
    });

    // An array of search terms with the duplicates removed
    var uniqueTerms = Object.keys(termCounts);

    // Loop through each option and map to either a highlight version or null
    options = options.map(function(optionText) {
        var matches = {},
            lastMatchIndex = {},
            option = optionText.toLowerCase();

        uniqueTerms.forEach(function(term) {
            // This array will be populated with start/end position of each match for this term
            matches[term] = [];

            // The index of the last match... which could be deduced from the matches but this is slightly easier
            lastMatchIndex[term] = -1;
        });

        var incompleteMatchTerms = uniqueTerms.slice(),
            nextMatchTerm;

        // This is probably a premature optimization but doing it this
        // way ensures we check that each search term occurs at least
        // once as quickly as possible.
        while (nextMatchTerm = incompleteMatchTerms.shift()) {
            var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);

            if (nextMatchIndex === -1) {
                // Haven't found enough matches for this term, so the option doesn't match
                if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
                    return null;
                }
            }
            else {
                // Found another match, put the term back on the queue
                // for another round
                incompleteMatchTerms.push(nextMatchTerm);
                lastMatchIndex[nextMatchTerm] = nextMatchIndex;

                matches[nextMatchTerm].push({
                    start: nextMatchIndex,
                    end: nextMatchIndex + nextMatchTerm.length
                });
            }
        }

        // Pass in the original array of terms... we attempt to highlight in the order of the original query
        var highlights = performHighlight(terms, matches);

        if (!highlights) {
            return null;
        }

        // We need the highlights sorted so that we do the replacing from the end of the string
        highlights.sort(function(h1, h2) {
            return h2.start - h1.start;
        });

        highlights.forEach(function(highlight) {
            optionText = optionText.slice(0, highlight.start)
                    + '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
                    + optionText.slice(highlight.end);
        });

        return optionText;

        function performHighlight(terms, allMatches) {
            // If there are no terms left to match we've got a hit
            if (terms.length === 0) {
                return [];
            }

            var nextTerms = terms.slice(),
                term = nextTerms.shift(),
                matches = allMatches[term].slice(),
                match;

            while (match = matches.shift()) {
                var nextMatches = {};

                // We need to purge any entries from nextMatches that overlap the current match
                uniqueTerms.forEach(function(nextTerm) {
                    var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];

                    nextMatches[nextTerm] = nextMatch.filter(function(match2) {
                        return match.start >= match2.end || match.end <= match2.start;
                    });
                });

                var highlights = performHighlight(nextTerms, nextMatches);

                if (highlights) {
                    highlights.push(match);

                    return highlights;
                }
            }

            return null;
        }
    });

    document.getElementById('results').innerHTML = options.map(function(option) {
        if (option) {
            return '<li>' + option + '</li>';
        }

        return '';
    }).join('');

    document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}
<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>

Update:

Based on feedback from Mikk3lRo in the comments I've done a bit of performance tuning and come up with this:

https://jsfiddle.net/skirtle/ndeuqn02/1/

The core algorithm is the same but I've made it much more difficult to understand, all in the name of performance. Most of the changes relate to avoiding the creation of new objects wherever possible.

As the algorithm does a lot of searching upfront for things it might never need there will always be opportunities for other algorithms to be quicker, especially in simple cases. Many of those cases could be handled separately but I haven't attempted that kind of optimisation.

In Chrome it now outperforms the other implementations in a lot of different scenarios, though that is an unfair comparison as they haven't yet been tuned in the same way. The other implementations tend to be slightly faster in Firefox for simple searches but times are all in the same ballpark.

Some particularly interesting searches:

  • a ab ba baba. I've added a new 'option' and adjusted the CSS to demonstrate this. The algorithms differ in their chosen way to perform the highlighting. My algorithm favours the order of the terms in the query rather than basing it on the length of the terms. There are further optimisations available if I don't worry about the ordering but I think they'd only help in extreme cases of overlaps.
  • t r i s t a n d a c u n h a. Note the spaces between the letters, these are 14 separate search terms. If you type this one term at a time Divide and Conquer will quickly start to struggle but it does recover towards the end. Wipe and Shadow cope well for longer but when you type the letter c they will fall off a cliff. I think it's an exponential explosion in the backtracking but I haven't confirmed that. I'm sure with a bit of work it could be addressed in simple cases but it might be trickier to fix in cases where the backtracking is caused by an unresolvable overlap.

I'm sure all the implementations could be sped up with a bit more tuning and a few carefully crafted bits of special-case handling. Which one is actually 'best' for real scenarios I'm not sure but my current feeling is that my algorithm probably only has a narrow sweet spot where it would outperform the others in a truly fair test. An algorithm that doesn't do all that searching upfront seems hard to beat for real searches.

Update 2

I've tried a different implementation of my earlier approach:

https://jsfiddle.net/skirtle/ndeuqn02/9/

Note that I've only updated my own implementation, the others remain out of date.

I thought I'd try to avoid unnecessary searches by performing them lazily rather than doing them all upfront. It still caches them so that they can be reused if the algorithm needs to backtrack. I don't know whether this makes a significant difference as performing small numbers of extra searches on short strings probably wasn't adding much overhead.
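
The caching part in isolation could look roughly like this (a minimal sketch, not Skirtle's actual code): compute the match positions for a given term in a given option only on first request, and serve repeats from a map so that backtracking can reuse earlier searches.

// Sketch: lazily computed, cached match positions for a (term, option) pair.
var matchCache = new Map();

function getMatchPositions(term, option) {
    if (!term) return []; // guard: empty terms are filtered out upstream anyway
    var key = term + '\u0000' + option; // assumes neither string contains a NUL character
    if (!matchCache.has(key)) {
        var re = new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, '\\$&'), 'gi');
        var positions = [], m;
        while ((m = re.exec(option)) !== null) {
            positions.push(m.index);
        }
        matchCache.set(key, positions);
    }
    return matchCache.get(key);
}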

I also experimented with cutting out the function recursion. While it does seem to work I feel there's a high risk of bugs (it would need a lot of unit tests to be sure it actually does work). I'm not convinced this part was really a success because the data structures involved make it really difficult to follow. It does seem to be fast but not by enough to justify the complexity.

I also experimented with alternative ways to build up the final highlights. All that sorting and slicing seemed to be a performance drain but, again, the code gets more complicated trying to avoid it. Some of these gains might be applicable to the other algorithms though.

Another idea I've introduced here is a pre-search analysis of the query terms (dependent only on the query and not on the options). It checks whether terms can overlap and for any terms where an overlap is impossible (e.g. cat dog) it uses a much simpler algorithm to just grab the matches. This idea could potentially be applied to the other algorithms too.
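
To illustrate the kind of overlap test meant here (a minimal sketch in my own words, not the code from the fiddle): two terms can only compete for the same characters if one contains the other, or if a suffix of one is a prefix of the other. Terms that fail this test can be matched and highlighted independently.

// Minimal sketch (illustration only) of a pairwise overlap test for query terms.
function canOverlap(a, b) {
    // One term containing the other is an obvious overlap risk
    if (a.indexOf(b) !== -1 || b.indexOf(a) !== -1) return true;
    // Otherwise they can only collide if a suffix of one is a prefix of the other
    for (var i = 1; i < Math.min(a.length, b.length); i++) {
        if (a.slice(-i) === b.slice(0, i)) return true;
        if (b.slice(-i) === a.slice(0, i)) return true;
    }
    return false;
}

canOverlap('cat', 'dog'); // false - safe to highlight each term with a simple pass
canOverlap('ab', 'ba');   // true  - in 'aba' the two terms compete for the middle character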

As mentioned in the comments the idea of running some sort of pre-search analysis of the options is also possible but I haven't actually implemented that here. It's difficult to know what sort of search index would be most beneficial as it depends on things like memory usage and the specifics of the options. However, it might be more practical to try carrying over small amounts of information from one search to the next.

For example, if someone searches for united states there's a good chance that the last thing they typed was the final s and their previous search was united state. Two potential optimisations based on this are:

  1. The matching options for united states will be a subset of those for united state, so if we remember that list we could save having to check all the other options. This could be used for any of the algorithms (a sketch of this idea follows the list).
  2. In the case of my algorithm the match caches could be retained from one search to the next. While the cache entry for state wouldn't be any use the entry for united would be exactly the same from one search to the next, allowing us to skip an expensive part of the algorithm for that term.
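
A minimal sketch of idea 1 (not code from any of the fiddles): if the new query simply extends the previous one, only the previous survivors need re-checking. Here matchOption stands in for whichever per-option matcher an algorithm uses, so the names are placeholders.

// Sketch: reuse the previous search's surviving options when the query only grows.
var lastQuery = '', lastSurvivors = null;

function searchIncremental(query, options, matchOption) {
    // 'united states' extends 'united state', so its matches are a subset of the previous ones
    var candidates = (lastSurvivors && lastQuery && query.indexOf(lastQuery) === 0)
        ? lastSurvivors
        : options;
    var survivors = candidates.filter(function (option) {
        return matchOption(query, option); // true if the option contains all terms
    });
    lastQuery = query;
    lastSurvivors = survivors;
    return survivors;
}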
Aubreyaubrie answered 22/9, 2017 at 18:20 Comment(8)
Looks very promising. Certainly performs well on the small test-set. And it has given me some new ideas to work with.Hogan
Hm. I've added it alongside trincot's and my own solution, and it doesn't perform well when tested this way. Take a look at jsfiddle.net/Mikk3lRo/dqjybz6o/1 ...and note that each algorithm is run many times on each keypress to get more precise measurements... that's why it's really slow to react.Hogan
@Hogan I've updated my answer. The original implementation needed optimising to get performance comparable to the others. The core algorithm is good at avoiding performance spikes in certain edge cases but I think it's unlikely to outperform the others for most real searches.Aubreyaubrie
I can't decide what's most important :p On one hand it doesn't matter if a search takes 1 or 5ms... On the other hand a real scenario might have 10K options and performance would start to matter. Avoiding a spike is obviously desirable, but so far the cases where it happens are so unlikely in a real scenario that they can almost be ignored. Anyways, I'm still playing around with it - I need to implement like a million settings to modify the behaviour too. I've added the fiddle to the question itself. And I'm still hoping others will chime in with ideas too ;)Hogan
Tigregalis introduced the idea of a pre-run optimization. Since options are unlikely to change between keystrokes in a real implementation this seems reasonable. I've added a "prep" function to each method in the fiddle linked from the question. If you're still interested you can consider if it can benefit your algorithm more than what I came up with.Hogan
@Hogan I've posted another update with some ideas I've been pondering. I haven't done anything specific to pre-run optimization because I feel that would be very specific to the data and the types of searches it needs to support and trying to cover all bases would be several chapters by itself. I don't know whether my latest implementation is really better than my previous attempts but I think it is faster in most cases. It should be noted that it is 'cheating' for simple searches though (as I've explained in the update).Aubreyaubrie
Congratulations on the bounty: well deserved!Circlet
I've thought of some of the things you mention too - especially the fact that search "n" will almost always be a subset of "n-1". If performance was a major concern I'd certainly look further into optimization. For my current project filesize, readability and compatibility (with many settings that affect search behavior) are much more important than saving a few ms in some edge cases though. Therefore I have selected my own answer. The bounty is yours in recognition of your effort - and the fact that your algorithm has clear benefits over any other proposal in certain situations.Hogan
H
2

Divide and Conquer

Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.

Each time a match is found it divides the "bite" in which it was found into three parts (fewer when the match is at the beginning or end), marks the matched part as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".

When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").

If it comes back to the longest term and cannot find another position to match it at, it will return false.

This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.

Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.
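
To make the "bites" concrete, here is a worked example (illustration only, based on the implementation below) for the query united king against the option United Kingdom:

// start:                   ['United Kingdom']
// after matching 'united': [['<u>United</u>'], ' Kingdom']                    (the consumed bite is wrapped in an array)
// after matching 'king':   [['<u>United</u>'], ' ', ['<u>King</u>'], 'dom']
// no terms remain, so the bites are joined back together into:
// '<u>United</u> <u>King</u>dom'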

Demo:

Updated for improved performance: The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice() that could be avoided completely.

function divide_and_conquer_replace(query, options, separator) {
    var terms, terms_esc;

    //The inner replacement function
    function divide_and_conquer_inner(bites, depth) {
        var this_term, i, bite, match, new_bites, found_all_others;

        depth = depth ? depth : 1;

        //Get the longest remaining term
        this_term = terms_esc[terms_esc.length - depth];

        //Loop all the bites
        for (i = 0; i < bites.length; i++) {
            bite = bites[i];

            //Reset the lastIndex since we're reusing the RegExp objects
            this_term.lastIndex = 0;

            //Check that we have a string (ie. do not attempt to match bites
            //that are already consumed)
            if (typeof bite === 'string') {
                //Find the next matching position (if any)
                while (match = this_term.exec(bite)) {
                    new_bites = (i > 0) ? bites.slice(0, i) : [];
                    if (match.index > 0) {
                        new_bites.push(bite.slice(0, match.index));
                    }
                    new_bites.push(['<u>' + match[0] + '</u>']);
                    if (this_term.lastIndex < bite.length) {
                        new_bites.push(bite.slice(this_term.lastIndex));
                    }
                    if (i < bites.length - 1) {
                        new_bites = new_bites.concat(bites.slice(i + 1));
                    }

                    if (terms_esc.length > depth) {
                        //Attempt to find all other terms
                        found_all_others = divide_and_conquer_inner(new_bites, depth + 1);

                        //If we found all terms we'll pass the modified string all the
                        //way up to the original callee
                        if (found_all_others) {
                            return found_all_others;
                        }
                        //Otherwise try to match current term somewhere else
                        this_term.lastIndex = match.index + 1;
                    } else {
                        //If no terms remain we have a match
                        return new_bites.join('');
                    }
                }
            }
        }
        //If we reach this point at least one term was not found
        return null;
    };

    // Split query in terms at delimiter
    terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;


    //Sort terms according to length - longest term last
    terms.sort(function(a, b) {
        return a.length - b.length;
    });

    //Escape terms
    //And store RegExp's instead of strings
    terms_esc = terms.map(function (term) {
        return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
    }).map(function (term) {
        return new RegExp(term, 'gi');
    });

    //Loop through each option
    return options.map(function(option){
        return divide_and_conquer_inner([option]);
    }).filter(Boolean);
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 
'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';

function divide_and_conquer(){
    var query = document.getElementById('divide_and_conquer').value;
    var res_elm = document.getElementById('divide_and_conquer_result');
    var t0 = performance.now();
    var results = divide_and_conquer_replace(query, options, separator);
    var t1 = performance.now();
    document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
    res_elm.innerHTML = '';
    for (var result of results) {
        res_elm.innerHTML += '<li>' + result + '</li>';
    }
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>

This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all, or mostly, present in many of the options - such as a a a a a a a a...

In realistic scenarios it is currently outperforming the other proposed algorithms - see the link to jsperf added to the question.

Hogan answered 23/9, 2017 at 4:49 Comment(4)
This fails if an option is abbbba, and the input terms are ab bb ba. The problem is that the "divide" never happens at the middle two bb in this algorithm. You could fix this by adding this_term.lastIndex = match.index + 1 at the end of the while (match = ...) loop.Circlet
Thank you @Circlet - you're a wizard :) Took me a while to realize why this happens, and why the fix will work in similar cases, but of course you are completely right!Hogan
I have tried several other solutions still, with Maps, divide-and-conquer with only passing indexes, heaps, partially indexing/hashing the options, ...etc, ...etc. Nothing beats your divide and conquer solution for searching the 8000 option list. Bravo!Circlet
@Circlet Haha! Ironic considering that I thought I was miles away from a usable solution when I first started writing the question. I've changed very few functional lines of the algorithm since the original version. Thanks for the feedback! And all your help in figuring out it wasn't useless in the first place ;)Hogan
C
2

Here is an entirely different approach from the one taken in my previous answer -- to which I could not add all of the below (size restriction), so... it's a separate answer.

Generalized Suffix Tree: Preprocessing the options

A generalized suffix tree is a structure that in theory allows efficient substring search across a set of strings. So I thought I would have a go at it.
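
For intuition (an illustration, not part of the implementation below): every substring of an option is a prefix of one of its suffixes, so a tree that indexes all suffixes of all options can answer "which options contain this term, and at what offsets?" with a single walk from the root.

// Illustration only - the suffixes indexed for two tiny options:
//   option 0: 'abc' -> 'abc', 'bc', 'c'
//   option 1: 'bca' -> 'bca', 'ca', 'a'
// A term like 'bc' is a prefix of a suffix of both options, so walking the tree
// from the root along 'b' then 'c' reaches a node whose leaves identify the
// (optionId, offset) pairs (0, 1) and (1, 0).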

Building such a tree in an efficient way is far from trivial, as can be seen from this awesome explanation of Ukkonen's algorithm, which concerns building a suffix tree for one phrase (option).

I have taken inspiration from the implementation found here, which needed some adaptation to:

  • Apply a better coding style (e.g. get rid of non-explicitly declared global variables)
  • Make it work without needing to add delimiters after texts. This was really tricky, and I hope I did not miss any border conditions
  • Make it work for multiple strings (i.e. generalized)

So here it is:

"use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also: https://mcmap.net/q/45491/-ukkonen-39-s-suffix-tree-algorithm-in-plain-english/5459839
class Node {
    constructor() {
        this.edges = {};
        this.suffixLink = null;
    }
    addEdge(ch, textId, start, end, node) {
        this.edges[ch] = { textId, start, end, node };
    }
}

class Nikkonen extends Node {
    constructor() {
        super(); // root node of the tree
        this.texts = [];
    }
    findNode(s) {
        if (!s.length) return;
        let node = this,
            len,
            suffixSize = 0,
            edge;
        for (let i = 0; i < s.length; i += len) {
            edge = node.edges[s.charAt(i)];
            if (!edge) return;
            len = Math.min(edge.end - edge.start, s.length - i);
            if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
            node = edge.node;
        }
        return { edge, len };
    }
    findAll(term, termId = 1) {
        const { edge, len } = this.findNode(term) || {};
        if (!edge) return {}; // not found
        // Find all leaves
        const matches = new Map;
        (function recurse({ node, textId, start, end }, suffixLen) {
            suffixLen += end - start;
            const edges = Object.values(node.edges);
            if (!edges.length) { // leaf node: calculate the match
                if (!(matches.has(textId))) matches.set(textId, []);
                matches.get(textId).push({ offset: end - suffixLen, termId });
                return;
            }
            edges.forEach( edge => recurse(edge, suffixLen) );
        })(edge, term.length - len);
        return matches;
    }
    addText(text) { 
        // Implements Ukkonen's algorithm for building the tree
        // Inspired by https://felix-halim.net/misc/suffix-tree/
        const root = this,
            active = {
                node: root,
                textId: this.texts.length,
                start: 0,
                end: 0,
            },
            texts = this.texts;
        
        // Private functions
        function getChar(textId, i) {
            return texts[textId].charAt(i) || '$' + textId;
        }
        
        function addEdge(fromNode, textId, start, end, node) {
            fromNode.addEdge(getChar(textId, start), textId, start, end, node);
        }
        
        function testAndSplit() {
            const ch = getChar(active.textId, active.end);
            if (active.start < active.end) {
                const edge = active.node.edges[getChar(active.textId, active.start)],
                    splitPoint = edge.start + active.end - active.start;
                if (ch === getChar(edge.textId, splitPoint)) return;
                const newNode = new Node();
                addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
                addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
                return newNode;
            }
            if (!(ch in active.node.edges)) return active.node;
        }

        function canonize() {
            while (active.start < active.end) {
                const edge = active.node.edges[getChar(active.textId, active.start)]; 
                if (edge.end - edge.start > active.end - active.start) break;
                active.start += edge.end - edge.start; 
                active.node = edge.node;
            }
        }
        
        function update() {
            let prevNewNode = root,
                newNode;
            
            while (newNode = testAndSplit()) {
                addEdge(newNode, active.textId, active.end, text.length+1, new Node());
                // Rule 2: add suffix-link from previously inserted node
                if (prevNewNode !== root) {
                    prevNewNode.suffixLink = newNode; 
                }
                prevNewNode = newNode;
                // Rule 3: follow suffixLink after split
                active.node = active.node.suffixLink || root;
                canonize(); // because active.node changed
            }
            if (prevNewNode !== root) {
                prevNewNode.suffixLink = active.node;
            }
        }

        texts.push(text);

        if (!root.suffixLink) root.suffixLink = new Node(); 
        for (let i = 0; i < text.length; i++) {
            addEdge(root.suffixLink, active.textId, i, i+1, root);
        }

        // Main Ukkonen loop: add each character from left to right to the tree
        while (active.end <= text.length) {
            update();
            active.end++;
            canonize(); // because active.end changed
        }
    }
}

function trincotSuffixTree(query, options, suffixTree, separator) {
    // Split query in terms at delimiter
    const terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;
    // Sort terms by descending size
    terms.sort( (a,b) => b.length - a.length );
    
    // create Map keyed by term with count info
    const termMap = new Map(terms.map( (term, termId) => [term, { termId, count: 0, leftOver: 0, size: term.length }] ));
    terms.forEach( (term) => termMap.get(term).count++ );
    
    function getNonOverlaps(offsets, leftOver, lastIndex = 0, offsetIndex = 0) {
        // All terms found?
        if (!leftOver) return [];
        let giveUpAt = Infinity;
        // While still enough matches left over:
        while (offsetIndex + leftOver <= offsets.length) {
            const { termId, offset } = offsets[offsetIndex++];
            if (offset < lastIndex) continue; // overlap, try next
            if (offset >= giveUpAt) break; // Looking further makes no sense
            const termInfo = termMap.get(terms[termId]);
            //console.log('termId', termId, 'offset', offset, 'size', termInfo.size, 'lastIndex', lastIndex);
            if (!termInfo.leftOver) continue; // too many of the same term, try next
            termInfo.leftOver--;
            const result = getNonOverlaps(offsets, leftOver - 1, offset + termInfo.size, offsetIndex);
            // If success, then completely backtrack out of recursion.
            if (result) return result.concat([offset + termInfo.size, offset]);
            termInfo.leftOver++; // restore after failed recursive search and try next
            // If a term-match at a given offset could not lead to a solution (in recursion), 
            // and if we keep those matched character postions all unmatched and only start matching after
            // the end of that location, it will certainly not lead to a solution either.
            giveUpAt = Math.min(giveUpAt, offset + termInfo.size);
        }
    }

    let allTermsAllOptionsOffsets;
    // Loop through the unique terms:
    for (let [term, termInfo] of termMap) {
        // Get the offsets of the matches of this term in all options (in the preprocessed tree)
        const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
        //console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
        if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
        if (!allTermsAllOptionsOffsets) {
            allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
        } else {
            // Merge with all previously found offsets for other terms (intersection)
            for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
                let newOffsets = thisTermAllOptionsOffsets.get(optionId);
                if (!newOffsets || newOffsets.length < termInfo.count) {
                    // this option does not have enough occurrences of this term
                    allTermsAllOptionsOffsets.delete(optionId); 
                } else {
                    allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
                }
            }
            if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
        }
    }
    // Per option, see if (and where) the offsets can serve non-overlapping matches for each term
    const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
            // Indicate how many of each term must (still) be matched:
            termMap.forEach( obj => obj.leftOver = obj.count );
            return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
        })
        // Remove options that could not provide non-overlapping offsets
        .filter( ([_, offsets]) => offsets ) 
        // Sort the remaining options in their original order
        .sort( (a,b) => a[0] - b[0] )
        // Replace optionId, by the corresponding text and apply mark-up at the offsets
        .map( ([optionId, offsets]) => {
            let option = options[optionId];
            offsets.map((index, i) => {
                option = option.substr(0, index) 
                    + (i%2 ? "<u>" : "</u>")
                    + option.substr(index);
            });
            return option;            
        });
    //console.log(JSON.stringify(matches));
    return matches;
}

function trincotPreprocess(options) {
    const nikkonen = new Nikkonen();
    // Add all the options (lowercased) to the suffix tree
    options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
    return nikkonen;
}

const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos 
Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];

/*
 * I/O and performance measurements 
 */

let preprocessed;
 
function processInput() {
    if (!preprocessed) { // Only first time
        const t0 = performance.now();
        preprocessed = trincotPreprocess(options);
        const spentTime = performance.now() - t0;
        // Output the time spent on preprocessing
        pretime.textContent = spentTime.toFixed(2);
    }
    var query = this.value.toLowerCase();
    const t0 = performance.now();
    const matches = trincotSuffixTree(query, options, preprocessed, ' ');
    const spentTime = performance.now() - t0;
    // Output the time spent
    time.textContent = spentTime.toFixed(2);
    // Output the matches
    result.innerHTML = '';
    for (var match of matches) {
        // Append it to the result list
        var li = document.createElement('li');
        li.innerHTML = match;
        result.appendChild(li);
    }
}

findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul { 
    height:300px;
    font-size: smaller;
    overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>

<h3>Trincot's Suffix Tree Search</h3>
Preprocessing Time: <span id="pretime"></span>ms (only done once)<br>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>

This method has quite some code behind it, so I suppose it might not show interesting performance for small data sets, while for larger data sets it will be memory consuming: the tree takes much more memory than the original options array.
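
For reference, the intended usage pattern (a condensed sketch of what processInput in the snippet already does): build the tree once while the options are static, then reuse it on every keystroke with a lowercased query.

var tree = trincotPreprocess(options);                               // done once
var matches = trincotSuffixTree('united king', options, tree, ' '); // per keystroke; query must already be lowercased
// matches is an array of highlighted strings, e.g. '<u>United</u> <u>King</u>dom'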

Circlet answered 1/10, 2017 at 14:10 Comment(2)
After a quick look I'm guessing this is superior in situations with a huge (static) option set, and where complexity (and size) of the code are not major concerns... Looking forward to see how it holds up in the performance comparison :)Hogan
Working on the comparison code. I wonder if the preprocessing should only be done when switching data set, as that would be what you would do in reality: as long as the options don't change, you would not redo the preprocessing. Of course when the checkbox is set to measure the preprocessing with every search, the inner loop should still do it. I will make a suggestion for the fiddle -- working on it.Circlet
T
-1

Update 2

Abandoned the concept of shrinking the set due to issues with restoring the working strings in Vue.

Now, the method is simply as follows:

  1. Pre-process the option set so that the display (original case) strings stay in sync with the working (lowercased) strings.
  2. Process the terms.
  3. Reduce (filter) the option set by iterating over it, and looping over terms, short-circuiting when there is a mismatch.
  4. Using the reduced set, iterate over each option and find the match ranges.
  5. Insert HTML strings around each match range.

The code is commented.

Raw JavaScript (logs the filtered/manipulated options array): https://jsfiddle.net/pvLj9uxe/14/

New Vue implementation: https://jsfiddle.net/15prcpxn/30/

The calculation seems reasonably fast - the DOM update is what kills it.
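
One cheap mitigation on the DOM side (a generic sketch, independent of Vue): build the whole list detached from the document and attach it in a single operation, instead of appending option by option.

// Sketch: build all <li> elements in a detached fragment, then insert them in one go.
// 'resultList' is assumed to be the <ul> element that displays the matches.
var fragment = document.createDocumentFragment();
matches.forEach(function (match) {
    var li = document.createElement('li');
    li.innerHTML = match;
    fragment.appendChild(li);
});
resultList.innerHTML = '';
resultList.appendChild(fragment);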

Added to comparison*: https://jsfiddle.net/ektyx133/4/

*Caveat: Pre-processing the options (treated as "static") is part of the strategy, so it has been processed outside of the benchmark.

var separator = /\s|\*|,/;

// this function enhances the raw options array 
function enhanceOptions(options) {
  return options.map(option => ({
    working: option.toLowerCase(), // for use in filtering the set and matching
    display: option // for displaying
  }))
}

// this function changes the input to lower case, splits the input into terms, removes empty strings from the array, and enhances the terms with the size and wiping string
function processInput(input) {
  return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
    value: term.toLowerCase(),
    size: term.length,
    wipe: " ".repeat(term.length)
  })).sort((a, b) => b.size - a.size);
}

// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
  let options = enhancedOptions,
    l = terms.length;

  // filter the options - consider recursion instead
  options = options.filter(option => {
    let i = 0,
      working = option.working,
      term;
    while (i < l) {
      if (!~working.indexOf((term = terms[i]).value)) return false;
      working = working.replace(term.value, term.wipe);
      i++;
    }
    return true;
  })

  // generate the display string array
  let displayOptions = options.map(option => {
    let rangeSet = [],
      working = option.working,
      display = option.display;

    // find the match ranges
    terms.forEach(term => {
      working = working.replace(term.value, (match, offset) => { // duplicate the wipe string replacement from the filter, but grab the offsets
        rangeSet.push({
          start: offset,
          end: offset + term.size
        });
        return term.wipe;
      })
    })

    // sort the match ranges, last to first
    rangeSet.sort((a, b) => b.start - a.start);

    // insert the html tags within the string around each match range
    rangeSet.forEach(range => {
      display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
    })

    return display;

  })

  return displayOptions;
}
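
Putting the three functions above together (a usage sketch consistent with the raw-JavaScript fiddle; the options array is the same country list used in the other snippets):

var enhanced = enhanceOptions(options);                // done once, while the options are static
var terms = processInput('united king');               // lowercased, split, sorted by length
var highlighted = filterAndHighlight(terms, enhanced);
// highlighted is an array of display strings with <u>...</u> around each matched range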

Old Attempt

https://jsfiddle.net/15prcpxn/25/

My attempt, using Vue for rendering (the methods are sequential, so you can probably put it all into one monolithic function without much effort - inputs will be terms, and full option set; outputs will be filtered option set, and highlighted ranges).

  1. Split the input into individual terms
  2. Sort terms by length (longest term first, so that when you have an option such as "abc ab", and the terms "a abc", i.e. a term is a substring of another term, it will be able to match "abc")
  3. Copy/change terms to lower case
  4. Copy options (our "display set") to lower case (our "working set")
  5. For each term, remove working options without matches from the "working set", and in parallel remove display options from the "display set" - when doing so, remove the matched term strings from the surviving working option strings, e.g. matching term "a" in option "abc" yields "bc" [actual implementation is the inverse: For each term, recreate "working set" when there are matches, and in parallel add display options to the "display set", then pass these sets to the next term] - this gives us the filtered display set
  6. Copy the filtered display set to lower case, to give us a new filtered working set
  7. For each working option in the remaining filtered working set, create a range-set, by recording the ranges (i.e. start and end, e.g. matching term "a" in option "abc": start = 0, end = 1) where each term matches by taking the offset (start) of the match and the length of the term/match. Replace the matched string with empty spaces (or other unused character) of equal length to the term and feed this into the next term, e.g. matching term "a" in option "abc" yields " bc" - this preserves the length of the working option, ensuring that the filtered working set (in lower case) matches the filtered display set (original case). The total number of range-sets will be equal to the number of remaining options in the filtered option set.
  8. Also, sort the ranges within each range-set (in descending order, to work in reverse), to allow for string insertion.
  9. For each option in the filtered display set, (working in reverse so as not to disturb the index when manipulating the string), insert the <u> </u> tags around the matched ranges by slicing up the display option, e.g. matching term "a" in option "abc": new option = "<u>" + "a" + "</u>" + "bc"
  10. Render it

Performance is poor when there are many matches / non-useful terms (e.g. when you input a single character). For end-use, I'd probably put in an input-calculation delay.
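
One simple way to add that delay (a generic debounce sketch, not tied to the Vue code; searchInput and runSearch are placeholders): only run the search once the user has stopped typing for a short while.

// Generic debounce: 'fn' only runs once no further calls have arrived for 'ms' milliseconds.
function debounce(fn, ms) {
    var timer;
    return function () {
        var args = arguments, self = this;
        clearTimeout(timer);
        timer = setTimeout(function () { fn.apply(self, args); }, ms);
    };
}

// e.g. searchInput.addEventListener('keyup', debounce(runSearch, 200));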

I should be able to roll up some of these steps into fewer steps which may improve performance. I'll revisit tomorrow.

Vue presumably also takes care of some of the optimisations through virtual DOM etc., so it won't necessarily be reflective of vanilla JavaScript / DOM rendering.

Trilogy answered 28/9, 2017 at 20:29 Comment(9)
From your description (ie. without closely examining your code) this sounds like a port of @trincot's algorithm to Vue, is that correct...? Or are there improvements to the algorithm itself? Also, it fails for e.g. intena mant (listing all options containing maintenant) - I think because you replace with an empty string in step 5. Using a single space fixes it for a simple test...Hogan
I've reviewed trincot's code and the main commonality between the two is the string "wipe-out", but my implementation is very different. trincot uses recursion; I use iteration on an increasingly smaller (shrinking) set of options. trincot builds regexp objects and uses case-insensitive regex matching to find matches and offsets at the same time; I use string.indexOf to compare already-lowercase strings to reduce the set of options, then only find the offsets with the already-reduced set of options, keeping a "display" set and a "working" set in sync.Trilogy
the purpose of trincot's recursion is to feed the wiped string into the next string, but I achieve the same with iteration, so there's not really much difference in that sense. however, our approaches are sort of orthogonal, in a way - I loop through options for each term; trincot loops through terms for each option - my thinking is that there will always be more options than terms, so if we can reduce the set of options, then in theory we conduct less comparisons. additionally, I only determine the offsets with the final, already-reduced set.Trilogy
good catch on the empty string problem - I've replaced it with the full wipe. I really like @Circlet 's concept of "enriching" the terms with the wipe and the string length, so that we don't have to re-determine the string length and wipe in every examination of the term. In my case, I may be able to apply it to enriching the options as well (i.e. storing the lowercase strings side by side with the original strings). lots to consider.Trilogy
Considering the above, an optimisation that is worth considering, assuming that the set of options is "static" then the lowercase options might as well be static as well (in my answer with vue, it is, unless changing the set of options). direct comparisons of algorithms may not be fair then, if we need to pre-process the whole options set with each input change - actually, given an initial pre-processing of the options set in this way, it makes the problem much simpler as only the terms need to be converted to lower case, and there is no need for case-insensitive regex matching.Trilogy
I'll take a closer look at your method soon, but off the top of my head I don't see why there would be any performance gain from looping terms instead of options - all the algorithms must make at least one comparison for each option, and none of them will do any more work with that option if that first match fails... I'm still on my phone, so maybe I'll "get it" when I'm at a computer ;)Hogan
Your algorithm does indeed seem to outperform the others in some cases. I'm enhancing the comparison a bit more and trying to separate the "preparation" for each algorithm to get a more level playing field. Another thing that may impact performance is that some of the algorithms use regex'es and others just indexOf. For my particular use I need support for regex'es as there may be wildcards and so on, but I think for the comparison at least it needs to be consistent across the different methods. I'm also removing most of the dom-output as it is taking much longer than the actual searching.Hogan
@Trilogy Using the latest version a search for an ha does not find Afghanistan.Aubreyaubrie
@Aubreyaubrie Ah, yes, that would be because it wipes the first "an" in "han" before it wipes "ha" - I'll need to revisit this.Trilogy
