JavaScript fuzzy search that makes sense [closed]

I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages).

After doing some reading on Levenshtein distance, I find it a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match.

One obvious flaw, which is fixed in the Damerau-Levenshtein model, is that both blub and boob are considered equally similar to bulb (each requiring two substitutions). It is clear, however, that bulb is more similar to blub than boob is, and the model I just mentioned recognizes that by allowing for transpositions.
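
To make that concrete, here is a minimal, unoptimized sketch of plain Levenshtein alongside the optimal-string-alignment variant that also counts adjacent transpositions. It is just the textbook dynamic-programming recurrence, not any particular library:

// Minimal sketch: plain Levenshtein distance, plus an optional
// optimal-string-alignment extension that also counts adjacent transpositions.
function levenshtein(a, b, allowTransposition = false) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      if (allowTransposition && i > 1 && j > 1 &&
          a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
      }
    }
  }
  return d[a.length][b.length];
}

console.log(levenshtein('bulb', 'blub'));         // 2
console.log(levenshtein('bulb', 'boob'));         // 2 -- tied, which feels wrong
console.log(levenshtein('bulb', 'blub', true));   // 1 -- the transposition breaks the tie
console.log(levenshtein('bulb', 'boob', true));   // 2
console.log(levenshtein('int', 'international')); // 10
console.log(levenshtein('int', 'splint'));        // 3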

I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'], and my query is int, I think international ought to rank more highly than splint, even though the former has a score (higher=worse) of 10 versus the latter's 3.

So what I'm looking for (and will create if it doesn't exist) is a library that does the following (a rough sketch of the idea follows the list):

  • Weights the different text manipulations
  • Weights each manipulation differently depending on where it appears in a word (early manipulations being more costly than late ones)
  • Returns a list of results sorted by relevance
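
For illustration, here is a rough sketch of what such a scoring function might look like. The specific operation costs, the position falloff, and the cheap treatment of untyped trailing characters are made-up values chosen to show the shape of the idea, not tuned ones:

// Rough sketch of the wish-list above; all weights are arbitrary assumptions.
function completionScore(query, candidate, opts = {}) {
  const { insertCost = 1, deleteCost = 1, substituteCost = 1,
          positionFalloff = 0.5, trailingCost = 0.05 } = opts;
  // Edits near the start of a word cost more than edits near the end.
  const positionWeight = (i) => 1 + positionFalloff / (1 + i);
  const m = query.length, n = candidate.length;
  const d = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 1; i <= m; i++) d[i][0] = d[i - 1][0] + deleteCost * positionWeight(i - 1);
  for (let j = 1; j <= n; j++) d[0][j] = d[0][j - 1] + insertCost * positionWeight(j - 1);
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const w = positionWeight(Math.min(i, j) - 1);
      d[i][j] = Math.min(
        d[i - 1][j] + deleteCost * w,      // drop a query character
        d[i][j - 1] + insertCost * w,      // insert a candidate character
        d[i - 1][j - 1] + (query[i - 1] === candidate[j - 1] ? 0 : substituteCost * w) // match / substitute
      );
    }
  }
  // Completion semantics: characters the user simply hasn't typed yet are cheap,
  // so match the query against the best candidate prefix and charge only a small
  // per-character cost for the untyped remainder.
  let best = Infinity;
  for (let j = 0; j <= n; j++) best = Math.min(best, d[m][j] + trailingCost * (n - j));
  return best;
}

// Returns the candidates sorted by relevance (lower score = better).
function rankByCompletion(query, candidates) {
  return candidates
    .map((value) => ({ value, score: completionScore(query, value) }))
    .sort((a, b) => a.score - b.score);
}

console.log(rankByCompletion('int', ['international', 'splint', 'tinder']).map((r) => r.value));
// With these made-up weights, 'international' ranks above 'splint'.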

Has anyone come across anything like this? I realize that Stack Overflow isn't the place to be asking for software recommendations, but implicit (not any more!) in the above is: am I thinking about this the right way?


Edit

I found a good paper (pdf) on the subject. Some notes and excerpts:

Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions

the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters

For the Smith-Waterman distance (wikipedia), "Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." In other words, it is a local-alignment approach: it scores the best-matching segments rather than the whole strings.

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix

(seem to be intended primarily for short strings)

For text completion purposes, the Monge-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monge-Elkan means that the necessity to complete a word (which is simply a sequence of additions) won't disfavor it too heavily.
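
For reference, a minimal Jaro-Winkler sketch with the standard parameters (prefix scale 0.1, at most 4 prefix characters considered) looks roughly like this; treat it as illustrative rather than a vetted implementation:

// Minimal Jaro similarity: count common characters within a sliding window,
// then penalize transpositions among the matched characters.
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  if (!s1.length || !s2.length) return 0;
  const matchWindow = Math.max(Math.floor(Math.max(s1.length, s2.length) / 2) - 1, 0);
  const s1Matches = new Array(s1.length).fill(false);
  const s2Matches = new Array(s2.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s1.length; i++) {
    const lo = Math.max(0, i - matchWindow);
    const hi = Math.min(i + matchWindow + 1, s2.length);
    for (let j = lo; j < hi; j++) {
      if (!s2Matches[j] && s1[i] === s2[j]) {
        s1Matches[i] = s2Matches[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  let k = 0, transpositions = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!s1Matches[i]) continue;
    while (!s2Matches[k]) k++;
    if (s1[i] !== s2[k]) transpositions++;
    k++;
  }
  transpositions /= 2;
  return (matches / s1.length + matches / s2.length + (matches - transpositions) / matches) / 3;
}

// Winkler's extension: boost strings that share a prefix (up to 4 characters).
function jaroWinkler(s1, s2, prefixScale = 0.1) {
  const j = jaro(s1, s2);
  let prefix = 0;
  while (prefix < 4 && prefix < s1.length && prefix < s2.length && s1[prefix] === s2[prefix]) prefix++;
  return j + prefix * prefixScale * (1 - j);
}

console.log(jaroWinkler('international', 'int')); // ~0.82, lifted by the shared prefix
console.log(jaroWinkler('splint', 'int'));        // 0 -- no common characters fall inside the match window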

Conclusion:

the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler.

This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

Guayaquil answered 26/4, 2014 at 0:11 Comment(7)
Great question. I'm looking to do something similar, but with the same string comparison considerations. Did you ever find/build a javascript implementation of your string comparisons? Thanks.Sobel
@Sobel I simply forked fuzzyset.js on github to account for smaller query strings and, although it doesn't account for weighted string manipulations, the results are quite good for my intended application of string completion. See the repoGuayaquil
Thanks. I'll try it. I also found this string compare function: github.com/zdyn/jaro-winkler-js. Seems to work pretty well too.Sobel
Try this one: subtexteditor.github.io/fuzzysearch.jsPolestar
@Polestar That doesn't take typos into account. In the demo, typing krole doesn't return Final Fantasy V: Krile, though I would like it to. It requires all characters in the query to be present in the same order in the result, which is pretty short-sighted. It seems the only way to have good fuzzy search is to have a database of common typos.Guayaquil
github.com/nol13/fuzzball.js might help. It's a Javascript port of the Python fuzzywuzzy (github.com/seatgeek/fuzzywuzzy)Fulminant
A bit late to the party, but I have implemented a library that may fit your needs: github.com/m31coding/fuzzy-search. It's fast, accurate and multilingual. There is also a demo you might want to try. Happy Coding!Oliverolivera

Good question! But my thought is that, rather than trying to modify Damerau-Levenshtein, you might do better to try a different algorithm, or to combine/weight the results from two algorithms.

It strikes me that exact or close matches to the "starting prefix" are something Damerau-Levenshtein gives no particular weight to -- but your apparent user expectations would.

I searched for "better than Levenshtein" and, among other things, found this:

http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/

This mentions a number of "string distance" measures. Three which looked particularly relevant to your requirement would be:

  1. Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.

  2. q-gram distance: Sum of absolute differences between N-gram vectors of both strings.

  3. Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.

Maybe you could use a weighted combination (or minimum) of these metrics, with Levenshtein -- common substring, common N-gram or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?

Depending on the size of your list/ database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB then ran the expensive string-distance measure to sort them in preference order.
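
A rough in-memory sketch of that two-stage approach, using bigrams as the retrieval keys and a Jaccard distance over the same bigrams for the final ordering (the gram size and helper names here are purely illustrative):

// Stage 1 uses n-grams as cheap "retrieval keys"; stage 2 ranks the candidates
// with the more expensive distance metric.
function ngrams(s, n = 2) {
  const grams = new Set();
  const lower = s.toLowerCase();
  for (let i = 0; i <= lower.length - n; i++) grams.add(lower.slice(i, i + n));
  return grams;
}

function jaccardDistance(a, b) {
  const gramsA = ngrams(a), gramsB = ngrams(b);
  let shared = 0;
  for (const g of gramsA) if (gramsB.has(g)) shared++;
  const union = gramsA.size + gramsB.size - shared;
  return union === 0 ? 1 : 1 - shared / union;
}

// Index every entry under each of its n-grams.
function buildIndex(entries, n = 2) {
  const index = new Map();
  entries.forEach((entry, id) => {
    for (const g of ngrams(entry, n)) {
      if (!index.has(g)) index.set(g, new Set());
      index.get(g).add(id);
    }
  });
  return index;
}

// Pull candidates sharing at least one key, then rank them by distance.
function search(query, entries, index, n = 2) {
  const candidateIds = new Set();
  for (const g of ngrams(query, n)) {
    for (const id of index.get(g) ?? []) candidateIds.add(id);
  }
  return [...candidateIds]
    .map((id) => ({ entry: entries[id], distance: jaccardDistance(query, entries[id]) }))
    .sort((a, b) => a.distance - b.distance);
}

const entries = ['international', 'splint', 'tinder'];
console.log(search('int', entries, buildIndex(entries)));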

I wrote some notes on Fuzzy String Search in SQL. See:

Emalia answered 26/4, 2014 at 0:25 Comment(0)

I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one which behaves basically like Sublime's search. https://github.com/farzher/fuzzysort

The only typo it allows is a transpose. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:

fuzzysort.go('int', ['international', 'splint', 'tinder'])
// [{highlighted: '*int*ernational', score: 10}, {highlighted: 'spl*int*', score: 3003}]

Neogene answered 4/9, 2017 at 4:59 Comment(12)
The only problem with this library that I faced is when the word is complete but spelled incorrectly. For example, if the correct word was "XRP" and I searched "XRT", it doesn't give me a scoreDaryldaryle
@Daryldaryle Yup, I don't handle misspellings (because Sublime's search doesn't). I'm kind of looking into this now that people are complaining. You can provide me with example use cases where this search fails as a GitHub issueNeogene
For those of you who are wondering about this lib, it now has spell check implemented too! I recommend this lib over fusejs and othersDaryldaryle
@PirateApp: How did you get misspellings to work? I tested it with fuzzy.single("XRT","XRP"), and it returns null.Hepatitis
@Hepatitis you have to code it yourself. checkout this thread, it has a code sample github.com/farzher/fuzzysort/issues/19Neogene
@StephenBugsKamenar Okay, thank you for the clarification. From PirateApp's comment, it sounded like spell check was available in the library, but I guess it's not.Hepatitis
I have "automation" for example, and search for "utomation", it doesn't find itFarman
The fuzzysort library is no longer supported, as far as I can tell.Singlephase
If someone wants a good corpus of words that are hard to spell correctly get a list of drug names.Chemarin
great library. but it is not matching Administrative Assistant with AdministrativeMorisco
I think Gmail should use your search algorithm - their search is terrible...Diclinous
Looks like the latest version of fuzzysort removed the transpose feature that matched strings with typosTanning

This is my short and compact function for fuzzy matching:

function escapeRegExp(str: string) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function fuzzyMatch(pattern: string, str: string) {
  pattern = '.*' + pattern.split('').map(l => `${escapeRegExp(l)}.*`).join('');
  const re = new RegExp(pattern);
  return re.test(str);
}

Thanks to jt3k, I've updated the function to escape regex special characters.

The escapeRegExp helper is as suggested in the MDN docs.

Kaufmann answered 25/3, 2019 at 13:8 Comment(8)
Although not what you want in most cases probably, it exactly was for me.Comedian
Can you make it ignore the order? fuzzyMatch('c a', 'a b c') should return trueDaradarach
One improvement here is that the first two lines should be taken out of the function, since RegExp parsing takes considerable time. I am assuming repeated calls of this method with lots of strings (i.e. many str values) for one pattern.Cadence
Does not escape regex. If someone searched for "(" or something this would mess up. Submitting an edit now!Brindle
@Brindle Code edits are somewhat likely to be rejected. If yours doesn't make it through, please submit an answer of your own, perhaps with credit to this answer (you can even abstain from rep gain by making your answer "community wiki" though I don't suppose it's called for here).Cytolysin
fixed one const fuzzyMatch = (pattern, str) => new RegExp('.*' + pattern.split('').map(letter => '\\' + letter + '.*').join('')).test(str);Flybynight
it worked for my use case .. Thanks for thisConfound
Thank you @jt3k, your fixed help me avoid the regex crashing when I added ")" into the search querySand

Here is a technique I have used a few times... It gives pretty good results. It does not do everything you asked for, though. Also, this can be expensive if the list is massive.

get_bigrams = (string) ->
    s = string.toLowerCase()
    v = new Array(s.length - 1)
    for i in [0..v.length] by 1
        v[i] = s.slice(i, i + 2)
    return v

string_similarity = (str1, str2) ->
    if str1.length > 0 and str2.length > 0
        pairs1 = get_bigrams(str1)
        pairs2 = get_bigrams(str2)
        union = pairs1.length + pairs2.length
        hit_count = 0
        for x in pairs1
            for y in pairs2
                if x is y
                    hit_count++
        if hit_count > 0
            return ((2.0 * hit_count) / union)
    return 0.0

Pass two strings to string_similarity, which will return a number between 0 and 1.0 depending on how similar they are. The usage example below uses Lo-Dash.

Usage Example....

query = 'jenny Jackson'
names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']

results = []
for name in names
    relevance = string_similarity(query, name)
    obj = {name: name, relevance: relevance}
    results.push(obj)

results = _.first(_.sortBy(results, 'relevance').reverse(), 10)

console.log results

Also....have a fiddle

Make sure your console is open or you won't see anything :)

Olympiaolympiad answered 26/4, 2014 at 1:8 Comment(4)
Thanks, that is exactly what I was looking for. It would only be better if it was plain js ;)Armyn
function get_bigrams(string){ var s = string.toLowerCase() var v = s.split(''); for(var i=0; i<v.length; i++){ v[i] = s.slice(i, i + 2); } return v; } function string_similarity(str1, str2){ if(str1.length>0 && str2.length>0){ var pairs1 = get_bigrams(str1); var pairs2 = get_bigrams(str2); var union = pairs1.length + pairs2.length; var hits = 0; for(var x=0; x<pairs1.length; x++){ for(var y=0; y<pairs2.length; y++){ if(pairs1[x]==pairs2[y]) hit_count++; }} if(hits>0) return ((2.0 * hits) / union); } return 0.0 }Brottman
How to use this in objects in which you will want to search for in several keys?Farman
This has a few problems: 1) It underweights the characters at the beginning and end of the string. 2) The bigram comparisons are O(n^2). 3) The similarity score can be over 1 because of the implementation. This obviously makes no sense. I fix all these problems in my answer below.Nakisha

I fixed the problems with the CoffeeScript bigram solution by InternalFX and made it a generic n-gram solution (you can customize the size of the grams).

This is TypeScript but you can remove the type annotations and it works fine as vanilla JavaScript as well.

/**
 * Compares the similarity between two strings using an n-gram comparison method. 
 * The grams default to length 2.
 * @param str1 The first string to compare.
 * @param str2 The second string to compare.
 * @param gramSize The size of the grams. Defaults to length 2.
 */
function stringSimilarity(str1: string, str2: string, gramSize: number = 2) {
  function getNGrams(s: string, len: number) {
    s = ' '.repeat(len - 1) + s.toLowerCase() + ' '.repeat(len - 1);
    let v = new Array(s.length - len + 1);
    for (let i = 0; i < v.length; i++) {
      v[i] = s.slice(i, i + len);
    }
    return v;
  }

  if (!str1?.length || !str2?.length) { return 0.0; }

  //Order the strings by length so the order they're passed in doesn't matter 
  //and so the smaller string's ngrams are always the ones in the set
  let s1 = str1.length < str2.length ? str1 : str2;
  let s2 = str1.length < str2.length ? str2 : str1;

  let pairs1 = getNGrams(s1, gramSize);
  let pairs2 = getNGrams(s2, gramSize);
  let set = new Set<string>(pairs1);

  let total = pairs2.length;
  let hits = 0;
  for (let item of pairs2) {
    if (set.delete(item)) {
      hits++;
    }
  }
  return hits / total;
}

Examples:

console.log(stringSimilarity("Dog", "Dog"))
console.log(stringSimilarity("WolfmanJackIsDaBomb", "WolfmanJackIsDaBest"))
console.log(stringSimilarity("DateCreated", "CreatedDate"))
console.log(stringSimilarity("a", "b"))
console.log(stringSimilarity("CreateDt", "DateCreted"))
console.log(stringSimilarity("Phyllis", "PyllisX"))
console.log(stringSimilarity("Phyllis", "Pylhlis"))
console.log(stringSimilarity("cat", "cut"))
console.log(stringSimilarity("cat", "Cnut"))
console.log(stringSimilarity("cc", "Cccccccccccccccccccccccccccccccc"))
console.log(stringSimilarity("ab", "ababababababababababababababab"))
console.log(stringSimilarity("a whole long thing", "a"))
console.log(stringSimilarity("a", "a whole long thing"))
console.log(stringSimilarity("", "a non empty string"))
console.log(stringSimilarity(null, "a non empty string"))

Try it in the TypeScript Playground

Nakisha answered 5/6, 2020 at 13:42 Comment(3)
This is a pretty cool solution. I tried using fuse.js but it was being awkward with relevance. The problem I was having was that "Butter" and "Caster Sugar" would return extremely close scores to the search "Unsalted Butter" but yours seems to be smarter and return a more realistic score difference between them. Thanks for contributing!Attenborough
Gist here: gist.github.com/mavaddat/a522c2ed59162d6330569999cab03d76Tourney
This is the answer.Defecate

You may take a look at Atom's fuzzaldrin lib: https://github.com/atom/fuzzaldrin/

It is available on npm, has a simple API, and worked OK for me.

> fuzzaldrin.filter(['international', 'splint', 'tinder'], 'int');
< ["international", "splint"]
Seduce answered 22/3, 2015 at 17:32 Comment(1)
I also had success with Atom's library, which has a simple API and is lightning fast =). github.com/cliffordfajardo/catoSapper

I've been in love with fuzzy matching for ages, and just ran across this thread. The conversation here is a lot further into the weeds than most, and looks to have involved implementers. I've coded several of these algorithms in different languages down the years, and want to pass along a few tips to anyone writing JS versions:

Monge-Elkan rules!

It's just fantastic, combining many of the strengths of n-grams with the best short-string comparison algorithms, such as Jaro-Winkler. (That's what I use in my Monge-Elkan code.) A couple of years back, I ran across a paper you can find on-line as a PDF named Generalized Mongue-Elkan Method for Approximate Text String Comparison. The take-away is that rather than using an arithmetic mean, use a quadratic mean. I tried it out, and it made a significant improvement in search results, across a wide variety of text.
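
A bare-bones sketch of the Monge-Elkan scheme with the generalized (power) mean: the inner similarity below is a simple normalized edit-distance stand-in rather than Jaro-Winkler, and all parameter choices are illustrative only:

// Monge-Elkan: for each token of a, take its best match in b, then average.
// m = 1 gives the classic arithmetic mean; m = 2 the quadratic mean
// recommended by the generalized Mongue-Elkan paper.
function mongeElkan(a, b, innerSim, m = 2) {
  const tokensA = a.toLowerCase().split(/\s+/).filter(Boolean);
  const tokensB = b.toLowerCase().split(/\s+/).filter(Boolean);
  if (!tokensA.length || !tokensB.length) return 0;
  let sum = 0;
  for (const ta of tokensA) {
    let best = 0;
    for (const tb of tokensB) best = Math.max(best, innerSim(ta, tb));
    sum += Math.pow(best, m);
  }
  return Math.pow(sum / tokensA.length, 1 / m);
}

// Stand-in inner similarity: 1 minus normalized Levenshtein distance.
function normalizedEditSim(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                         d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1));
    }
  }
  return 1 - d[a.length][b.length] / Math.max(a.length, b.length, 1);
}

console.log(mongeElkan('paul johnson', 'johson paule', normalizedEditSim));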

N-Grams Rule!

Very robust, high-quality performance across a range of source languages and text types. If you're looking at databases, it is possible to implement this as a high-quality, lightning-fast, indexed K-NN search in Postgres. It takes lining up a few different features properly, but it's not too bad.

In any case, when splitting n-grams, there are different approaches to handling front-end padding. Like, if you've got a traditional n (q or k) of 3, then do you split 'ander' like this

'  a'
' an'
'and'
'nde'
'der'
'er '
'r  '

or

'  a'
' an'
'and'
'nde'
'der'

or

'and'
'nde'
'der'

Instinctively, I've always expected the first list to work best but, in practice, it can be the second or third. It's worth experimenting with the padding and windowing rules, and see how they perform in your context. Few libraries provide control over this behavior, which would be a nice feature to support. Hint.
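
As a sketch of what such a knob might look like (the option names are placeholders, not from any existing library):

// Sketch of an n-gram splitter with a configurable padding rule.
// padding: 'full'  -> pad both ends with n-1 spaces (first list above)
//          'front' -> pad only the front (second list)
//          'none'  -> no padding (third list)
function splitNGrams(word, n = 3, padding = 'full') {
  const pad = ' '.repeat(n - 1);
  let s = word;
  if (padding === 'full') s = pad + word + pad;
  else if (padding === 'front') s = pad + word;
  const grams = [];
  for (let i = 0; i + n <= s.length; i++) grams.push(s.slice(i, i + n));
  return grams;
}

console.log(splitNGrams('ander', 3, 'full'));   // ['  a', ' an', 'and', 'nde', 'der', 'er ', 'r  ']
console.log(splitNGrams('ander', 3, 'front'));  // ['  a', ' an', 'and', 'nde', 'der']
console.log(splitNGrams('ander', 3, 'none'));   // ['and', 'nde', 'der']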

Howdoyoudo answered 18/4, 2021 at 1:52 Comment(0)

Here is the solution provided by @InternalFX, but in JS (I used it, so I'm sharing it):

function get_bigrams(string){
  var s = string.toLowerCase()
  var v = s.split('');
  for(var i=0; i<v.length; i++){ v[i] = s.slice(i, i + 2); }
  return v;
}

function string_similarity(str1, str2){
  if(str1.length>0 && str2.length>0){
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hits = 0;
    for(var x=0; x<pairs1.length; x++){
      for(var y=0; y<pairs2.length; y++){
        if(pairs1[x]==pairs2[y]) hits++;
    }}
    if(hits>0) return ((2.0 * hits) / union);
  }
  return 0.0
}
Brottman answered 28/2, 2020 at 12:32 Comment(0)

November 2019 update: I found Fuse to have some pretty decent upgrades. However, I could not get it to use boolean operators (i.e. OR, AND, etc.), nor could I use the API's search interface to filter results.

I discovered nextapps-de/flexsearch: https://github.com/nextapps-de/flexsearch and I believe it by far surpasses a lot of the other JavaScript search libraries I've tried, and it supports boolean operators, filtered searches, and pagination.

You can input a list of javascript objects for your search data (i.e. storage), and the API is fairly well documented: https://github.com/nextapps-de/flexsearch#api-overview

So far I've indexed close to 10,000 records, and my searches are next to immediate, i.e. an unnoticeable amount of time for each search.

Overweigh answered 7/11, 2019 at 5:48 Comment(1)
This project is bloated (> 100kb) and has a large number of unattended issues & PRs. I wouldn't use it for those two reasons.Daradarach

(function (int) {
    $("input[id=input]")
        .on("input", {
        sort: int
    }, function (e) {
        $.each(e.data.sort, function (index, value) {
          if ( value.indexOf($(e.target).val()) != -1 
              && value.charAt(0) === $(e.target).val().charAt(0) 
              && $(e.target).val().length === 3 ) {
                $("output[for=input]").val(value);
          };
          return false
        });
        return false
    });
}(["international", "splint", "tinder"]))

jsfiddle http://jsfiddle.net/guest271314/QP7z5/

Boote answered 26/4, 2014 at 4:3 Comment(0)

This could be achieved by using a regex.

Example:

  const fuzzySearch = (list, searchValue) => {
    let buf = ".*" + searchValue.replace(/(.)/g, "$1.*").toLowerCase();
    var reg = new RegExp(buf);
    let newList = list.filter(function (e) {
      return reg.test(e.title.toLowerCase());
    });
    return newList;
  };

Working example: https://codesandbox.io/s/jovial-fermat-cilh1?file=/src/App.js:28894-29167

Deathday answered 13/8, 2021 at 16:25 Comment(2)
as mentioned earlier, your example breaks if you enter a parenthesis :)Flybynight
This really doesn't do what's being asked.Lacking

Fuzzysort is a JavaScript library that is helpful for performing fuzzy string matching against a large collection of data.

The following code will be helpful for using fuzzysort in React.js.

  1. Install fuzzysort through npm:

    npm install fuzzysort
    
  2. Make a reference variable:

    const fuzzysort = require('fuzzysort')
    
  3. Use the go() method to find matched strings:

    search(keyword, category) {  
      return fuzzysort.go(keyword, data[category]);
    }
    

Full demo code in React.js:

import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');

class App extends React.Component {
  constructor(props){
    super(props)
    this.state = {
      keyword: '',
      results: [],
    }
    console.log("data: ", data["steam_games"]);
  }

  search(keyword, category) {  
    return fuzzysort.go(keyword, data[category]);
  }

  render(){
    return (
      <div className="App">
        <input type="text" onChange={(e)=> this.setState({keyword: e.target.value})}
          value={this.state.keyword}
        />
        <button onClick={()=>this.setState({results: this.search(this.state.keyword, "steam_games")})}>Search</button>
        {this.state.results !== null && this.state.results.length > 0 ?
          <h3>Results:</h3> : null
        }
        <ul>
        {this.state.results.map((item, index) =>{
            return(
              <li key={index}>{item.score} : {item.target}</li>
            )
          })
        }
        </ul>
      </div>
    );
  }
}

export default App;

For more, refer to FuzzySort.

Gilead answered 20/11, 2020 at 11:0 Comment(2)
That's just an exact copy of the original library: github.com/farzher/fuzzysortDeidradeidre
You didn't check my repo. Here I used the fuzzysort package in React. There is no default solution available for integrating fuzzysort in React.Gilead
