Approximate String Matching using LSH
Asked Answered
F

1

27

I would like to approximately match Strings using Locality sensitive hashing. I have many Strings>10M that may contain typos. For every String I would like to make a comparison with all the other strings and select those with an edit distance according to some threshold.

That is, the naive solution requires O(n^2) comparisons. In order to avoid that issue I was thinking of using Locality Sensitive Hashing. Then near similar strings would result to the same buckets and I need to do only inside bucket search. So it is O(n*C) where C is the bucket size.

However, I do not understand how to represent the strings. If it was text I would represented in vector space. My main question is if this is tractable using LSH and then an appropriate vector representation of the string.

Am I able to use an already implemented library for this task? or it depends on my problem so I must implement it myself? Is there any python package that does this?

Falls answered 4/8, 2014 at 8:17 Comment(3)
I don't understand the question. You say you know what to do "if it was text", but you have no idea what to do with strings? What's the relevant difference between "text" and "strings" in your head that's making this problem unsolvable?Delano
I can represent a text using the vector space model. In the case of a string, I do not have a vector of appearances but a vector of ASCII codes that represent something different.Falls
I see the problem. The version of LSH I'm familiar with (an early version) assumes a Euclidian metric. With strings, you use a string metric... which won't work. You need a LSH for non-Euclidian space.Wojcik
R
34

The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.

Very briefly, the idea is to take several strings, vectorize those strings, then pass a sliding window over the resulting vectors. If two vectors have the same value in the same window position, mark them as candidates for more fine-grained similarity analysis.

There's a great implementation in the Python datasketch library (pip install datasketch). Here's an example that shows you can catch fuzzy string similarity:

from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
  'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
  'weights controls the relative importance between minizing false positive',
  'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]

# Create an MinHashLSH index optimized for Jaccard threshold 0.5,
# that accepts MinHash objects with 128 permutations functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)

# Create MinHash objects
minhashes = {}
for c, i in enumerate(data):
  minhash = MinHash(num_perm=128)
  for d in ngrams(i, 3):
    minhash.update("".join(d).encode('utf-8'))
  lsh.insert(c, minhash)
  minhashes[c] = minhash

for i in xrange(len(minhashes.keys())):
  result = lsh.query(minhashes[i])
  print "Candidates with Jaccard similarity > 0.4 for input", i, ":", result

This returns:

Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]
Rubenstein answered 22/1, 2017 at 15:52 Comment(6)
With a .5 threshold only the input is returned. To get the correct result needs .42 or lower threshold (python 3.5 and current version of datasketch)Mil
@Rubenstein does this code work on python 3? When I use it, I get the output Candidates with Jaccard similarity > 0.5 for input 0 : [0] and similarly 1 : [1] and so onArticulation
@Articulation I just tested on 3.6.1 (Anaconda distribution) and got the output above. Maybe try to update your datasketch and nltk libraries? pip install datasketch -U && pip install nltk -URubenstein
@Rubenstein I tried that as well. I am using IDLE it still shows me the same When I drop the threshold to 0.4 it starts to give the output that you've shownArticulation
@Rubenstein would it be possible for you to please have a look at my question here and mention if your answer here can solve my issue? It'll be of great helpArticulation
I try it and got the same result as you the threshold is 0.4Cracow

© 2022 - 2024 — McMap. All rights reserved.