I would like to approximately match Strings using Locality sensitive hashing. I have many Strings>10M that may contain typos. For every String I would like to make a comparison with all the other strings and select those with an edit distance according to some threshold.
That is, the naive solution requires O(n^2) comparisons. In order to avoid that issue I was thinking of using Locality Sensitive Hashing. Then near similar strings would result to the same buckets and I need to do only inside bucket search. So it is O(n*C) where C is the bucket size.
However, I do not understand how to represent the strings. If it was text I would represented in vector space. My main question is if this is tractable using LSH and then an appropriate vector representation of the string.
Am I able to use an already implemented library for this task? or it depends on my problem so I must implement it myself? Is there any python package that does this?