Is there a string similarity measure available in Python + SQLite, for example via the sqlite3 module?
Example use case:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
This query should match the row with ID 1, but not the row with ID 2:
c.execute("SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6")
How can I do this with SQLite + Python?
Notes about what I've found so far:
The Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, can be useful, but I'm not sure whether an official implementation exists in SQLite (I've seen a few custom implementations, like this one)
The Damerau-Levenshtein distance is the same, except it also allows transpositions of two adjacent characters; it is sometimes also referred to as the edit distance
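For what it's worth, both distances are easy to compute in Python with third-party packages (this assumes python-Levenshtein and jellyfish are installed; neither is part of the standard library):

# Assumes: pip install python-Levenshtein jellyfish
import Levenshtein
import jellyfish

print(Levenshtein.distance('jumps', 'jmups'))                    # 2: a transposition costs two single-character edits
print(jellyfish.damerau_levenshtein_distance('jumps', 'jmups'))  # 1: a transposition counts as one edit

But that alone doesn't make them callable from inside an SQL query.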
I know it's possible to define such a function myself (see the sketch below), but implementing one of these distances efficiently is non-trivial (doing natural-language similarity comparison efficiently inside a database is really non-trivial), which is why I wanted to see whether Python / SQLite already features such a tool
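For the record, here is roughly what a hand-rolled solution looks like: a minimal sketch that registers a naive Damerau-Levenshtein (optimal string alignment) implementation as an SQL function through sqlite3's create_function. The function name dist and the implementation itself are my own assumptions, and the pure-Python inner loop will be slow on large tables:

import sqlite3

def damerau_levenshtein(a, b):
    # Naive optimal-string-alignment distance: Levenshtein plus
    # transpositions of two adjacent characters, in O(len(a) * len(b)).
    if a is None or b is None:
        return None  # propagate SQL NULL
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

conn = sqlite3.connect(':memory:')
conn.create_function('dist', 2, damerau_levenshtein)  # expose it to SQL as dist()
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
c.execute("SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6")
print(c.fetchall())  # [(1, 'hello world, guys')]

A function registered this way is called once per row, so SQLite cannot use any index to narrow the scan, which is exactly the efficiency problem mentioned above.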
SQLite has full-text search features (FTS3, FTS4, FTS5):
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);  /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                     /* Ordinary table */

SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
but I can't find anything about string comparison with such a "similarity distance": FTS's MATCH and NEAR features don't seem to offer a similarity measure that tolerates letter changes, etc. Moreover, this answer shows that:
SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it impractical for my use case, sadly.
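For illustration, here is a quick way to reproduce that behavior (a minimal sketch; it assumes FTS5 and its porter tokenizer are compiled into your SQLite build, which is the case for most recent Python distributions):

import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
# FTS5 virtual table using the porter stemming tokenizer
c.execute("CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='porter')")
c.execute("INSERT INTO docs VALUES ('The quick brown fox jumps over the lazy dog')")
# Stemming lets "jumping" match "jumps"...
print(c.execute("SELECT * FROM docs WHERE docs MATCH 'jumping'").fetchall())  # 1 row
# ...but the typo "jmups" matches nothing
print(c.execute("SELECT * FROM docs WHERE docs MATCH 'jmups'").fetchall())    # []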