String similarity with Python + Sqlite (Levenshtein distance / edit distance)
Asked Answered
K

2

8

Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module?

Example of use case:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

This query should match the row with ID 1, but not the row with ID 2:

c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

How to do this in Sqlite+Python?

Notes about what I've found so far:

  • The Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, can be useful, but I'm not sure if an official implementation exists in Sqlite (I've seen a few custom implementations, like this one)

  • The Damerau-Levenshtein is the same, except it also allows transposition between 2 adjacent characters; it is also called the Edit distance

  • I know it's possible to define a function myself, but implementing such a distance will be non-trivial (doing natural language processing comparison super efficiently for databases is really non-trivial), that's why I wanted to see if Python / Sqlite already features such a tool

  • Sqlite has FTS (Full Text Seach) features: FTS3, FTS4, FTS5

    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
    CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
    SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
    SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
    

    but I don't find about string comparison with such a "similarity distance", FTS's features MATCH or NEAR don't seem to have similarity measure with letters changes, etc.

  • Moreover this answer shows that:

    SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
    A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

    The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.

Kizzee answered 11/4, 2018 at 15:41 Comment(0)
K
10

Here is a ready-to-use example test.py:

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wrold guy") < 600')
print c.fetchall()
# Output: [(1, u'hello world, guys')]

Important note: The distance editdist3 is normalized so that

the value of 100 is used for insertion and deletion and 150 is used for substitution


Here is what to do first on Windows:

  1. Download https://sqlite.org/2016/sqlite-src-3110100.zip, https://sqlite.org/2016/sqlite-amalgamation-3110100.zip and unzip them

  2. Replace C:\Python27\DLLs\sqlite3.dll by the new sqlite3.dll from here. If skipping this, you'd get a sqlite3.OperationalError: The specified procedure could not be found later

  3. Run:

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"  
    

    or

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64
    cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
    python test.py
    

    (With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgation-3230100/ -o spellfix.dll)

Here is how to do it on Linux Debian:

(based on this answer)

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

Here is how to do it on Linux Debian with an older Python version:

If your distribution's Python is a bit old, it will require another method. As sqlite3 module is built-in in Python, it seems not straightforward to upgrade it (pip install --upgrade pysqlite would only upgrade the pysqlite module, not the underlying SQLite library). Thus this method works for example if import sqlite3; print sqlite3.sqlite_version is 3.8.2:

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]
Kizzee answered 13/4, 2018 at 10:58 Comment(12)
Thanks @JacquesGaudin, I fixed the link, and included your MinGW version.Kizzee
Always a pleasure, I learnt a bit of msvc command line on the way!Abreaction
Very nice write up :-). Note that, with a little effort, you can modify the weights to differ from 100 and 150.Emblazonry
Oh that can be interesting @bodo. Have you tried other weights? (Initially I thought about using weight 1 (or 100) for everything like in the usual definition of Levenshtein distance, but then I thought: there's maybe a reason why 100 and 150 are used as default...) ? I'm curious about this, since I'm coding a "Are you sure you text you entered is not already in our database?" feature [a bit similar to SO's "Questions that may already have your answer" when you ask a question], thus I need to compare the user input text with the texts already present in database.Kizzee
i tried: G:\downloads>C:\HashiCorp\Vagrant\embedded\mingw64\bin\gcc.exe -g -fPIC -shared G:\downloads\sqlite-src-3110100\sqlite-src-3110100/ext/misc/spellfix.c -I G:\downloads\sqlite-src-3110100\sqlite-src-3110100/src/ -o spellfix.so . and got error. the comand G:\downloads is not foundEpilogue
@Epilogue this is folder name problem, probably use with quotes, etc. Except this, I have no idea about what it could beKizzee
@Kizzee C:\HashiCorp\Vagrant\embedded\mingw64\bin>x86_64-w64-mingw32-gcc-7.3.0.exe -g -fPIC -shared "G:\downloads\sqlite-src-3110100\sqlite-src-3110100/ext/misc/spellfix.c" -I "G:\downloads\sqlite-src-3110100\sqlite-src-3110100/src/" -o spellfix.so In file included from G:\downloads\sqlite-src-3110100\sqlite-src-3110100/ext/misc/spellfix.c:17:0: G:\downloads\sqlite-src-3110100\sqlite-src-3110100/src/sqlite3ext.h:20:10: fatal error: sqlite3.h: No such file or directory #include "sqlite3.h"Epilogue
no i have no error. do i have now a new dll? i hope i could now use the sqlite3.dll with this levenshein function. but file date is 26.11. but today is 28.11. hmmEpilogue
now i have no error. now i replaced the sqlite3.dll 1,78MB from 25.11.2018 with sqlite3.dll 742kb. first i surprized that i works. works as before. and second i surprized of this error: no such function: editdist3Epilogue
It can be hundreds of reasons depending on your local configuration. Please post a new question on stackoverflow. All I can say is that the process detailed here works perfectly on several machines I tried it on.Kizzee
for what download you sqlite.org/2016/sqlite-src-3110100.zip ? It seems to me it is not used in the further instructions for windows?Epilogue
Please reread the Windows instructions, it explicitely uses sqlite-src-3110100/ext/misc/spellfix.c. Please post a new question for your issues, the comments here are not well adapted to do chat / support.Kizzee
C
5

I implemented distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) as a SQLite run-time loadable extension. Any UTF-8 strings are supported.

https://github.com/schiffma/distlib

Colmar answered 14/7, 2021 at 11:49 Comment(2)
Welcome on StackOverflow, and big up for this great-looking Github repo. For this to be a useful answer here (link-only answers are not accepted), can you maybe add more context, with a sample reproducible code (in a few lines of Python) showing how to actually use it?Kizzee
Agree with Basj comment. That's an excellent resource and it works great! You should add some python code and explanations in your answer to show how to use and install it. Your Github with a lot of .cpp can be intimidating for some Python programmers...Gide

© 2022 - 2024 — McMap. All rights reserved.