Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)
Asked Answered
G

2

12

Let's say we have 1 million of rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

Background:

I know how to do this with Sqlite:

  • Find a row with a single-word query, up to a few spelling mistakes with the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):

    db.enable_load_extension(True)
    db.load_extension('./spellfix')
    c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall()
    
    #Query: 'Riehmand'
    #Answer: [(1, u'Riemann')]
    

    With 1M rows, this would be super slow! As detailed here, postgresql might have an optimization with this using trigrams. A fast solution, available with Sqlite, is to use a VIRTUAL TABLE USING spellfix:

    c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
    c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
    c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall()
    
    #Query: 'Riehmand'
    #Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!
    
  • Find an expression with a query matching one or multiple words with FTS ("Full Text Search"):

    c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
    c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
    c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall()
    
    #Query: 'NUMBERS carmichael'
    #Answer: [(2, u'All the Carmichael numbers')]
    

    It is case insensitive and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. But the drawback is that each of the query-keyword must be correctly spelled, i.e. FTS alone doesn't allow spelling mistakes.

Question:

How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes? i.e. "FTS + spellfix" together

Example:

  • row in the DB: "All the Carmichael numbers"
  • query: "NUMMBER carmickaeel" should match it!

How to do this with Sqlite?

It is probably possible with Sqlite since this page states:

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

Linked question: String similarity with Python + Sqlite (Levenshtein distance / edit distance)

Galingale answered 14/10, 2018 at 13:12 Comment(5)
Why not use MATCH with a spellfix virtual table instead of that editdist thing? It'll be a lot faster. (Pretty near instant on tables with a few hundred thousand rows in my experience)Korenblat
It doesn't really answer your actual question (I think the answer to that is that there isn't a practical way if the aux tables described in the documentation doesn't do what you want). It's just strange to see an example of using spellfix to do approximate matching without actually using spellfix. The spellfix documentation you linked to has examples of the usual usage.Korenblat
@Korenblat I edited the question (see second part of first bullet point) to show an example of what you probably spoke about (using VIRTUAL TABLE USING spellfix1). The page about spellfix1 states Or, it could be used with FTS4 to do full-text search using potentially misspelled words. but I can't find a way to make it work. Could you post an example?Galingale
I think the intent is that you add the FTS corpus to a spellfix table, and for each word you want to look up in the FTS corpus, you first match it against the spellfix table and use the first result in the FTS query. This doesn't seem very practical, though, especially when you're searching for more than just a single word.Korenblat
Maybe @Shawn, but I have no idea about how to do it... I'll create a bounty when possible because this case is really important in applications. Being able to find the row "All the Carmichael numbers" with a query "NUMMBER carmickaeel" (i.e. with both 1) not all the words 2) spelling mistakes) is the Holy Graal in the field of searching text in a database. So having a ready-to-use Sqlite code example for this would be super interesting, and would allow a Google-like user experience for search.Galingale
B
14

The spellfix1 documentation actually tells you how to do this. From the Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:

INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

The SELECT term from search_aux WHERE col='*' statement extracts all the indexed tokens.

Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create a fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';

You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with double entries:

INSERT INTO mytable3(word)
    SELECT term FROM mytable2_terms
    WHERE col='*' AND 
        term not in (SELECT word from mytable3_vocab);

Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query againsts mytable2.

Depending on your neads, this may mean you need to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map to existing tokens, and then those tokens fed to the fts4 query.

Ignoring SQL syntax to handle this, using Python to do the splitting is easy enough:

def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    base_spellfix = """
        SELECT :term{0} as term, word FROM spellfix1data
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    correction_map = dict(cursor)
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()

This then returns [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel").

Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.

A complete example, packaging up the above approach in a class, which simply extract terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices):

import re
import sqlite3
import sys

class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3 / 4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4

        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }

and an interactive demo using the class:

>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax 
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}
Besot answered 17/10, 2018 at 17:42 Comment(2)
@Basj: single calls to execute() easily beat multiple calls to execute(), so yes, a UNION query is going to be faster. The SELECT :termx, word ... query leaves room for misses, so words that can't be spell corrected. Otherwise nummbers auoaaixao carrmichal would result in an empty result between numbers and carmichael and you'd have no way to determine what input mapped to what output. With the original un-corrected term included, you can now trivially map those to corrections and leave incorrigable terms alone.Besot
@Basj: I ran a test with 1000 randomly generated 'words' against the database generated here, and a function that uses a UNION-based query executes in 2.6 milliseconds per test (100 repeats), while a function that loops and executes a separate query per term executes in 2.9 milliseconds (same number of repetitions). It's a limited test, and the limited spellfix1 table generated here produced just a single result for those 100 random terms, but it's a good indicator why you'd want to use UNION here.Besot
G
3

The accepted answer is good (full credit to him), here is a slight variation that, although not as complete as the accepted one for complex cases, is helpful to grasp the idea:

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')
c = db.cursor()
c.execute("CREATE VIRTUAL TABLE mytable2 USING fts4(description text)")
c.execute("CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)")
c.execute("CREATE VIRTUAL TABLE mytable3 USING spellfix1")
c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")   # populate the table
c.execute("INSERT INTO mytable2 VALUES ('They are great')")
c.execute("INSERT INTO mytable2 VALUES ('Here some other numbers')")
c.execute("INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*'")

def search(query):
    # Correcting each query term with spellfix table
    correctedquery = []
    for t in query.split():
        spellfix_query = "SELECT word FROM mytable3 WHERE word MATCH ? and top=1"
        c.execute(spellfix_query, (t,))
        r = c.fetchone()
        correctedquery.append(r[0] if r is not None else t)  # correct the word if any match in the spellfix table; if no match, keep the word spelled as it is (then the search will give no result!)

    correctedquery = ' '.join(correctedquery)

    # Now do the FTS
    fts_query = 'SELECT * FROM mytable2 WHERE description MATCH ?'
    c.execute(fts_query, (correctedquery,))
    return {'result': c.fetchall(), 'correctedquery': correctedquery, 'query': query}

print(search('NUMBBERS carmickaeel'))
print(search('some HERE'))
print(search('some qsdhiuhsd'))

Here is the result:

{'query': 'NUMBBERS carmickaeel', 'correctedquery': u'numbers carmichael', 'result': [(u'All the Carmichael numbers',)]}
{'query': 'some HERE', 'correctedquery': u'some here', 'result': [(u'Here some other numbers',)]}
{'query': 'some qsdhiuhsd', 'correctedquery': u'some qsdhiuhsd', 'result': []}

Remark: It can be noted that the "Correcting each query term with spellfix table" part is done with one SQL query per term. The performance of this versus one single UNION SQL query is studied here.

Galingale answered 23/10, 2018 at 8:32 Comment(1)
Where does the link to performance study point to? Can u please provide a correct link? The existing link in the "here" points to this page itself!Januarius

© 2022 - 2024 — McMap. All rights reserved.