How to use full-text search in sqlite3 database in django?
Asked Answered
S

2

9

I am working on a django application with sqlite3 database, that has a fixed database content. By fixed I mean the content of the db won't change over time. The model is something like this:

class QScript(models.Model):
    ch_no = models.IntegerField()
    v_no = models.IntegerField()
    v = models.TextField()

There are around 6500 records in the table. Given a text that may have some words missing, or some words misspelled, I need to determine its ch_no and v_no. For example, if there is a v field in db with text "This is an example verse", a given text like "This an egsample verse" should give me the ch_no and v_no from db. This can be done using Full text search I believe.

My queries are:

  1. can full-text search do this? My guess from what I have studied, it can, as said in sqlite3 page: full-text searches is "what Google, Yahoo, and Bing do with documents placed on the World Wide Web". Cited in SO, I read this article too, along with many others, but didn't find anything that closely matches my requirements.

  2. How to use FTS in django models? I read this but it didn't help. It seems too outdated. Read here that: "...requires direct manipulation of the database to add the full-text index". Searching gives mostly MySQL related info, but I need to do it in sqlite3. So how to do that direct manipulation in sqlite3?


Edit:

Is my choice of sticking to sqlite3 correct? Or should I use something different (like haystack+elasticsearch as said by Alex Morozov)? My db will not grow any larger, and I have studied that for small sized db, sqlite is almost always better (my situation matches the fourth in sqlite's when to use checklist).

Slacker answered 26/1, 2016 at 18:7 Comment(0)
Y
4

I think that while sqlite is an amazing piece of software, its full-text search capabilities are quite limited. Instead you could index your database using Haystack Django app with some backend like Elasticsearch. Having this setup (and still being able to access your sqlite database) seems to me the most robust and flexible way in terms of FTS.

Elasticsearch has a fuzzy search based on the Levenshtein distance (in a nutshell, it would handle your "egsample" queries). So all you need is to make a right type of query:

from haystack.forms import SearchForm
from haystack.generic_views import SearchView
from haystack import indexes


class QScriptIndex(indexes.SearchIndex, indexes.Indexable):
    v = indexes.CharField(document=True)

    def get_model(self):
        return QScript


class QScriptSearchForm(SearchForm):
    text_fuzzy = forms.CharField(required=False)    

    def search(self):        
        sqs = super(QScriptSearchForm, self).search()

        if not self.is_valid():
            return self.no_query_found()

        text_fuzzy = self.cleaned_data.get('text_fuzzy')
        if text_fuzzy:
            sqs = sqs.filter(text__fuzzy=text_fuzzy)

        return sqs


class QScriptSearchView(SearchView):        
    form_class = QScriptSearchForm

Update: As long as PostgreSQL has the Levenshtein distance function, you could also leverage it as the Haystack backend as well as a standalone search engine. If you choose the second way, you'll have to implement a custom query expression, which is relatively easy if you're using a recent version of Django.

Yellowish answered 26/1, 2016 at 19:23 Comment(11)
This doesn't answer the question. It's just an advertisement for a different FTS solution, without any justification about what makes it better than SQLite for the OP or anyone else.Ongun
I have taken some quick looks into haystack. Do you know if haystack can accomplish what I have asked for in the question (the search part)?Lapland
Just one more thing. As I said in the question, my db will never grow and will contain just 1mb of unicode texts. Should I go for haystack+elasticsearch or should I switch to mysql and use its FTS, as FTS of sqlite3 doesn't have fuzzy search? Thanks in advance! Then I will finally mark yours as accepted ;-)Lapland
Sure, you can switch, but I'd recommend PostgreSQL over Mysql, as long as it has the built-in Levenshtein function, and you'll only need to make a thin wrapper around it. Check out the updated question.Yellowish
Actually what I meant is whether I should use any dbms that provides fuzzy search (may be mysql or postgresql, since sqlite3 doesn't have that), OR should I use hastack+elasticsearch, regarding the fact that my db will not grow any larger. I am expecting your suggestion as to which would be better option, regarding my scenario.Lapland
Please remember to tag me, or else I don't get a notification. For your last comment this is what happened, though your response was prompt.Lapland
@Shadowfax, sorry, I was pretty sure the post author was getting all the comments. More than that, until another person joins a conversation in comments, I'm unable to tag you, your name just doesn't show up.Yellowish
FYI, I am the post author :) And yes, it did come to my notification this time. Please read the comment before the one about notification.Lapland
Sure, in fact I was speaking about post authors in general, but my English might have set me back ). The database growth isn't really a factor here. What matters is the flexibility you're expecting. Elasticsearch is designed specifically for search, so it has neat options, like custom stemmers, typos handling and so on. If your search engine is going to be a sophisticated one, I'd go with Elastic (and Haystack is just a convenient wrapper here). And if plain Levenshtein distance is enough for you - rdbms will do just fine. A story gets long, drop a line to [email protected], if interested.Yellowish
The link is old. New link: docs.djangoproject.com/en/2.2/ref/models/expressions/…Misguidance
This is a poor answer, which does not answer the question. SQLite has fantastic full-text support especially with FTS5 - it supports tokenizers, snippets/highlights, bm25(), 'contentless' tables (which shrink down virtual tables significantly as there isn't a need to 'maintain' a separate index), and so on. To anyone reading this, do not resort to Haystack or ElasticSearch immediately. Not only could they be overkill solutions, but SQLite's FTS has covered significant grounds over the past years. Explore Django+SQLite further.Denham
O
5

SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.

A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

In short, the SQLite FTS extension is fairly basic, and isn't meant to compete with, say, Google.

As for Django integration, I don't believe there is any. You will likely need to use Django's interface for raw SQL queries, for both creating and querying the FTS table.

Ongun answered 26/1, 2016 at 23:3 Comment(2)
So in a nutshell, 1) sqlite's FTS can't accomplish what I need. 2) even if I enable FTS in sqlite3 for my computer, I can't use django's ORM (like QScript.objects.filter(v__search='egsample')) to use FTS features. Am I right to conclude these? If yes, what do you suggest I do? Thanks.Lapland
I write to confirm that you can use FTS with Django. But you will need to use raw queries, or execute custom SQL directly. I personally gravitate towards the latter approach. With it you can use django.db's connections and routers to 'switch' between different SQLite databases. This approach allows me to implement complex queries that interface between different SQLite databases, and is very useful.Denham
Y
4

I think that while sqlite is an amazing piece of software, its full-text search capabilities are quite limited. Instead you could index your database using Haystack Django app with some backend like Elasticsearch. Having this setup (and still being able to access your sqlite database) seems to me the most robust and flexible way in terms of FTS.

Elasticsearch has a fuzzy search based on the Levenshtein distance (in a nutshell, it would handle your "egsample" queries). So all you need is to make a right type of query:

from haystack.forms import SearchForm
from haystack.generic_views import SearchView
from haystack import indexes


class QScriptIndex(indexes.SearchIndex, indexes.Indexable):
    v = indexes.CharField(document=True)

    def get_model(self):
        return QScript


class QScriptSearchForm(SearchForm):
    text_fuzzy = forms.CharField(required=False)    

    def search(self):        
        sqs = super(QScriptSearchForm, self).search()

        if not self.is_valid():
            return self.no_query_found()

        text_fuzzy = self.cleaned_data.get('text_fuzzy')
        if text_fuzzy:
            sqs = sqs.filter(text__fuzzy=text_fuzzy)

        return sqs


class QScriptSearchView(SearchView):        
    form_class = QScriptSearchForm

Update: As long as PostgreSQL has the Levenshtein distance function, you could also leverage it as the Haystack backend as well as a standalone search engine. If you choose the second way, you'll have to implement a custom query expression, which is relatively easy if you're using a recent version of Django.

Yellowish answered 26/1, 2016 at 19:23 Comment(11)
This doesn't answer the question. It's just an advertisement for a different FTS solution, without any justification about what makes it better than SQLite for the OP or anyone else.Ongun
I have taken some quick looks into haystack. Do you know if haystack can accomplish what I have asked for in the question (the search part)?Lapland
Just one more thing. As I said in the question, my db will never grow and will contain just 1mb of unicode texts. Should I go for haystack+elasticsearch or should I switch to mysql and use its FTS, as FTS of sqlite3 doesn't have fuzzy search? Thanks in advance! Then I will finally mark yours as accepted ;-)Lapland
Sure, you can switch, but I'd recommend PostgreSQL over Mysql, as long as it has the built-in Levenshtein function, and you'll only need to make a thin wrapper around it. Check out the updated question.Yellowish
Actually what I meant is whether I should use any dbms that provides fuzzy search (may be mysql or postgresql, since sqlite3 doesn't have that), OR should I use hastack+elasticsearch, regarding the fact that my db will not grow any larger. I am expecting your suggestion as to which would be better option, regarding my scenario.Lapland
Please remember to tag me, or else I don't get a notification. For your last comment this is what happened, though your response was prompt.Lapland
@Shadowfax, sorry, I was pretty sure the post author was getting all the comments. More than that, until another person joins a conversation in comments, I'm unable to tag you, your name just doesn't show up.Yellowish
FYI, I am the post author :) And yes, it did come to my notification this time. Please read the comment before the one about notification.Lapland
Sure, in fact I was speaking about post authors in general, but my English might have set me back ). The database growth isn't really a factor here. What matters is the flexibility you're expecting. Elasticsearch is designed specifically for search, so it has neat options, like custom stemmers, typos handling and so on. If your search engine is going to be a sophisticated one, I'd go with Elastic (and Haystack is just a convenient wrapper here). And if plain Levenshtein distance is enough for you - rdbms will do just fine. A story gets long, drop a line to [email protected], if interested.Yellowish
The link is old. New link: docs.djangoproject.com/en/2.2/ref/models/expressions/…Misguidance
This is a poor answer, which does not answer the question. SQLite has fantastic full-text support especially with FTS5 - it supports tokenizers, snippets/highlights, bm25(), 'contentless' tables (which shrink down virtual tables significantly as there isn't a need to 'maintain' a separate index), and so on. To anyone reading this, do not resort to Haystack or ElasticSearch immediately. Not only could they be overkill solutions, but SQLite's FTS has covered significant grounds over the past years. Explore Django+SQLite further.Denham

© 2022 - 2024 — McMap. All rights reserved.