SQLite full-text search relevance ranking
Asked Answered
S

4

19

I am using the fts4 extension of sqlite3 to enable full-text indexing and searching of text data. This it working great, but I've noticed that the results are not relevance-ranked at all. I guess I am too used to Lucene. I've seen some brief suggestions to write a custom rank method using the matchinfo() results, but it's not clear to me how this is done, or whether there are any sophisticated examples out there. How have others dealt with this?

Subastral answered 1/9, 2011 at 14:44 Comment(0)
P
10

There's a complete example in the documentation, look at the end of appendix a. You'll need to do slightly more work to get a good relevance ranking as the function provided is good only for getting started. For example, with matchinfo(table,'pcnalx') there's enough information to implement Okapi BM25.

Policlinic answered 5/1, 2012 at 6:3 Comment(6)
Are there any public implementations of Okapi BM25 from matchinfo?Memoried
I don't know of any. I've implemented it myself, but the code isn't public.Policlinic
In case anyone runs into this, there are some tutorials on BM25/BM25F - irthoughts.wordpress.com/2011/08/03/…Memoried
It is, but it's not really the quality I'd want on something public. It's not that difficult though, basically a 1:1 translation of the wikipedia equation, which looks worse than it is. I did find flooring the summand at zero was best for my case, YMMV.Policlinic
Did you implement it as a function inside SQLite?Memoried
No, its a user defined function (using sqlite3_create_function).Policlinic
L
8

There seems to be a distinct lack of documentation on how to implement Okapi BM25 in C and it seems it is an unspoken thing that the implementation is left as an exercise for the user.

Well I found the bro of a programmer "Radford 'rads' Smith" who chucked this up on GitHub

https://github.com/rads/sqlite-okapi-bm25

It only implements BM25 although I'm looking into BM25F tweaks now....

....and here it is.

https://github.com/neozenith/sqlite-okapi-bm25

Leeward answered 26/5, 2014 at 22:48 Comment(2)
make gives a gcc: error: unrecognized command line option ‘-bundle’ on gcc 4.8.4 . I realize this answer is very old, and your last commit was over 2 years ago, but could you tell me which version of the gcc compiler you used to build the c file?Anuradhapura
configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 7.0.2 (clang-700.1.81) Target: x86_64-apple-darwin14.5.0 Thread model: posix This is on an OSX dev environment sorry. I haven't tested other build chains.Leeward
C
4

For FTS5, according to SQLite FTS5 Extension,

  • FTS5 has no matchinfo().
  • FTS5 supports ORDER BY rank

So very simply, something like

SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;

without DESC works.

Cramfull answered 23/8, 2020 at 13:42 Comment(1)
Some further details ... "All FTS5 tables feature a special hidden column named "rank". If the current query is not a full-text query (i.e. if it does not include a MATCH operator), the value of the "rank" column is always NULL. Otherwise, in a full-text query, column rank contains by default the same value as would be returned by executing the bm25() auxiliary function with no trailing arguments."Othella
S
2

Here is an implementation of Okapi BM25. Using this in combination with the suggestions at SQLite.org will help you generate a relevance-ranked MATCH query. This was written all in VB.Net and the query was called using System.Data.SQLite functions. The custom SQLiteFunction at the end can be called from the SQL code without issue, as long as the SQL code is called using System.Data.SQLite functions.

    Public Class MatchInfo
        Property matchablePhrases As Integer
        Property userDefinedColumns As Integer
        Property totalDocuments As Integer
        Private _int32HitData As List(Of Integer)
        Private _longestSubsequencePhraseMatches As New List(Of Integer)
        Private _tokensInDocument As New List(Of Integer)
        Private _averageTokensInDocument As New List(Of Integer)

        Private _max_hits_this_row As Integer?
        Public ReadOnly Property max_hits_this_row As Integer
            Get
                If _max_hits_this_row Is Nothing Then
                    _max_hits_this_row = 0
                    For p = 0 To matchablePhrases - 1
                        For c = 0 To userDefinedColumns - 1
                            Dim myHitsThisRow As Integer = hits_this_row(p, c)
                            If myHitsThisRow > _max_hits_this_row Then
                                _max_hits_this_row = myHitsThisRow
                            End If
                        Next
                    Next
                End If

                Return _max_hits_this_row
            End Get
        End Property

        Private _max_hits_all_rows As Integer?
        Public ReadOnly Property max_hits_all_rows As Integer
            Get
                If _max_hits_all_rows Is Nothing Then
                    _max_hits_all_rows = 0
                    For p = 0 To matchablePhrases - 1
                        For c = 0 To userDefinedColumns - 1
                            Dim myHitsAllRows As Integer = hits_all_rows(p, c)
                            If myHitsAllRows > _max_hits_all_rows Then
                                _max_hits_all_rows = myHitsAllRows
                            End If
                        Next
                    Next
                End If

                Return _max_hits_all_rows
            End Get
        End Property

        Private _max_docs_with_hits As Integer?
        Public ReadOnly Property max_docs_with_hits As Integer
            Get
                If _max_docs_with_hits Is Nothing Then
                    _max_docs_with_hits = 0
                    For p = 0 To matchablePhrases - 1
                        For c = 0 To userDefinedColumns - 1
                            Dim myDocsWithHits As Integer = docs_with_hits(p, c)
                            If myDocsWithHits > _max_docs_with_hits Then
                                _max_docs_with_hits = myDocsWithHits
                            End If
                        Next
                    Next
                End If

                Return _max_docs_with_hits
            End Get
        End Property

        Private _BM25Rank As Double?
        Public ReadOnly Property BM25Rank As Double
            Get
                If _BM25Rank Is Nothing Then
                    _BM25Rank = 0
                    'calculate BM25 Rank
                    'http://en.wikipedia.org/wiki/Okapi_BM25

                    'k1, calibrates the document term frequency scaling. Having k1 as 0 corresponds to a binary model – no term frequency. Increasing k1 will give rare words more boost.
                    'b, calibrates the scaling by document length, and can take values from 0 to 1, where having 0 means no length normalization and having 1 corresponds to fully scaling the term weight by the document length.

                    Dim k1 As Double = 1.2
                    Dim b As Double = 0.75

                    For column = 0 To userDefinedColumns - 1
                        For phrase = 0 To matchablePhrases - 1
                            Dim IDF As Double = Math.Log((totalDocuments - hits_all_rows(phrase, column) + 0.5) / (hits_all_rows(phrase, column) + 0.5))
                            Dim score As Double = (IDF * ((hits_this_row(phrase, column) * (k1 + 1)) / (hits_this_row(phrase, column) + k1 * (1 - b + b * _tokensInDocument(column) / _averageTokensInDocument(column)))))
                            If score < 0 Then
                                score = 0
                            End If
                            _BM25Rank += score
                        Next
                    Next

                End If

                Return _BM25Rank
            End Get
        End Property

        Public Sub New(raw_pcnalsx_MatchInfo As Byte())
            Dim int32_pcsx_MatchInfo As New List(Of Integer)
            For i = 0 To raw_pcnalsx_MatchInfo.Length - 1 Step 4
                int32_pcsx_MatchInfo.Add(BitConverter.ToUInt32(raw_pcnalsx_MatchInfo, i))
            Next

            'take the raw data and parse it out
            Me.matchablePhrases = int32_pcsx_MatchInfo(0)
            int32_pcsx_MatchInfo.RemoveAt(0)

            Me.userDefinedColumns = int32_pcsx_MatchInfo(0)
            int32_pcsx_MatchInfo.RemoveAt(0)

            Me.totalDocuments = int32_pcsx_MatchInfo(0)
            int32_pcsx_MatchInfo.RemoveAt(0)

            'remember that the columns are 0-based
            For i = 0 To userDefinedColumns - 1
                _averageTokensInDocument.Add(int32_pcsx_MatchInfo(0))
                int32_pcsx_MatchInfo.RemoveAt(0)
            Next

            For i = 0 To userDefinedColumns - 1
                _tokensInDocument.Add(int32_pcsx_MatchInfo(0))
                int32_pcsx_MatchInfo.RemoveAt(0)
            Next

            For i = 0 To userDefinedColumns - 1
                _longestSubsequencePhraseMatches.Add(int32_pcsx_MatchInfo(0))
                int32_pcsx_MatchInfo.RemoveAt(0)
            Next

            _int32HitData = New List(Of Integer)(int32_pcsx_MatchInfo)

        End Sub

        Public Function hits_this_row(phrase As Integer, column As Integer) As Integer
            Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 0)
        End Function

        Public Function hits_all_rows(phrase As Integer, column As Integer) As Integer
            Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 1)
        End Function

        Public Function docs_with_hits(phrase As Integer, column As Integer) As Integer
            Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 2)
        End Function
    End Class

    <SQLiteFunction("Rank", 1, FunctionType.Scalar)>
    Public Class Rank
        Inherits SQLiteFunction

        Public Overrides Function Invoke(args() As Object) As Object
            Return New MatchInfo(args(0)).BM25Rank
        End Function

    End Class
Setose answered 2/8, 2013 at 21:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.