How to stop . being treated as a separator in SQLite FTS4

I want to be able to search for numbers like 2.3 using FTS4 in SQLite, but the . is being treated as a token boundary. Short of writing a full bespoke tokenizer, is there any other way of excluding the . from the list of token-boundary characters?

Being able to search for decimal numbers seems like a common use case, but I can't find anything relevant on SO / Google. My best solution at present is to replace all . characters in the text with a known (long) string of letters and substitute accordingly on each search...
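
For illustration, a sketch of that placeholder workaround (the table name and the XXDOTXX placeholder are arbitrary stand-ins):

-- Index text with '.' swapped for a placeholder the tokenizer keeps intact:
CREATE VIRTUAL TABLE docs USING fts4(content);
INSERT INTO docs(content) VALUES (replace('released version 2.3 today', '.', 'XXDOTXX'));
-- Apply the same substitution to the search text:
SELECT * FROM docs WHERE content MATCH replace('2.3', '.', 'XXDOTXX');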

Peter

Undertone asked 18/4, 2013 at 10:19 Comment(0)

The tokenizer defines what tokens are, so you would indeed need to write your own.

You could search for the phrase "2 3", which would find the numbers 2 and 3 with any separator.
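
For example (table name is illustrative):

-- The phrase query matches the tokens 2 and 3 adjacent and in order,
-- whatever separator appeared between them in the original text:
SELECT * FROM docs WHERE docs MATCH '"2 3"';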

Puffin answered 18/4, 2013 at 16:44 Comment(1)
Hi CL - I reckon you're right. I was hoping that someone might have solved this before, but I guess it's uncommon to search for numbers in text. I solved for my purposes by regexing (\d+)\.(\d+) to $1XXUNIQUEPHRASEXX$2 prior to tokenizing, and doing the same to my search text. Works fine, but feels uuuuugggggllly. Doing it properly would actually be very hard due to localisation (, vs . in Europe, etc.). Thanks for the reply - I appreciate it. – Undertone

Actually you don't need to write your own tokenizer.

The 'simple' tokenizer allows the delimiters to be customized, but it's an undocumented feature.

Create your FTS table as follows, and SQLite will tokenize on ' ' (space) and '#' (hash) characters only:

CREATE VIRTUAL TABLE documents USING fts4(title, content, tokenize=simple '' '# ');
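
With that delimiter set, '.' stays inside tokens, so a decimal number can be matched directly, assuming the customization behaves as described (sample data is illustrative):

INSERT INTO documents(title, content) VALUES ('changelog', 'released version 2.3');
-- '2.3' was indexed as a single token:
SELECT * FROM documents WHERE documents MATCH '2.3';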

There was a discussion about this feature on the SQLite mailing list in 2012, and the feature is implemented in the source code of SQLite's simple tokenizer.

Quote from the mailing list regarding its lack of documentation:

"Likely the reason is that we forgot that this feature even exists. It seems to have existed in the simple tokenizer, unchanged, since the original introduction of FTS1 back in 2006."

Quote regarding whether it's safe to use the feature:

"But it has been in the code for so long now that we dare not change it for fear of breaking long-established programs."

...however it was also mentioned that the feature is not likely to have been tested thoroughly.

I have just emailed the SQLite users mailing list, asking if this feature can be documented.

Update: one thing I'd add is that we found that if characters such as '*' or '-' are configured as delimiters in this way, they are also stripped out of the FTS MATCH clause in queries. So while this works fine for '.' and the like, it can't be used for characters that also have special meaning in FTS MATCH syntax.

Immensity answered 16/9, 2014 at 15:30 Comment(1)
Thanks v much for this answer. Original code is done & deployed, but I'll try your suggestion next time :-) – Undertone

You don't need to write your own tokenizer: some built-in tokenizers let you specify, at table creation time, which characters should or should not be treated as word characters. For example, the unicode61 tokenizer (see the FTS3 docs; you may have to enable support for it at build time) allows you to pass tokenchars and separators parameters (poke around in the source).

In your case, I believe you could use something like this:

CREATE VIRTUAL TABLE whatever USING fts4(
    columns,
    tokenize=unicode61 "tokenchars=."
);
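
With tokenchars=. in effect, a query for a decimal number should match directly (sample data is illustrative):

INSERT INTO whatever(columns) VALUES ('released version 2.3');
-- '.' is a token character now, so '2.3' is a single token in both
-- the index and the query:
SELECT * FROM whatever WHERE columns MATCH '2.3';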
Brasier answered 23/8, 2013 at 19:21 Comment(0)