SQLite: Efficient substring search in large table

Asked 4/7, 2012 at 19:58 Answered 18/5, 2023 at 19:36

I'm developing an Android application that has to perform substring search in a large table (about 500'000 entries with street and location names, so just a few words per entry).

CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)

Note that only 20% of all entries contain strings in the "name" column.

Performing the following query almost takes 2 minutes:

SELECT elementID, name FROM Elements WHERE name LIKE %foo%

I now tried to use FTS3 in order to speed up the query. That was quite successful, query time decreased to 1 minute (surprisingly the database file size increased by only 5%, which is also quite good for my purpose).

The problem is, FTS3 seemingly doesn't support substring search, i.e. if I want to find "bar" in "foo bar" and "foobar", I only get "foo bar", although I need both results.

So actually I have two questions:

Is it possible to further speed up the query? My goal is 30 seconds for the query, but I don't know if that's realistic...
How can I get real substring search using FTS3?

Duckweed answered 4/7, 2012 at 19:58 Comment(3)

It takes a lot of shredding to get sub-word indexed searching... – Rangy 4/7, 2012 at 20:5

Perhaps SQLite/FST isn't the best approach in this specific case .. it seems like a [read-only] Suffix Tree might be more suitable. Although the trick is finding one in an existing suitable library/tooling ;-) – Rangy 4/7, 2012 at 20:13

@pst, Suffix Trees sound pretty cool, but unfortunately the SQLite approach is crucial for the main features of my application. Fast string search would have been a "nice to have", though. ;) – Duckweed 4/7, 2012 at 20:32

Solution 1: If you can make every character in your database as an individual word, you can use phrase queries to search the substring.

For example, assume "my_table" contains a single column "person":

person
------
John Doe
Jane Doe

you can change it to

person
------
J o h n D o e
J a n e D o e

To search the substring "ohn", use phrase query:

SELECT * FROM my_table WHERE person MATCH '"o h n"'

Beware that "JohnD" will match "John Doe", which may not be desired. To fix it, change the space character in the original string into something else.

For example, you can replace the space character with "$":

person
------
J o h n $ D o e
J a n e $ D o e

Solution 2: Following the idea of solution 1, you can make every character as an individual word with a custom tokenizer and use phrase queries to query substrings.

The advantage over solution 1 is that you don't have to add spaces in your data, which can unnecessarily increase the size of database.

The disadvantage is that you have to implement the custom tokenizer. Fortunately, I have one ready for you. The code is in C, so you have to figure out how to integrate it with your Java code.

Scission answered 2/4, 2013 at 2:9 Comment(9)

Thanks for the idea; that sounds promising. Adding all those spaces could blow up the size of my database (which is not what I want), but I'll give it a try as soon as I have time. – Duckweed 4/4, 2013 at 9:29

If size is your concern, check solution 2. – Scission 7/4, 2013 at 4:41

I've tested your first solution now. As expected, the database size has almost doubled, but query time is in an acceptable range ("regular" queries are still faster though - but of course they also don't provide all the results I need). I didn't have time to include your second solution in my project yet, but I tested the example you provided on GitHub, and it seems really promising as it keeps database size constant. I guess it's the best tradeoff between speed and quality of results, so I'll mark your answer as correct. – Duckweed 7/4, 2013 at 13:49

Once the database has been created, do you need the tokenizer for queries or can you make MATCH queries anywhere with fts3? – Farrier 30/8, 2013 at 22:57

You have to register the tokenizer when you open the database. Then, you can make queries. The tokenizer is required for sqlite to understand the database's content. – Scission 3/9, 2013 at 7:2

I like solution 1. To add to that even further, maybe as a solution 3, I think you won't need solution 2. Simply utilize SQLite VIEWs and TRIGGERs along with “External Content Tables”—and your DB size should also remain constant. – Maziar 1/5, 2019 at 19:15

More specifically, create a VIEW that spits out the characters of the original data, as space delimited. Then use that created VIEW as the external content table of the FTS. – Maziar 1/5, 2019 at 19:21

How can we split a string by empty character (''), I tried with replace('test', '', ' ') but can't do @JasonSparc – Melonymelos 28/7, 2021 at 6:24

@Melonymelos You can create a VIEW that splits the strings using recursive CTEs. For examples, see, stackoverflow.com/a/34663866 – Maziar 28/7, 2021 at 7:48

You should add an index to the name column on your database, that should speed up the query considerably.

I believe SQLite3 supports sub-string matching like so:

SELECT * FROM Elements WHERE name MATCH '*foo*';

http://www.sqlite.org/fts3.html#section_3

Lambdoid answered 4/7, 2012 at 20:48 Comment(8)

I just tried out your suggestions in the Android emulator. Substring matching in FTS3 seems to work as you suggested, but the query takes an extremely long time (I killed the app manually after 5 minutes). Unfortunately an index on the "name" column doesn't seem to work, the query time remains the same. – Duckweed 4/7, 2012 at 21:45

If you have data in your FTS3 table that you don't need to run a full-text search on, you might consider removing it. I haven't done this myself, but you could try running an optimize command on the table and see if that speeds things up: sqlite.org/fts3.html#optimize – Lambdoid 5/7, 2012 at 18:24

Also tested with optimize now, performance is a little better, but not much. Probably I will have to completely rethink the search feature of my app... I'll mark your answer as correct as your tips might help others with similar problems. – Duckweed 6/7, 2012 at 13:29

Thanks and best of luck with your issue. – Lambdoid 6/7, 2012 at 19:41

Did you face the problem that the MATCH function does not match suffixes so MATCH '*foo*' is actually MATCH 'foo*' meaning it will search for 'foobar' but not 'barfoo' ? – Heliopolis 1/8, 2012 at 4:26

@lukya try removing the asterisk characters to make it a simple token query, rather than a prefix query: SELECT * FROM Elements WHERE name MATCH 'foo'; – Lambdoid 1/8, 2012 at 6:26

@Lambdoid token query does not match suffixes.. is there any other solution?.. i've added a question #11753340 – Heliopolis 1/8, 2012 at 7:57

@lukya I'm sorry, I don't know of any other solution. – Lambdoid 1/8, 2012 at 8:8

I am facing some thing similar to your problem. Here is my suggestion try creating a translation table that will translate all the words to numbers. Then search numbers instead of words.

Please let me know if this is helping.

Batman answered 3/4, 2013 at 21:18 Comment(4)

That's an interesting idea, however I don't see how this could speed up the search. Note that an entry in my "name" column can consist of more than only one word, so there's the problem of storing multiple numbers in one entry. Moreover, substring search is impossible with number representations for each word. – Duckweed 4/4, 2013 at 9:23

@Duckweed how about making a new column for each word? And punting all the sub worlds possibilities. – Batman 4/4, 2013 at 9:27

Sorry for the delay in reply, I've been busy lately. With multiple columns I'd get quite a bit of overhead, because the maximum number of words per entry isn't predictable. Apart from that, I can't imagine this could speed up the search significantly (every single column would have to be searched independently). – Duckweed 7/4, 2013 at 13:46

@Duckweed Actually I did it and it's magic. It work Fast. This what I did: I created words table with word_id. And I created subStrings table which is all the words(converted to words ids) as single string(separated by underscore so 1,10 and 11,0 will be different sub strings) all what I do now is converting my String request to subString and search it without MATCH. If I need to search for part of the word I do it in my words table. – Batman 7/4, 2013 at 14:47

SQLITE now supports trigram indexes (starting with v3.34.0) which should help speed up substring queries:

Substrings consisting of fewer than 3 unicode characters do not match any rows when used with a full-text query. If a LIKE or GLOB pattern does not contain at least one sequence of non-wildcard unicode characters, FTS5 falls back to a linear scan of the entire table.

CREATE VIRTUAL TABLE [media_fts] USING FTS5 (
    [path],
    tokenize='trigram',
    content=[media]
);

Then you can do:

SELECT * FROM Elements WHERE name MATCH 'foo';
-- or
SELECT * FROM media_fts WHERE path MATCH '0000bq6dwok91'

select count(*) from tmp
JOIN media_fts ON path MATCH 'http redd ' || entry

https://www.sqlite.org/fts5.html#trigramidx

Despondent answered 18/5, 2023 at 19:36 Comment(0)

-1

not sure about speeding it up since you're using sqllite, but for substring searches, I have done things like

SET @foo_bar = 'foo bar'
SELECT * FROM table WHERE name LIKE '%' + REPLACE(@foo_bar, ' ', '%') + '%'

of course this only returns records that have the word "foo" before the word "bar".

Ashaashamed answered 4/7, 2012 at 20:6 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags