SQLite: Efficient substring search in large table
Asked Answered
D

5

11

I'm developing an Android application that has to perform substring search in a large table (about 500'000 entries with street and location names, so just a few words per entry).

CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)

Note that only 20% of all entries contain strings in the "name" column.

Performing the following query almost takes 2 minutes:

SELECT elementID, name FROM Elements WHERE name LIKE %foo%

I now tried to use FTS3 in order to speed up the query. That was quite successful, query time decreased to 1 minute (surprisingly the database file size increased by only 5%, which is also quite good for my purpose).

The problem is, FTS3 seemingly doesn't support substring search, i.e. if I want to find "bar" in "foo bar" and "foobar", I only get "foo bar", although I need both results.

So actually I have two questions:

  1. Is it possible to further speed up the query? My goal is 30 seconds for the query, but I don't know if that's realistic...

  2. How can I get real substring search using FTS3?

Duckweed answered 4/7, 2012 at 19:58 Comment(3)
It takes a lot of shredding to get sub-word indexed searching...Rangy
Perhaps SQLite/FST isn't the best approach in this specific case .. it seems like a [read-only] Suffix Tree might be more suitable. Although the trick is finding one in an existing suitable library/tooling ;-)Rangy
@pst, Suffix Trees sound pretty cool, but unfortunately the SQLite approach is crucial for the main features of my application. Fast string search would have been a "nice to have", though. ;)Duckweed
S
11

Solution 1: If you can make every character in your database as an individual word, you can use phrase queries to search the substring.

For example, assume "my_table" contains a single column "person":

person
------
John Doe
Jane Doe

you can change it to

person
------
J o h n D o e
J a n e D o e

To search the substring "ohn", use phrase query:

SELECT * FROM my_table WHERE person MATCH '"o h n"'

Beware that "JohnD" will match "John Doe", which may not be desired. To fix it, change the space character in the original string into something else.

For example, you can replace the space character with "$":

person
------
J o h n $ D o e
J a n e $ D o e

Solution 2: Following the idea of solution 1, you can make every character as an individual word with a custom tokenizer and use phrase queries to query substrings.

The advantage over solution 1 is that you don't have to add spaces in your data, which can unnecessarily increase the size of database.

The disadvantage is that you have to implement the custom tokenizer. Fortunately, I have one ready for you. The code is in C, so you have to figure out how to integrate it with your Java code.

Scission answered 2/4, 2013 at 2:9 Comment(9)
Thanks for the idea; that sounds promising. Adding all those spaces could blow up the size of my database (which is not what I want), but I'll give it a try as soon as I have time.Duckweed
If size is your concern, check solution 2.Scission
I've tested your first solution now. As expected, the database size has almost doubled, but query time is in an acceptable range ("regular" queries are still faster though - but of course they also don't provide all the results I need). I didn't have time to include your second solution in my project yet, but I tested the example you provided on GitHub, and it seems really promising as it keeps database size constant. I guess it's the best tradeoff between speed and quality of results, so I'll mark your answer as correct.Duckweed
Once the database has been created, do you need the tokenizer for queries or can you make MATCH queries anywhere with fts3?Farrier
You have to register the tokenizer when you open the database. Then, you can make queries. The tokenizer is required for sqlite to understand the database's content.Scission
I like solution 1. To add to that even further, maybe as a solution 3, I think you won't need solution 2. Simply utilize SQLite VIEWs and TRIGGERs along with “External Content Tables”—and your DB size should also remain constant.Maziar
More specifically, create a VIEW that spits out the characters of the original data, as space delimited. Then use that created VIEW as the external content table of the FTS.Maziar
How can we split a string by empty character (''), I tried with replace('test', '', ' ') but can't do @JasonSparcMelonymelos
@Melonymelos You can create a VIEW that splits the strings using recursive CTEs. For examples, see, stackoverflow.com/a/34663866Maziar
L
4

You should add an index to the name column on your database, that should speed up the query considerably.

I believe SQLite3 supports sub-string matching like so:

SELECT * FROM Elements WHERE name MATCH '*foo*';

http://www.sqlite.org/fts3.html#section_3

Lambdoid answered 4/7, 2012 at 20:48 Comment(8)
I just tried out your suggestions in the Android emulator. Substring matching in FTS3 seems to work as you suggested, but the query takes an extremely long time (I killed the app manually after 5 minutes). Unfortunately an index on the "name" column doesn't seem to work, the query time remains the same.Duckweed
If you have data in your FTS3 table that you don't need to run a full-text search on, you might consider removing it. I haven't done this myself, but you could try running an optimize command on the table and see if that speeds things up: sqlite.org/fts3.html#optimizeLambdoid
Also tested with optimize now, performance is a little better, but not much. Probably I will have to completely rethink the search feature of my app... I'll mark your answer as correct as your tips might help others with similar problems.Duckweed
Thanks and best of luck with your issue.Lambdoid
Did you face the problem that the MATCH function does not match suffixes so MATCH '*foo*' is actually MATCH 'foo*' meaning it will search for 'foobar' but not 'barfoo' ?Heliopolis
@lukya try removing the asterisk characters to make it a simple token query, rather than a prefix query: SELECT * FROM Elements WHERE name MATCH 'foo';Lambdoid
@Lambdoid token query does not match suffixes.. is there any other solution?.. i've added a question #11753340Heliopolis
@lukya I'm sorry, I don't know of any other solution.Lambdoid
B
0

I am facing some thing similar to your problem. Here is my suggestion try creating a translation table that will translate all the words to numbers. Then search numbers instead of words.

Please let me know if this is helping.

Batman answered 3/4, 2013 at 21:18 Comment(4)
That's an interesting idea, however I don't see how this could speed up the search. Note that an entry in my "name" column can consist of more than only one word, so there's the problem of storing multiple numbers in one entry. Moreover, substring search is impossible with number representations for each word.Duckweed
@Duckweed how about making a new column for each word? And punting all the sub worlds possibilities.Batman
Sorry for the delay in reply, I've been busy lately. With multiple columns I'd get quite a bit of overhead, because the maximum number of words per entry isn't predictable. Apart from that, I can't imagine this could speed up the search significantly (every single column would have to be searched independently).Duckweed
@Duckweed Actually I did it and it's magic. It work Fast. This what I did: I created words table with word_id. And I created subStrings table which is all the words(converted to words ids) as single string(separated by underscore so 1,10 and 11,0 will be different sub strings) all what I do now is converting my String request to subString and search it without MATCH. If I need to search for part of the word I do it in my words table.Batman
D
0

SQLITE now supports trigram indexes (starting with v3.34.0) which should help speed up substring queries:

Substrings consisting of fewer than 3 unicode characters do not match any rows when used with a full-text query. If a LIKE or GLOB pattern does not contain at least one sequence of non-wildcard unicode characters, FTS5 falls back to a linear scan of the entire table.

CREATE VIRTUAL TABLE [media_fts] USING FTS5 (
    [path],
    tokenize='trigram',
    content=[media]
);

Then you can do:

SELECT * FROM Elements WHERE name MATCH 'foo';
-- or
SELECT * FROM media_fts WHERE path MATCH '0000bq6dwok91'

or

select count(*) from tmp
JOIN media_fts ON path MATCH 'http redd ' || entry

https://www.sqlite.org/fts5.html#trigramidx

Despondent answered 18/5, 2023 at 19:36 Comment(0)
A
-1

not sure about speeding it up since you're using sqllite, but for substring searches, I have done things like

SET @foo_bar = 'foo bar'
SELECT * FROM table WHERE name LIKE '%' + REPLACE(@foo_bar, ' ', '%') + '%'

of course this only returns records that have the word "foo" before the word "bar".

Ashaashamed answered 4/7, 2012 at 20:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.