Recursive Relationship on Dictionary Table
Asked Answered
D

4

2

I'm working on a poor, but ok for us, full-text search using only PSQL in Firebird. I'll try to simplify as much as possible by focusing on my problem:

Summing up, this a dictionary table:

SELECT * FROM FTS_KEYWORDS

 ID | KEYWORD
----+-----------
  1 | 'FORD'
  1 | 'MUSTANG'
  1 | '2010'
  2 | 'FORD'
  2 | 'FUSION'
  2 | 'TURBO'
  2 | '2010'
  3 | 'FORD'
  3 | 'RANGER'
  3 | 'TURBO'
  3 | '2010'
  3 | 'BLACK'

There is too a FTS_TOKENIZE() procedure to get the words from the whole strings


Case 1: User search with 1 keyword

SELECT TOKENS FROM FTS_TOKENIZE('FORD')

 TOKENS
-------------
  'FORD'

This would then be the SQL required to get the correct results:

:TOKEN_1 = 'FORD'

SELECT DISTINCT ID
FROM FTS_KEYWORDS
WHERE (KEYWORD STARTING :TOKEN_1)

 ID 
-----
  1
  2 
  3 

Case 2: User search with 3 keywords

SELECT TOKENS FROM FTS_TOKENIZE('FORD 2010 BLACK')

 TOKENS
-------------
 'FORD'
 '2010'
 'BLACK'

So, SQL to retrieve the correct values:

:TOKEN_1 = 'FORD'
:TOKEN_2 = '2010'
:TOKEN_3 = 'BLACK'

SELECT DISTINCT K1.ID
FROM FTS_KEYWORDS K1
WHERE (K1.KEYWORD STARTING :TOKEN_1)
  AND (K1.ID IN (SELECT DISTINCT K2.ID
                 FROM FTS_KEYWORDS K2
                 WHERE (K2.KEYWORD STARTING :TOKEN_2)))
                   AND (K2.ID IN (SELECT DISTINCT K3.ID
                                  FROM FTS_KEYWORDS K3
                                  WHERE (K3.KEYWORD STARTING :TOKEN_3)))

 ID 
-----
  3 

ID 3 is the only ID that has all the keywords matching the search.

The SQL to retrieve values is a recursive nested by the tokens amount user query search.

Currently, in a procedure FTS_SEARCH(), I build a SQL string and use then in an EXECUTE STATEMENT way, but I do not think this is ideal.

I think this can be done with recursive Common Table Expressions (“WITH ... AS ... SELECT”), but I was not able to do it, because, based on the current examples available, it requires a table with Parent_ID and does not accept input parameters, which is not my case.

My question is: Is there a way to do this search in a recursive way using CTE or other SQL trick?

Disjunction answered 20/5, 2019 at 12:9 Comment(0)
Z
1

You can do this by building prefixed list. As prefix i have used ASCII_CHAR(5)

SELECT 
  K.ID, COUNT(*) 
FROM FTS_KEYWORDS K
WHERE
  (SELECT ASCII_CHAR(5) || LIST(T.TOKEN, ASCII_CHAR(5)) || ASCII_CHAR(5) FROM FTS_TOKENIZE('FORD 2010 BLACK') T)
  LIKE '%' || ASCII_CHAR(5) || K.KEYWORD || ASCII_CHAR(5) || '%'
GROUP BY K.ID
HAVING COUNT(*)=(SELECT COUNT(*) FROM FTS_TOKENIZE('FORD 2010 BLACK') TX)

this should be faster (lower fetches), but you must test this in your environment.

You can speed this up also by removing FTS_TOKENIZE at all and instead of 'FORD 2010 BLACK' you simply do

SELECT 
  K.ID, COUNT(*) 
FROM FTS_KEYWORDS K
WHERE
  ASCII_CHAR(5) || 'FORD' || ASCII_CHAR(5) || '2010' || ASCII_CHAR(5) || 'BLACK' || ASCII_CHAR(5) 
  LIKE '%' || ASCII_CHAR(5) || K.KEYWORD || ASCII_CHAR(5) || '%'
GROUP BY K.ID
HAVING COUNT(*)=3

but i do not know your real case especially how this string is build to pass to FTS_TOKENIZE

UPDATE1 Not the answer to your question but you can optimize your current query by:

SELECT
    DISTINCT K1.ID
FROM
    FTS_KEYWORDS K1
    INNER JOIN FTS_KEYWORDS K2 ON K2.ID = K1.ID AND K2.KEYWORD STARTING 'FORD'
    INNER JOIN FTS_KEYWORDS K3 ON K3.ID = K2.ID AND K3.KEYWORD STARTING '2010'
WHERE
    K1.KEYWORD STARTING 'BLACK' 
Zinck answered 23/5, 2019 at 17:18 Comment(3)
Thanks, but your solution had a poor performance in my environment, in a 17 millions keywords table, it's takes 1-2 minutes to process..Disjunction
I will look if this can be done in different way but in the meantime i have added optimized query of your oryginal query - instead of in i use join, should be much fasterZinck
You are right, with JOIN is much faster and the seachs takes just some milliseconds.. I think, in my case, the best alternative is to make a EXECUTE STATEMENT using the JOIN you mentioned like thisDisjunction
P
0

Instead of using a recursive CTE, you could put your list of tokens into a table (CRITERIA), join that table with FTS_KEYWORDS on KEYWORD, group by ID and count the number of keywords per ID, and apply a HAVING clause to select only those ID values with a count equal to the number of rows in the CRITERIA table.

Praedial answered 20/5, 2019 at 17:45 Comment(6)
Thanks, but your suggestion would only work if I used = instead of STARTING, which would limit my searches to just exact keywords, which does not meet my needDisjunction
You should be able to use STARTING in the join expression instead of =.Praedial
That might work for the join, but the count will be off if multiple keywords match the same token.Vacationist
@MarkRotteveel maybe it would be a good property, if user duplicate search term - maybe it is of more importance? then the condition should be changed to counts being >= not strictly =.Nicknack
@Arioch'The Possibly, but the OP expects only ids that match all tokens. Otherwise you could just count the number of matches and order by that count descending.Vacationist
@MarkRotteveel assuming CRITERIA is a per-transaction GTT copy of FTS_TOKENIZE results, the correlated subquery instead of join maybe can do. Something along Select a.ID, count(*) from (Select distinct ID from FTS_KEYWORDS) a, (Select c.token from CRITERIA c where Exists (select * from FTS_KEYWORDS k where k.id=a.id and c.token = /*starting, containing, like, >=,... */ k.keyword)) group by a.id - of course a better be direct query to some master table with ID being unique by design.Nicknack
V
0

Instead of resorting to using a recursive CTE (and I don't know if using a recursive CTE will actually solve your problem nor if it would perform), I propose the following solution:

WITH tokens AS (
    SELECT COUNT(*) OVER () tokencount, token 
    FROM fts_tokenize('FORD 2010 BLACK')
)
SELECT id
FROM (
    SELECT DISTINCT tokencount, token, id
    FROM tokens t
    INNER JOIN fts_keywords k
        ON k.KEYWORD STARTING WITH t.token
)
GROUP BY id
HAVING MAX(tokencount) = count(*)

This will track the number of tokens (not keywords!) matched and only output those ids where the number of matched tokens is equal to the number of expected tokens.

Tracking the number of tokens and not keywords is important given your need to use STARTING (STARTING WITH) as that could match multiple keywords to a single token which should be counted only once.

Be aware, this solution does assume that fts_tokenize will only output a token once, otherwise you'll need to modify the tokens CTE to

WITH tokens AS (
    SELECT COUNT(*) OVER () tokencount, token
    FROM (
        SELECT DISTINCT token
        FROM fts_tokenize('FORD 2010 BLACK')
    ) a
),
Vacationist answered 21/5, 2019 at 7:3 Comment(4)
Thank you, your solution is smart, but it has a poor performance than doing nested queries (of case 2) because your SQL searches for each token in the entire table, whereas SQL nested the next query only filters under the results of the previous query. (In my tests, your SQL took 7x more time and 40x more reads)Disjunction
The disadvantage of nested SQL in this case is that I did not previously know how many tokens will need to nestDisjunction
@Disjunction Right, I'll see if I can come up with another solution.Vacationist
I've tried using GTT to temporarily store results and use them to filter when fetching the next token, but the performance is worst working with INSERT/DELETE in GTTDisjunction
D
0

I think this is a simple case of double negation (I'm rephrasing your question to be that there should be no token that is not the beginning of a keyword), no need for a cte:

SELECT DISTINCT K.ID
FROM FTS_TOKENIZE ('FORD 2010 BLACK') FT
JOIN FTS_KEYWORDS K ON K.KEYWORD STARTING FT.TOKENS
WHERE NOT EXISTS(SELECT *
                 FROM FTS_TOKENIZE('FORD 2010 BLACK') FT2
                 WHERE NOT EXISTS(SELECT *
                                  FROM FTS_KEYWORDS K2
                                  WHERE K2.KEYWORD STARTING FT2.TOKENS
                                    AND K.ID = K2.ID))

HTH, Set

Designed answered 23/5, 2019 at 20:7 Comment(1)
Thanks, but your solution had a poor performance in my environment, in a 17 millions keywords table, it's takes 2-3 minutes to process..Disjunction

© 2022 - 2024 — McMap. All rights reserved.