Full text search for compound words
Asked Answered
F

3

3

I am looking into PostgreSQL full text search and would like to know if it is possible to search for the second part of compound words.

It there a way to get 'Cheesecake' as a result when I search for 'cake'?

-- Lets have a table like this:
CREATE TABLE IF NOT EXISTS table1(
    id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    col1 TEXT,
    col1_tsv TSVECTOR
);
CREATE INDEX IF NOT EXISTS col1_index ON table1 USING gin(col1_tsv);
-- Insert some data into it:
INSERT INTO table1 (col1, col1_tsv)
VALUES ('Blacklist', TO_TSVECTOR('Blacklist')),('Cheesecake', TO_TSVECTOR('Cheesecake'));

If I search for 'cake' or 'list' I get no results.

SELECT col1 FROM table1 WHERE col1_tsv @@ to_tsquery('english', 'list');
SELECT col1 FROM table1 WHERE col1_tsv @@ to_tsquery('english', 'cake');

Check it with ts_lexize:

select ts_lexize('english_stem','Blacklist');
select ts_lexize('english_stem','Cheesecake');

The output:

  ts_lexize  
-------------
 {blacklist}
(1 row)

  ts_lexize  
-------------
 {cheesecak}
(1 row)

Works as designed but is there a way to get the Cheesecake just from searching for cake? (And I do not mean)

select * from table1 where col1 like '%cake%';

When I select the whole table the Cheesecake is also cut into Cheesecak.

select * from table1;
 id |    col1    |   col1_tsv    
----+------------+---------------
  1 | Blacklist  | 'blacklist':1
  2 | Cheesecake | 'cheesecak':1
Fastness answered 15/7, 2021 at 23:25 Comment(0)
L
1

Full text search is capable of prefix matching. See:

But that only works for left-anchored searches. Your pattern is right-anchored.

You could build an index on reversed strings and search with reversed patterns:

CREATE INDEX table1_col1_rtsv_idx ON table1 USING gin (TO_TSVECTOR('simple', reverse(col1)));

Then this query with prefix search can use the new index:

SELECT col1 FROM table1
WHERE  to_tsvector('simple', reverse(col1))
    @@ to_tsquery('simple', reverse('cake') || ':*');

But I would consider a trigram index instead. See:

CREATE INDEX table1_col1_gin_trgm_idx ON table1 USING gin (col1 gin_trgm_ops);

Query:

SELECT col1 FROM table1
WHERE  col1 LIKE '%cake';

Notably, the pattern is '%cake', not '%cake%', if "cake" shall be at the end of the string. But the trigram index supports this just as well:

SELECT col1 FROM table1
WHERE  col1 LIKE '%cake%';

db<>fiddle here

Text search indexes are typically substantially smaller than trigram indexes - and therefore a bit faster. There are lots of other subtle differences ...

Locus answered 16/7, 2021 at 0:23 Comment(0)
S
4

There is a solution for such cases: you need a Hunspell dictionary for the languages you want to support. These dictionaries must also define compound word rules. If these requirements are met, Postgres can decompose compound words into their components and index them so that they become findable.

I show here an example for the language German, where many compound words are used:

  1. First we need a suitable Hunspell dictionary with compound word rules. After a while of researching, I found one: https://github.com/vpikulik/hunspell_de_compounds. You can see that a Hunspell dictionary defines compound word rules if its *.aiff file contains a line like compoundwords controlled _.

  2. Rename the file extensions to be *.affix and *.dict. Postgres expects it like this.

  3. Postgres wants, that Hunspell dictionaries are UTF8 encoded. Thus, I open the *.affix and *.dict files with Sublime Text, call File, Save with Encoding, UTF-8 for both files.

  4. You need to copy both files to the database machine (or container etc.) Open a terminal on the database machine, and move the files to the correct location:

    destination=$(echo $(pg_config --sharedir)/tsearch_data)
    mv de_DE.affix $destination
    mv de_DE.dict $destination
    

    Here, pg_config --sharedir yields the shared directory of the Postgres installation. The destination for the dictionary is the tsearch_data subdirectory.

  5. Connect to your database (locally or remote), e.g. locally by psql command.

  6. Now, we create (a) our own text search dictionary and (b) our own text search configuration inside of Postgres. We call both german_hunspell. Here is the code:

    • We drop a previously created configuration + dictionary. Just in case, we want to repeat the process e.g. because we want to use another dictionary.

      DROP TEXT SEARCH DICTIONARY german_hunspell CASCADE;
      
    • We create the dictionary:

      CREATE TEXT SEARCH DICTIONARY german_hunspell
      (TEMPLATE = ispell, DictFile = de_DE, AffFile = de_DE, Stopwords = german);
      

      Here, DictFile = de_DE, Postgres expects a file de_DE.dict; for AffFile = de_DE Postgres expects a file de_DE.affix.

    • We create a new text search configuration by deriving it from the german configuration provided in Postgres:

      CREATE TEXT SEARCH CONFIGURATION german_hunspell (COPY = german);
      
    • Next, we modify the configuration we created previously. We define that Postgres should use our new configuration for words of all kinds. If our Hunspell dictionary does not have any rules for a particular word, we forward the request to the default stemer for German:

      ALTER TEXT SEARCH CONFIGURATION german_hunspell
      ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
      word, hword, hword_part WITH german_hunspell, german_stem;
      
  7. Done. We can test that it works by using the ts_debug command:

    SELECT * FROM ts_debug('german_hunspell', 'Wettersystemsimulationssoftware');
    

    Here, Wettersystemsimulationssoftware is a German compound word. It gets splitted into wetter, system, simulation, software. When a user searches e.g. for system, Postgres would found this entry.

  8. In order to use our configuration, you have to specify it for any to_tsvector, to_tsquery, websearch_to_tsquery, etc. commands. Here are a few examples:

    SELECT to_tsvector('german_hunspell', 'content goes here');
    SELECT to_tsquery('german_hunspell', 'query goes here');
    SELECT websearch_to_tsquery('german_hunspell', 'query goes here');
    ...
    

    It works from most (any?) languages as well, e.g. by using C#, as long as the driver allows you to specify the configuration to use. Some driver like the C# ones using a binary interface for communication with the database. In such a case, you cannot address a configuration by its name e.g. german_hunspell. Instead, you have to query for it's OID like this:

    SELECT oid from pg_catalog.pg_ts_config where cfgname = 'german_hunspell';
    

    Then, you can cache this OID and work with it.

You can repeat this process with any language you want. Unfortunately, there are not appropriate Hunspell dictionaries for all languages available. I would like to have a dictionary for English. Suppose my record contains the word Spaceship, then I would like users to be able to search for Ship and find that record. Unfortunately, after several hours of research, I have not found a suitable English dictionary. This is where the open source community could get active...

Sherfield answered 15/10, 2021 at 9:30 Comment(1)
Thanks SommerEngineering for your answer, I did not check you suggestion yet but maybe I could help you out with an english dict and aff files: cgit.freedesktop.org/libreoffice/dictionaries/tree/en but I havent check yet if this will work (also I dotn know the difference between aff and affix)Fastness
L
1

Full text search is capable of prefix matching. See:

But that only works for left-anchored searches. Your pattern is right-anchored.

You could build an index on reversed strings and search with reversed patterns:

CREATE INDEX table1_col1_rtsv_idx ON table1 USING gin (TO_TSVECTOR('simple', reverse(col1)));

Then this query with prefix search can use the new index:

SELECT col1 FROM table1
WHERE  to_tsvector('simple', reverse(col1))
    @@ to_tsquery('simple', reverse('cake') || ':*');

But I would consider a trigram index instead. See:

CREATE INDEX table1_col1_gin_trgm_idx ON table1 USING gin (col1 gin_trgm_ops);

Query:

SELECT col1 FROM table1
WHERE  col1 LIKE '%cake';

Notably, the pattern is '%cake', not '%cake%', if "cake" shall be at the end of the string. But the trigram index supports this just as well:

SELECT col1 FROM table1
WHERE  col1 LIKE '%cake%';

db<>fiddle here

Text search indexes are typically substantially smaller than trigram indexes - and therefore a bit faster. There are lots of other subtle differences ...

Locus answered 16/7, 2021 at 0:23 Comment(0)
B
1

If you want to handle composites correctly and are not interested in substring matching, I think you need a thesaurus dictionary. For each composite you want to search for, you would have to add entries like

cheesecak : cheesecak chees cak
blacklist : blacklist black list

That way, you retain the original word and add its parts.

Annoying, but there is no automated way of detecting composites. For example, "havelock" has nothing to do with "lock", and a "haberdasher" need not "dash".

Beverleybeverlie answered 16/7, 2021 at 4:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.