Where to find Ukrainian 'ispell', 'aspell', 'snowball' dictionary for adding it to full-text search in Postgres?
Asked Answered
P

1

9

After parsing many documents, I have a lot of rows/columns with Ukrainian text that should be indexed for full-text search in Postgres.

I've found that Postgres 14 supports by default 29 languages, but unfortunately not the Ukrainian one.

After subsequent digging, I've found that it allows adding an external dictionary:

CREATE TEXT SEARCH DICTIONARY my_lang_ispell (
    TEMPLATE = ispell,
    DictFile = path_to_my_lang_dict_file,
    AffFile = path_to_my_lang_affixes_file,
    StopWords = path_to_my_lang_astop_words_file
);

But how to find the most relevant DictFile, AffFile, and StopWords files? For example, snowball source doesn't contain this language.

So, could anyone help me find the best way to obtain ispell, aspell, snowball, or another dictionary for the Ukrainian language?

Thanks!

Poteet answered 30/9, 2022 at 10:22 Comment(1)
Perhaps you have to create them yourself.Vashtivashtia
P
12

After more deep exploring, found the solution on this resource dict_uk

  1. Compile files manually by this guide:
sudo snap install gradle

$ cd dict_uk
$ ./gradlew expand

$ cd distr/hunspell/

$ ../../gradlew hunspell

$ sudo cp build/hunspell/uk_UA.aff /usr/share/postgresql/12/tsearch_data/uk_ua.affix
$ sudo cp build/hunspell/uk_UA.dic /usr/share/postgresql/12/tsearch_data/uk_ua.dict
$ sudo cp ../postgresql/ukrainian.stop /usr/share/postgresql/12/tsearch_data/ukrainian.stop

Or just download & unzip the latest hunspell-uk_UA_X.X.X.zip from here and stop words file

  1. Follow the guide of setting up ukrainian language in the Postgres:
$ sudo cp uk_UA.aff $(pg_config --sharedir)/tsearch_data/uk_ua.affix
$ sudo cp uk_UA.dic $(pg_config --sharedir)/tsearch_data/uk_ua.dict
$ sudo cp ukrainian.stop $(pg_config --sharedir)/tsearch_data/ukrainian.stop
$ sudo su postgres
$ psql

CREATE TEXT SEARCH DICTIONARY ukrainian_huns (TEMPLATE = ispell, DictFile = uk_ua, AffFile = uk_ua, StopWords = ukrainian);

CREATE TEXT SEARCH DICTIONARY ukrainian_stem (template = simple, stopwords = ukrainian);

CREATE TEXT SEARCH CONFIGURATION ukrainian (PARSER=default);

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR  hword, hword_part, word WITH ukrainian_huns, ukrainian_stem;

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR  int, uint, numhword, numword, hword_numpart, email, float, file, url, url_path, version, host, sfloat WITH simple;

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR asciihword, asciiword, hword_asciipart WITH english_stem;

# \dFd
...
 pg_catalog | english_stem    | snowball stemmer for english language
...
 public     | ukrainian_huns  | 
 public     | ukrainian_stem  | 

Now it is available for creating a searchable column with help of to_tsvector:

      ALTER TABLE extracted_pages
      ADD COLUMN tsvector_uk tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('ukrainian', coalesce(column_with_text, '')), 'A')
      ) STORED;

This example shows the correct stemming for the Ukrainian language:

SELECT to_tsvector('ukrainian', 'солодко дзюрчить джерело і хочеться жити, любити, творити... ');
 => [{"to_tsvector"=>"'джерело':3 'дзюрчати':2 'жити':6 'любити':7 'солодко':1 'творити':8 'хочеться':5"}]

Results

The Postgres full-text search works as well as similar search text engine SphinxSearch in terms of quality, but it is a bit slower.

On the same query from the huge amount of records (278_000) it returns the same results:

Postgres     - ActiveRecord: 67.6ms
SphinxSearch - ActiveRecord: 10.9ms

OS: Ubuntu 20.04

Thank you very much, dict_uk support team!

Poteet answered 30/9, 2022 at 19:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.