After more deep exploring, found the solution on this resource dict_uk
- Compile files manually by this guide:
sudo snap install gradle
$ cd dict_uk
$ ./gradlew expand
$ cd distr/hunspell/
$ ../../gradlew hunspell
$ sudo cp build/hunspell/uk_UA.aff /usr/share/postgresql/12/tsearch_data/uk_ua.affix
$ sudo cp build/hunspell/uk_UA.dic /usr/share/postgresql/12/tsearch_data/uk_ua.dict
$ sudo cp ../postgresql/ukrainian.stop /usr/share/postgresql/12/tsearch_data/ukrainian.stop
Or just download & unzip the latest hunspell-uk_UA_X.X.X.zip from here and stop words file
- Follow the guide of setting up
ukrainian
language in the Postgres:
$ sudo cp uk_UA.aff $(pg_config --sharedir)/tsearch_data/uk_ua.affix
$ sudo cp uk_UA.dic $(pg_config --sharedir)/tsearch_data/uk_ua.dict
$ sudo cp ukrainian.stop $(pg_config --sharedir)/tsearch_data/ukrainian.stop
$ sudo su postgres
$ psql
CREATE TEXT SEARCH DICTIONARY ukrainian_huns (TEMPLATE = ispell, DictFile = uk_ua, AffFile = uk_ua, StopWords = ukrainian);
CREATE TEXT SEARCH DICTIONARY ukrainian_stem (template = simple, stopwords = ukrainian);
CREATE TEXT SEARCH CONFIGURATION ukrainian (PARSER=default);
ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR hword, hword_part, word WITH ukrainian_huns, ukrainian_stem;
ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR int, uint, numhword, numword, hword_numpart, email, float, file, url, url_path, version, host, sfloat WITH simple;
ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR asciihword, asciiword, hword_asciipart WITH english_stem;
# \dFd
...
pg_catalog | english_stem | snowball stemmer for english language
...
public | ukrainian_huns |
public | ukrainian_stem |
Now it is available for creating a searchable column with help of to_tsvector
:
ALTER TABLE extracted_pages
ADD COLUMN tsvector_uk tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('ukrainian', coalesce(column_with_text, '')), 'A')
) STORED;
This example shows the correct stemming for the Ukrainian language:
SELECT to_tsvector('ukrainian', 'солодко дзюрчить джерело і хочеться жити, любити, творити... ');
=> [{"to_tsvector"=>"'джерело':3 'дзюрчати':2 'жити':6 'любити':7 'солодко':1 'творити':8 'хочеться':5"}]
Results
The Postgres full-text search works as well as similar search text engine SphinxSearch in terms of quality, but it is a bit slower.
On the same query from the huge amount of records (278_000) it returns the same results:
Postgres - ActiveRecord: 67.6ms
SphinxSearch - ActiveRecord: 10.9ms
OS: Ubuntu 20.04
Thank you very much, dict_uk support team!