I'm designing a fulltext search function in postgresql for my current project. It works ok with ispell/myspell dictionaries so far. Now I need to add support for chinese, japanese and arabic search. Where do I start? There are no templates or dictionaries available for those languages as far as I can see. Will it work with pg_catalog.simple configuration?
Just a hint from the manual: A large list of dictionaries is available on the OpenOffice Wiki.
Dictionaries won't help you too much with Chinese - you'll need to look in to NGRAM tokenising...
The similar solution of link at stackoverflow.com is How do I implement full text search in Chinese on PostgreSQL? .
Although that, I would provide a solution below in detail based on my experience and a solution on Internet. I use both tools of SCWS and zhparser as the solution of Chinese full-text search in postgres.
20160131 Update:
You must check whether you have installed postgresql-server-devel-{number version} because we will use pgxs function from it for creating extension in postgresql.
Step1: install SCWS.
It's remarkable that --prefix=/usr/local/scws follows ./configure . Not just has ./configure along in below 4th line.
wget http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2
tar xvjf scws-1.2.2.tar.bz2
cd scws-1.2.2
./configure --prefix=/usr/local/scws
make
make install
To check whether it installed successfully, please enter below command:
ls -al /usr/local/scws/lib/libscws.la
Step2: Install zhparser
git clone https://github.com/amutu/zhparser.git
cd zhparser
SCWS_HOME=/usr/local/scws/include make && make install
20160131 Update:
If you use Mac OS X Yosemite, aboved value of SCWS_HOME is same. But if you use Ubuntu 14.04 LTS, please change value of SCWS_HOME to /usr/local/scws .
Step3: Configure a new extension using zhparser in Postres
Step3.1: Login your postgres database through terminal/commandline
psql yourdatabasename
Step3.2: Create extension in Postgres. You could specify what dictionary name you want.
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION dictionarynameyouwant (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION dictionarynameyouwant ADD MAPPING FOR n,v,a,i,e,l WITH simple;
If you follow above steps, you can use the function of Postgres full-text searching in Chinese/Mandarin words.
Extra step(not necessary) in Rails for using pg_search gem: Step4. Configure the dictionary name at :dictionary attribute of :tsearch in app/models/yourmodel.rb
class YourOwnClass < ActiveRecord::Base
...
include PgSearch
pg_search_scope :functionnameyoulike, :against => [columnsyoulike1, columnsyoulike2, ...,etc], :using => { :tsearch => {:dictionary => "dictionary name you just specified in creating a extension in postgres", blah blah blah, ..., etc} }
end
Reference:
1. SCWS install tutorial
2. [email protected]
3. Francs' Post - Postgres full-text search in Chinese with zhparser and SCWS
4. Rails365.net's Post - Postgres full-text search in Chinese with pg_search gem with zhparser
5. My Post at xuite.net - Make Postgres support full text search in Mandarin/Chinese
ZHParser
, running the query select title from knowledgebase_topics where to_tsvector('testzhcfg',title) @@ to_tsquery('testzhcfg','说话');
gets me error text-search query contains only stop words or doesn't contain lexemes, ignored
even though my Chinese text contains lexemes - "说话" means speak. (#41660409) –
Armourer For those who are landing here for PostgreSQL fulltext search in Japanese, here is the way to do it on ubuntu:
Install the following softwares and development environments:
apt-get install libmecab-dev libmecab2 mecab-ipadic-utf8 mecab-utils libmecab-perl libtext-mecab-perl mecab mecab-jumandic-utf8
Download textsearch_ja from https://www.postgresql.org/ftp/projects/pgFoundry/textsearch-ja/textsearch_ja/9.0.0/
The following is for those using version 12 of PostgreSQL:
cd textsearch_ja-9.0.0
make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config
sudo make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config install
which will yield something like:
/bin/mkdir -p '/usr/lib/postgresql/12/lib'
/bin/mkdir -p '/usr/share/postgresql/12/contrib'
/usr/bin/install -c -m 755 textsearch_ja.so '/usr/lib/postgresql/12/lib/textsearch_ja.so'
/usr/bin/install -c -m 644 .//uninstall_textsearch_ja.sql textsearch_ja.sql '/usr/share/postgresql/12/contrib/'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode/textsearch_ja'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/ '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
/usr/bin/install -c -m 644 textsearch_ja.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_eucjp.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_utf8.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 pgut/pgut-be.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
cd '/usr/lib/postgresql/12/lib/bitcode' && /usr/lib/llvm-10/bin/llvm-lto -thinlto -thinlto-action=thinlink -o textsearch_ja.index.bc textsearch_ja/textsearch_ja.bc textsearch_ja/encoding_eucjp.bc textsearch_ja/encoding_utf8.bc textsearch_ja/pgut/pgut-be.bc
Then, in the automatically generated textsearch_ja.sql, change LANGUAGE='C'
to LANGUAGE='c'
(in lowercase):
perl -pi -E "s/LANGUAGE 'C'/LANGUAGE 'c'/" textsearch_ja.sql
Then, you can use the superuser to add it to PostgreSQL:
sudo -u postgres psql -f textsearch_ja.sql
and if you want to add it for an existing database:
sudo -u postgres psql -d my_database -f textsearch_ja.sql
which will yield something like:
SET
BEGIN
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
COMMENT
CREATE FUNCTION
CREATE TEXT SEARCH TEMPLATE
CREATE TEXT SEARCH DICTIONARY
CREATE TEXT SEARCH CONFIGURATION
COMMENT
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
COMMIT
To test it, as per https://github.com/HiraokaHyperTools/textsearch_ja
SELECT ja_wakachi('分かち書きを行います。');
would yield:
ja_wakachi
----------------------------
分かち書き を 行い ます 。
(1 row)
SELECT furigana('漢字の読みをカタカナで返します。');
would yield:
furigana
--------------------------------------
カンジノヨミヲカタカナデカエシマス。
(1 row)
SELECT * FROM ts_debug('japanese', E'日\n本\n語\n文\n字\n中\nの\n改\n行\nは\n除\n去\n');
would yield:
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+--------+-----------------+---------------+----------
word | Word, all letters | 日本語 | {japanese_stem} | japanese_stem | {日本語}
word | Word, all letters | 文字 | {japanese_stem} | japanese_stem | {文字}
word | Word, all letters | 中 | {japanese_stem} | japanese_stem | {中}
blank | Space symbols | の | {} | |
word | Word, all letters | 改行 | {japanese_stem} | japanese_stem | {改行}
blank | Space symbols | は | {} | |
word | Word, all letters | 除去 | {japanese_stem} | japanese_stem | {除去}
(7 rows)
SELECT * FROM ts_debug('japanese', E'Line\nbreaks\nin\nEnglish\ntext\nare\nreserved.');
would yield:
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+-----------
asciiword | Word, all ASCII | Line | {english_stem} | english_stem | {line}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | breaks | {english_stem} | english_stem | {break}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | in | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | English | {english_stem} | english_stem | {english}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | text | {english_stem} | english_stem | {text}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | are | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | reserved | {english_stem} | english_stem | {reserv}
blank | Space symbols | . | {} | |
(14 rows)
SELECT * FROM ts_debug('japanese', '日本語とEnglishがmixedな文も解析OKです。');
would yield:
alias | description | token | dictionaries | dictionary | lexemes
-----------+-------------------+---------+-----------------+---------------+-----------
word | Word, all letters | 日本語 | {japanese_stem} | japanese_stem | {日本語}
blank | Space symbols | と | {} | |
asciiword | Word, all ASCII | English | {english_stem} | english_stem | {english}
blank | Space symbols | が | {} | |
asciiword | Word, all ASCII | mixed | {english_stem} | english_stem | {mix}
blank | Space symbols | な | {} | |
word | Word, all letters | 文 | {japanese_stem} | japanese_stem | {文}
blank | Space symbols | も | {} | |
word | Word, all letters | 解析 | {japanese_stem} | japanese_stem | {解析}
asciiword | Word, all ASCII | OK | {english_stem} | english_stem | {ok}
blank | Space symbols | です | {} | |
blank | Space symbols | 。 | {} | |
(12 rows)
SELECT s
FROM regexp_split_to_table(to_tsvector('japanese',
'語尾は基本形に戻されます。')::text, ' ') AS t(s)
ORDER BY s;
would yield:
s
------------
'れる':4
'基本形':2
'戻す':3
'語尾':1
(4 rows)
SELECT s
FROM regexp_split_to_table(to_tsvector('japanese',
'ユーザとユーザーは正規化されます。ミラーとミラは別扱い。')::text, ' ') AS t(s)
ORDER BY s;
would yield:
s
--------------
'する':5
'ミラー':7
'ミラ':8
'ユーザ':1,2
'れる':6
'別':9
'化':4
'扱い':10
'正規':3
(9 rows)
© 2022 - 2024 — McMap. All rights reserved.