Postgresql full text search in postgresql - japanese, chinese, arabic
Asked Answered
P

4

14

I'm designing a fulltext search function in postgresql for my current project. It works ok with ispell/myspell dictionaries so far. Now I need to add support for chinese, japanese and arabic search. Where do I start? There are no templates or dictionaries available for those languages as far as I can see. Will it work with pg_catalog.simple configuration?

Pickering answered 22/3, 2010 at 21:44 Comment(0)
S
4

Just a hint from the manual: A large list of dictionaries is available on the OpenOffice Wiki.

Sot answered 23/3, 2010 at 7:43 Comment(1)
It looks like the link is out-of-date.Ruff
G
3

Dictionaries won't help you too much with Chinese - you'll need to look in to NGRAM tokenising...

Graphics answered 29/11, 2010 at 9:4 Comment(0)
E
3

The similar solution of link at stackoverflow.com is How do I implement full text search in Chinese on PostgreSQL? .

Although that, I would provide a solution below in detail based on my experience and a solution on Internet. I use both tools of SCWS and zhparser as the solution of Chinese full-text search in postgres.

20160131 Update:
You must check whether you have installed postgresql-server-devel-{number version} because we will use pgxs function from it for creating extension in postgresql.

Step1: install SCWS.
It's remarkable that --prefix=/usr/local/scws follows ./configure . Not just has ./configure along in below 4th line.

wget http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2
tar xvjf scws-1.2.2.tar.bz2
cd scws-1.2.2
./configure --prefix=/usr/local/scws 
make
make install

To check whether it installed successfully, please enter below command:

ls -al /usr/local/scws/lib/libscws.la


Step2: Install zhparser

git clone https://github.com/amutu/zhparser.git
cd zhparser
SCWS_HOME=/usr/local/scws/include make && make install

20160131 Update: If you use Mac OS X Yosemite, aboved value of SCWS_HOME is same. But if you use Ubuntu 14.04 LTS, please change value of SCWS_HOME to /usr/local/scws .

Step3: Configure a new extension using zhparser in Postres
Step3.1: Login your postgres database through terminal/commandline

psql yourdatabasename

Step3.2: Create extension in Postgres. You could specify what dictionary name you want.

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION dictionarynameyouwant (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION dictionarynameyouwant ADD MAPPING FOR n,v,a,i,e,l WITH simple;


If you follow above steps, you can use the function of Postgres full-text searching in Chinese/Mandarin words.

Extra step(not necessary) in Rails for using pg_search gem: Step4. Configure the dictionary name at :dictionary attribute of :tsearch in app/models/yourmodel.rb

class YourOwnClass < ActiveRecord::Base
    ...
    include PgSearch
    pg_search_scope :functionnameyoulike, :against => [columnsyoulike1, columnsyoulike2, ...,etc], :using => { :tsearch => {:dictionary => "dictionary name you just specified in creating a extension in postgres", blah blah blah, ..., etc} }
end

Reference:
1. SCWS install tutorial
2. [email protected]
3. Francs' Post - Postgres full-text search in Chinese with zhparser and SCWS
4. Rails365.net's Post - Postgres full-text search in Chinese with pg_search gem with zhparser
5. My Post at xuite.net - Make Postgres support full text search in Mandarin/Chinese

Ecker answered 11/1, 2016 at 6:12 Comment(2)
I'm having issues with Chinese Full Text Search extension ZHParser, running the query select title from knowledgebase_topics where to_tsvector('testzhcfg',title) @@ to_tsquery('testzhcfg','说话'); gets me error text-search query contains only stop words or doesn't contain lexemes, ignored even though my Chinese text contains lexemes - "说话" means speak. (#41660409)Armourer
Have you installed and configured with postgresql-server-devel-* and SCWS ?Ecker
P
1

For those who are landing here for PostgreSQL fulltext search in Japanese, here is the way to do it on ubuntu:

Install the following softwares and development environments:

apt-get install libmecab-dev libmecab2 mecab-ipadic-utf8 mecab-utils libmecab-perl libtext-mecab-perl mecab mecab-jumandic-utf8

Download textsearch_ja from https://www.postgresql.org/ftp/projects/pgFoundry/textsearch-ja/textsearch_ja/9.0.0/

The following is for those using version 12 of PostgreSQL:

cd textsearch_ja-9.0.0
make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config
sudo make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config install

which will yield something like:

/bin/mkdir -p '/usr/lib/postgresql/12/lib'
/bin/mkdir -p '/usr/share/postgresql/12/contrib'
/usr/bin/install -c -m 755  textsearch_ja.so '/usr/lib/postgresql/12/lib/textsearch_ja.so'
/usr/bin/install -c -m 644 .//uninstall_textsearch_ja.sql textsearch_ja.sql '/usr/share/postgresql/12/contrib/'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode/textsearch_ja'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/ '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
/usr/bin/install -c -m 644 textsearch_ja.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_eucjp.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_utf8.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 pgut/pgut-be.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
cd '/usr/lib/postgresql/12/lib/bitcode' && /usr/lib/llvm-10/bin/llvm-lto -thinlto -thinlto-action=thinlink -o textsearch_ja.index.bc textsearch_ja/textsearch_ja.bc textsearch_ja/encoding_eucjp.bc textsearch_ja/encoding_utf8.bc textsearch_ja/pgut/pgut-be.bc

Then, in the automatically generated textsearch_ja.sql, change LANGUAGE='C' to LANGUAGE='c' (in lowercase):

perl -pi -E "s/LANGUAGE 'C'/LANGUAGE 'c'/" textsearch_ja.sql

Then, you can use the superuser to add it to PostgreSQL:

sudo -u postgres psql -f textsearch_ja.sql

and if you want to add it for an existing database:

sudo -u postgres psql -d my_database -f textsearch_ja.sql

which will yield something like:

SET
BEGIN
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
COMMENT
CREATE FUNCTION
CREATE TEXT SEARCH TEMPLATE
CREATE TEXT SEARCH DICTIONARY
CREATE TEXT SEARCH CONFIGURATION
COMMENT
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
COMMIT

To test it, as per https://github.com/HiraokaHyperTools/textsearch_ja

SELECT ja_wakachi('分かち書きを行います。');

would yield:

         ja_wakachi
----------------------------
 分かち書き を 行い ます 。
(1 row)
SELECT furigana('漢字の読みをカタカナで返します。');

would yield:

               furigana
--------------------------------------
 カンジノヨミヲカタカナデカエシマス。
(1 row)
SELECT * FROM ts_debug('japanese', E'日\n本\n語\n文\n字\n中\nの\n改\n行\nは\n除\n去\n');

would yield:

 alias |    description    | token  |  dictionaries   |  dictionary   | lexemes
-------+-------------------+--------+-----------------+---------------+----------
 word  | Word, all letters | 日本語 | {japanese_stem} | japanese_stem | {日本語}
 word  | Word, all letters | 文字   | {japanese_stem} | japanese_stem | {文字}
 word  | Word, all letters | 中     | {japanese_stem} | japanese_stem | {中}
 blank | Space symbols     | の     | {}              |               |
 word  | Word, all letters | 改行   | {japanese_stem} | japanese_stem | {改行}
 blank | Space symbols     | は     | {}              |               |
 word  | Word, all letters | 除去   | {japanese_stem} | japanese_stem | {除去}
(7 rows)
SELECT * FROM ts_debug('japanese', E'Line\nbreaks\nin\nEnglish\ntext\nare\nreserved.');

would yield:

   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | Line     | {english_stem} | english_stem | {line}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | breaks   | {english_stem} | english_stem | {break}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | in       | {english_stem} | english_stem | {}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | English  | {english_stem} | english_stem | {english}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | text     | {english_stem} | english_stem | {text}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | are      | {english_stem} | english_stem | {}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | reserved | {english_stem} | english_stem | {reserv}
 blank     | Space symbols   | .        | {}             |              |
(14 rows)
SELECT * FROM ts_debug('japanese', '日本語とEnglishがmixedな文も解析OKです。');

would yield:

   alias   |    description    |  token  |  dictionaries   |  dictionary   |  lexemes
-----------+-------------------+---------+-----------------+---------------+-----------
 word      | Word, all letters | 日本語  | {japanese_stem} | japanese_stem | {日本語}
 blank     | Space symbols     | と      | {}              |               |
 asciiword | Word, all ASCII   | English | {english_stem}  | english_stem  | {english}
 blank     | Space symbols     | が      | {}              |               |
 asciiword | Word, all ASCII   | mixed   | {english_stem}  | english_stem  | {mix}
 blank     | Space symbols     | な      | {}              |               |
 word      | Word, all letters | 文      | {japanese_stem} | japanese_stem | {文}
 blank     | Space symbols     | も      | {}              |               |
 word      | Word, all letters | 解析    | {japanese_stem} | japanese_stem | {解析}
 asciiword | Word, all ASCII   | OK      | {english_stem}  | english_stem  | {ok}
 blank     | Space symbols     | です    | {}              |               |
 blank     | Space symbols     | 。      | {}              |               |
(12 rows)
SELECT s
  FROM regexp_split_to_table(to_tsvector('japanese',
 '語尾は基本形に戻されます。')::text, ' ') AS t(s)
  ORDER BY s;

would yield:

     s
------------
 'れる':4
 '基本形':2
 '戻す':3
 '語尾':1
(4 rows)
SELECT s
   FROM regexp_split_to_table(to_tsvector('japanese',
 'ユーザとユーザーは正規化されます。ミラーとミラは別扱い。')::text, ' ') AS t(s)
  ORDER BY s;

would yield:

      s
--------------
 'する':5
 'ミラー':7
 'ミラ':8
 'ユーザ':1,2
 'れる':6
 '別':9
 '化':4
 '扱い':10
 '正規':3
(9 rows)
Paraph answered 1/5, 2023 at 23:15 Comment(1)
In case anyone is interested this also works with PostgreSQL 15.Inflationary

© 2022 - 2024 — McMap. All rights reserved.