How do I implement full text search in Chinese on PostgreSQL?

This question has been asked before:

Postgresql full text search in postgresql - japanese, chinese, arabic

but there are no answers for Chinese as far as I can see. I took a look at the OpenOffice wiki, and it doesn't have a dictionary for Chinese.

Edit: As we are already successfully using PG's internal FTS engine for English documents, we don't want to move to an external indexing engine. Basically, what I'm looking for is a Chinese FTS configuration, including a parser and dictionaries for Simplified Chinese (Mandarin).
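
For reference, this is the kind of built-in setup we already have for English, and what we are hoping to reproduce for Chinese (the query text is just an illustration):

```sql
-- English works out of the box with the built-in configuration:
SELECT to_tsvector('english', 'The quick brown fox jumps')
       @@ to_tsquery('english', 'fox & jump');  -- true

-- There is no equivalent built-in 'chinese' configuration
-- (parser + dictionaries), which is exactly the gap we need to fill.
```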

Weinrich answered 22/10, 2010 at 6:43 Comment(1)
As we were unable to find a solution for this (even with the bounty I offered) we eventually moved to SQL Server, which natively supports Chinese FTS. Luckily our application was designed to be fairly DB vendor agnostic, so this wasn't a huge problem for us. – Weinrich

I know it's an old question, but there's a Postgres extension for Chinese: https://github.com/amutu/zhparser/
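
A minimal sketch of wiring it into a text search configuration, following the extension's README (the configuration name and sample sentence are illustrative):

```sql
-- Requires the SCWS segmentation library and the zhparser extension.
CREATE EXTENSION zhparser;

-- Create a configuration that uses zhparser to segment Chinese text.
CREATE TEXT SEARCH CONFIGURATION chinese_zh (PARSER = zhparser);

-- Map the main token types (nouns, verbs, adjectives, idioms, etc.)
-- to the 'simple' dictionary.
ALTER TEXT SEARCH CONFIGURATION chinese_zh
    ADD MAPPING FOR n,v,a,i,e,l WITH simple;

-- Produces segmented lexemes instead of one opaque string:
SELECT to_tsvector('chinese_zh', '南京市长江大桥');
```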

Giselegisella answered 21/5, 2015 at 9:25 Comment(2)
I'm getting "text-search query contains only stop words or doesn't contain lexemes, ignored" issues. See #41660409 – Supplejack
@Growler page not found. – Cornhusking

I've just implemented a Chinese FTS solution in PostgreSQL. I did it by creating n-gram tokens from the Chinese input and building the necessary tsvectors with an embedded function (in my case I used plpythonu). It works very well (and was massively preferable to moving to SQL Server!).
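
The original function isn't posted, but a pure-SQL sketch of the same bigram idea might look like this (the function, table, and column names are hypothetical; the answer above used PL/Python):

```sql
-- Hypothetical helper: split text into overlapping character bigrams
-- and index them with the 'simple' configuration (no stemming needed).
CREATE OR REPLACE FUNCTION zh_bigram_tsvector(doc text) RETURNS tsvector AS $$
    SELECT to_tsvector('simple',
                       string_agg(substr(doc, i, 2), ' '))
    FROM generate_series(1, greatest(char_length(doc) - 1, 1)) AS i;
$$ LANGUAGE sql IMMUTABLE;

-- Usage sketch: index the bigram tsvector, then run the same bigram
-- split over the search phrase at query time.
CREATE INDEX ON documents USING gin (zh_bigram_tsvector(body));

SELECT id FROM documents
 WHERE zh_bigram_tsvector(body) @@ to_tsquery('simple',
       (SELECT string_agg(substr('长江大桥', i, 2), ' & ')
          FROM generate_series(1, char_length('长江大桥') - 1) AS i));
```

Bigrams trade index size for independence from dictionaries: every pair of adjacent characters becomes a lexeme, so matching works without any word segmentation.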

Instillation answered 18/1, 2013 at 6:8 Comment(1)
Is it available anywhere, or are you also using the zhparser mentioned above? – Sweyn

Index your data with Solr; it's an open-source enterprise search server built on top of Lucene.

You can find more info on Solr here:

http://lucene.apache.org/solr/

A good how-to book (with immediate PDF download) is available here:

https://www.packtpub.com/solr-1-4-enterprise-search-server/book

And be sure to use a Chinese tokenizer, such as solr.ChineseTokenizerFactory, because Chinese is not whitespace-delimited.

Bodyguard answered 22/10, 2010 at 6:57 Comment(2)
We need to use the FTS engine built into Postgres. We have already successfully implemented English FTS, and want to continue to use the same system for Chinese documents. – Weinrich
Oh, I see. Well, then my answer isn't helpful to you; I see your clarification/edit on the question since your original post. I'm not sure what your timeline will accommodate, but the Solr solutions are open source. You may be able to borrow from the ChineseTokenizerFactory: its logic overcomes the inherent problem as I understand it, that the language is not whitespace-delimited. Best of luck to you. – Bodyguard
