How to improve detection of sentences in Sphinx?

Asked 12/9, 2016 at 8:57 Answered 20/9, 2016 at 4:6

Solved full-text-search sphinx full-text-indexing

Sphinx can search for words that occur within the same sentence. For example, take the following text (roughly: "Vasya did well, he ate a cucumber because he was hungry. That's how it is."):

Вася молодец, съел огурец, т.к. проголодался. Такие дела.

If I search

молодец SENTENCE огурец

I find this text. If I search

молодец SENTENCE проголодался

I can't find this text, because the dot in the abbreviation т.к. (Russian for "because") is treated as the end of a sentence.

As far as I can see, the set of delimiters is hardcoded in Sphinx's sources.

My question is: how can I improve sentence detection? The best option for me would be to use Yandex's Tomita parser or another NLP library with smart sentence detection.

Clericals answered 12/9, 2016 at 8:57 Comment(4)

Yes, it's hardcoded, but the rules around '.' should be set up such that T.K. isn't considered a sentence boundary, as it's an abbreviation. sphinxsearch.com/docs/current.html#conf-index-sp – Moolah 12/9, 2016 at 16:54

@barryhunter, yes, but isn't т.к. a standard abbreviation for Sphinx? How do I specify this abbreviation? In any case, other situations are possible: "Компании Yahoo! известна во всем мире." and other cases. I think the better way is to delegate segmentation to an external library... – Clericals 12/9, 2016 at 18:17

That's the thing: according to the rules, it should be counted as an abbreviation (as I understand it). It's rule-based rather than a list of specific abbreviations. Extending Sphinx to use more extensive rules would require modifying the source. – Moolah 13/9, 2016 at 9:44

@barryhunter, as I see it, there is another problem with abbreviations: "Вот и пришла осень в U.S.A. В лесу медведи жуют ягоды.". Sphinx glues these two sentences into one... – Clericals 13/9, 2016 at 10:2

Split the text into sentences with Yandex's Tomita parser. The result is text in which the sentences are separated by "\n".

Delete every ".", "!" and "?" from each sentence except the last one.

Build the Sphinx index from this preprocessed data.
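
The second step above could be sketched in Python roughly as follows. This is only an illustration, not the exact script used: it assumes the splitter's output is already available as a string with one sentence per line, and the function names strip_inner_terminators and preprocess are hypothetical. Appending a "." to a sentence that has no closing mark is also an assumption, made so that every line still ends in a boundary.

```python
import re

# Characters treated as sentence terminators in this thread.
TERMINATORS = ".!?"


def strip_inner_terminators(sentence: str, replacement: str = "") -> str:
    """Remove '.', '!' and '?' from one sentence, keeping only the final mark."""
    stripped = sentence.rstrip()
    # Separate the run of trailing terminators (e.g. "..." or "?!") from the body.
    body = stripped.rstrip(TERMINATORS)
    tail = stripped[len(body):]
    # Drop (or replace) every terminator left inside the sentence body.
    body = re.sub(r"[.!?]", replacement, body)
    # Assumption: a sentence without any closing mark gets a "." appended.
    last = tail[-1] if tail else "."
    return body.rstrip() + last


def preprocess(split_text: str) -> str:
    """split_text: output of an external sentence splitter, one sentence per line."""
    lines = [ln for ln in split_text.split("\n") if ln.strip()]
    return "\n".join(strip_inner_terminators(ln) for ln in lines)


if __name__ == "__main__":
    text = "Вася молодец, съел огурец, т.к. проголодался.\nТакие дела."
    print(preprocess(text))
```

With the default replacement of "", an abbreviation such as т.к. is indexed as the single token тк. Passing replacement=" " instead keeps the letters as separate tokens, which may or may not be preferable depending on the index's minimum word length.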

Clericals answered 20/9, 2016 at 4:6 Comment(0)
