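Sphinx can search for words that occur within the same sentence. For example, take the following text (roughly: "Vasya did well and ate a cucumber because he got hungry. So it goes.", where т.к. abbreviates "так как", i.e. "because"):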
Вася молодец, съел огурец, т.к. проголодался. Такие дела.
If I search for
молодец SENTENCE огурец
I find this text. But if I search for
молодец SENTENCE проголодался
I can't find this text, because the dot in the abbreviation т.к. is treated as the end of a sentence.
As far as I can see, the set of sentence delimiters is hardcoded in Sphinx's source code.
My question: how can I improve sentence detection? The best option for me would be to use Yandex's Tomita parser or another NLP library with smarter sentence detection.
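To illustrate the kind of segmentation meant here, below is a minimal sketch of an abbreviation-aware sentence splitter. It is not Sphinx's algorithm and not Tomita's API. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names chosen for illustration, and the "next word starts with a lowercase letter" rule is only a heuristic.

```python
import re

# Hypothetical abbreviation list (an assumption); extend it for your corpus.
ABBREVIATIONS = {"т.к.", "т.е.", "т.д.", "т.п."}

# Candidate boundary: one or more of . ! ? followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]+\s+")


def split_sentences(text):
    """Split text into sentences, skipping boundaries that follow a known
    abbreviation or that precede a lowercase letter (heuristic)."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        # End of the punctuation run, excluding the trailing whitespace.
        punct_end = match.start() + len(match.group().rstrip())
        chunk = text[start:punct_end]
        tokens = chunk.split()
        last_token = tokens[-1].lower() if tokens else ""
        next_char = text[match.end():match.end() + 1]

        if last_token in ABBREVIATIONS:
            continue  # e.g. "т.к. проголодался"
        if next_char and not next_char.isupper():
            continue  # e.g. "Yahoo! известна"

        sentences.append(chunk.strip())
        start = match.end()

    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for s in split_sentences(
        "Вася молодец, съел огурец, т.к. проголодался. Такие дела."
    ):
        print(s)
```

With a splitter like this, the example text yields two sentences, and the first one keeps молодец and проголодался together. How to feed such boundaries back into Sphinx (for instance by preprocessing the text before indexing) is left open here.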
т.к. shouldn't be considered a sentence boundary, as it's an abbreviation. – Moolah
Is т.к. not a standard abbreviation for Sphinx? How can I specify this abbreviation? In any case, other situations are possible, such as "Компании Yahoo! известна во всем мире." I think the better way is to delegate segmentation to an external library... – Clericals