How to detect sentence boundaries with OpenNLP and stringi?
I want to break the following string into sentences:

library(NLP) # NLP_0.1-7  
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")

I want to demonstrate two different ways. The first comes from the openNLP package:

library(openNLP) # openNLP_0.2-5  

sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")  
boundaries_sentences <- annotate(string, sentence_token_annotator)  
string[boundaries_sentences]  

[1] "Mr. Brown comes."   "He says hello."     "i give him coffee."  

The second comes from the stringi package:

library(stringi) # stringi_0.5-5  

stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))

[[1]]  
 [1] "Mr. "                              "Brown comes. "                    
 [3] "He says hello. i give him coffee."

With this second approach I still need to post-process the sentences to remove the extra spaces, or split the resulting strings into sentences again. Can I adjust the stringi call to improve the quality of the result?
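For example, a minimal cleanup sketch (it only trims the stray whitespace from each piece; it does not repair the wrong split after "Mr."):

# trim leading/trailing spaces from each piece (sketch only;
# the splits themselves are still the rule-based ICU ones)
pieces <- stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))[[1]]
stri_trim_both(pieces)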

On big data, openNLP is (very much) slower than stringi.
Is there a way to combine the speed of stringi with the quality of openNLP?

Mcmath answered 6/8, 2015 at 20:47 Comment(3)
If you don't get an answer here, you may have luck on the corpus linguistics with R forum. – Breeding
I opened this as an issue on stringi's GitHub page as well: github.com/Rexamine/stringi/issues/184 – Trilogy
OpenNLP and stringi differ in how they detect sentence boundaries: stringi seems to work with a set of rules, while openNLP works with a model obtained from a learning process. But I still don't see where the bottleneck lies... – Mcmath
Text boundary (in this case, sentence boundary) analysis in ICU (and thus in stringi) is governed by the rules described in UAX #29 (Unicode Text Segmentation); see also the ICU User Guide on the topic. We read:

[The Unicode rules] cannot detect cases such as “...Mr. Jones...”; more sophisticated tailoring would be required to detect such cases.

In other words, this cannot be done without a custom dictionary of non-stop words, which is in fact what openNLP implements. A few possible scenarios for incorporating stringi into this task would therefore include:

  1. Use stri_split_boundaries and then write a function that decides which incorrectly split tokens should be joined back together (see the sketch below this list).
  2. Manually insert non-breaking spaces into the text, possibly after the dots in etc., Mr., i.e., and so on (note that this is in fact required when preparing documents in LaTeX -- otherwise you get overly wide spaces between words).
  3. Incorporate a custom non-stop word list into a regex and apply stri_split_regex.

and so on.
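A rough sketch of the first scenario, assuming a small hypothetical whitelist of abbreviations (it re-joins spurious splits such as the one after "Mr.", but it cannot recover splits that stringi misses, e.g. before the lowercase "i give him coffee."):

library(stringi)

string <- "Mr. Brown comes. He says hello. i give him coffee."
abbrev <- c("Mr.", "Mrs.", "Dr.", "i.e.", "etc.")  # hypothetical whitelist, not a real dictionary

join_after_abbrev <- function(pieces, abbrev) {
  pieces <- stri_trim_both(pieces)
  out <- character(0)
  for (p in pieces) {
    # if the previously kept piece ends in a known abbreviation,
    # the split was spurious -- glue the current piece back on
    if (length(out) > 0 && any(stri_endswith_fixed(out[length(out)], abbrev))) {
      out[length(out)] <- stri_c(out[length(out)], p, sep = " ")
    } else {
      out <- c(out, p)
    }
  }
  out
}

pieces <- stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))[[1]]
join_after_abbrev(pieces, abbrev)
# expected: "Mr. Brown comes." and "He says hello. i give him coffee."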

Fusco answered 11/8, 2015 at 9:28 Comment(1)
This inspired a better solution below that you may be able to incorporate into stringi at some point. – Trilogy
This may be a viable regex solution:

string <- "Mr. Brown comes. He says hello. i give him coffee."
stringi::stri_split_regex(string, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")

## [[1]]
## [1] "Mr. Brown comes."   "He says hello."     "i give him coffee."

It performs less well on:

string <- "Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between"
Trilogy answered 11/8, 2015 at 1:50 Comment(0)
