How to use OpenNLP to get POS tags in R?
Here is the R code:

library(NLP)
library(openNLP)

tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

str <- "this is a the first sentence."
tagged_str <- tagPOS(str)

Output is:

tagged_str$POStagged
[1] "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

Now I want to extract only the NN word (i.e. "sentence") from the tagged output above and store it in a variable. Can anyone help me with this?

Guizot answered 23/6, 2015 at 6:19 Comment(1)
Very nice function, thanks for sharing. – Thorin

There might be more elegant ways to obtain the result, but this one should work:

q <- strsplit(unlist(tagged_str[1]), "/NN")
q <- tail(strsplit(unlist(q[1]), " ")[[1]], 1)
#> q
#[1] "sentence"
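A variant that does not depend on the NN token being the last word: split the tagged string into word/tag pairs and keep every token tagged exactly NN. This is a sketch; the tagged string from the question is inlined here so the snippet is self-contained.

```r
# Tagged output from the question, inlined for a self-contained example
tagged <- "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

# Split into "word/TAG" tokens, keep only those ending in /NN, drop the tag
tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]
nouns  <- sub("/NN$", "", tokens[grepl("/NN$", tokens)])
nouns
#> [1] "sentence"
```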

Hope this helps.

Corvese answered 23/6, 2015 at 8:14 Comment(4)
I can only work with the example you provide, and it works for what you have presented. If you encounter problems with other examples, feel free to post a new question. – Corvese
Suppose the examples are "2 tablespoons whole-egg mayonnaise", "1 teaspoon wholegrain mustard", "70g mixed salad leaves", "2 tomatoes, thinly sliced", "Bread and butter cucumbers, to serve", "90g Hakubaku organic dried soba noodles", "1 large carrot, peeled, cut into matchsticks", "1/2 bunch broccolini, cut into 5cm lengths", "60g baby corn, thinly sliced diagonally". Try with these examples; it might not work in some cases. – Guizot
Thank you for your comment, @HimaanshuGauba. I'm sorry to hear that my suggested solution does not give the desired result in some of the cases you have encountered. If the possibility of such pitfalls had been evident in your OP, I would have tried to provide an answer that works for those cases as well. – Corvese
Anyway, @Corvese, sorry for my late response to your comment. For the example that I mentioned in my question (not the comment) your code works fine, so I have marked this answer as correct. Cheers :) – Guizot

Here is a more general solution, where you describe the Treebank tag you want to extract as a regular expression. In your case, for instance, "NN" returns all noun types (NN, NNS, NNP, NNPS), while "NN$" returns just NN.

It operates on a character type, so if you have your texts as a list, you will need to lapply() it as in the examples below.

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

extractPOS <- function(x, thisPOSregex) {
    x <- as.String(x)
    wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
    POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
    POSwords <- subset(POSAnnotation, type == "word")
    tags <- sapply(POSwords$features, '[[', "POS")
    thisPOSindex <- grep(thisPOSregex, tags)
    tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
    untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
    untokenizedAndTagged
}

lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
## 
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
## 
## [[2]]
## [1] ""
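If only the words are needed, without the "/TAG" suffixes, the tagged string can be stripped back down with base R. A minimal sketch, using the first output line above as input:

```r
# Remove the trailing "/TAG" from each token of a tagged string
tagged <- "tagging/NN example/NN John/NNP Doe/NNP"
words  <- sub("/[^/]+$", "", strsplit(tagged, " ", fixed = TRUE)[[1]])
words
# c("tagging", "example", "John", "Doe")
```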
Sterne answered 23/6, 2015 at 10:30 Comment(0)

Here is another answer that uses the spaCy parser and tagger, from Python, and the spacyr package to call it.

This library is orders of magnitude faster and almost as good as the Stanford NLP models. It is still incomplete for some languages, but for English it is a good and promising option.

You first need to have Python installed, along with spaCy and a language model. Instructions are available in the spaCy documentation and on the spacyr page.

Then:

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: en)

spacy_parse(txt, pos = TRUE, tag = TRUE)
##    doc_id sentence_id token_id   token   lemma   pos tag   entity
## 1   text1           1        1    This    this   DET  DT         
## 2   text1           1        2      is      be  VERB VBZ         
## 3   text1           1        3       a       a   DET  DT         
## 4   text1           1        4   short   short   ADJ  JJ         
## 5   text1           1        5 tagging tagging  NOUN  NN         
## 6   text1           1        6 example example  NOUN  NN         
## 7   text1           1        7       ,       , PUNCT   ,         
## 8   text1           1        8      by      by   ADP  IN         
## 9   text1           1        9    John    john PROPN NNP PERSON_B
## 10  text1           1       10     Doe     doe PROPN NNP PERSON_I
## 11  text1           1       11       .       . PUNCT   .         
## 12  text2           1        1     Too     too   ADV  RB         
## 13  text2           1        2     bad     bad   ADJ  JJ         
## 14  text2           1        3 OpenNLP opennlp PROPN NNP         
## 15  text2           1        4      is      be  VERB VBZ         
## 16  text2           1        5      so      so   ADV  RB         
## 17  text2           1        6    slow    slow   ADJ  JJ         
## 18  text2           1        7      on      on   ADP  IN         
## 19  text2           1        8   large   large   ADJ  JJ         
## 20  text2           1        9   texts    text  NOUN NNS         
## 21  text2           1       10       .       . PUNCT   . 
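Because spacy_parse() returns an ordinary data.frame, pulling out the nouns (as in the original question) is plain subsetting. Sketched here on a hand-built frame mirroring the columns above, so it runs without spaCy installed:

```r
# Hand-built stand-in for the spacy_parse() result shown above
parsed <- data.frame(
  token = c("This", "is", "a", "short", "tagging", "example"),
  tag   = c("DT", "VBZ", "DT", "JJ", "NN", "NN"),
  stringsAsFactors = FALSE
)

# Keep only tokens tagged NN
nouns <- subset(parsed, tag == "NN")$token
nouns
#> [1] "tagging" "example"
```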
Sterne answered 24/6, 2017 at 15:22 Comment(0)