Extracting noun+noun or (adj|noun)+noun from Text
Is it possible to extract noun+noun or (adj|noun)+noun sequences using the R package openNLP? That is, I would like to use linguistic filtering to extract candidate noun phrases. Could you show me how to do this? Many thanks.


Thanks for the responses. Here is the code:

library("openNLP")

acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in
        pipeline and terminal operations for 12.2 mln dlrs. The company said 
        the sale is subject to certain post closing adjustments, 
        which it did not explain. Reuter." 

# Tag every token; the result is one string of space-separated word/TAG items
acqTag <- tagPOS(acq)
acqTagSplit <- strsplit(acqTag, " ")
acqTagSplit

# Split each word/TAG item on "/" and keep the tag
tokens <- acqTagSplit[[1]]
tag <- character(length(tokens))

for (i in seq_along(tokens)) {
    qq <- strsplit(tokens[i], "/")
    tag[i] <- qq[[1]][2]
}

# Record the position of the first word of every
# noun+noun or adj+noun pair
index <- integer(0)

for (i in seq_len(length(tag) - 1)) {
    if (tag[i] %in% c("NN", "NNS", "JJ") &&
        tag[i + 1] %in% c("NN", "NNS")) {
        index <- c(index, i)
    }
}

index

Readers can use index to look up positions in acqTagSplit and extract the noun+noun or (adj|noun)+noun pairs. (The code is not optimal, but it works. If you have any ideas for improving it, please let me know.)

I have an additional problem:

Justeson and Katz (1995) proposed another linguistic filter for extracting candidate noun phrases:

((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun

I do not fully understand what it means. Could you explain it, or show how to code this filtering rule in R? Many thanks.

Groggy answered 5/1, 2011 at 3:34 Comment(3)
Posted what I think is a clean solution. Your later request is a considerable extension of the original question. I think you should close this one out and ask another question. - Tush
@DWin: I think not. It's just adding some extra conditions. Besides, translating that to R would be a question for text miners, not for programmers. I suggest that ssuhan read the article by Justeson and Katz to get its meaning. - Jerroldjerroll
@Joris: There were a couple of new operators, "+" and "?", that I did not understand. I thought they might translate to regex in some fashion that is unknown to me, and the citation was unavailable on a Web search. So I thought that reposting would be a better approach, since the original question had been answered both by the OP and by my efforts at streamlining. - Tush

It is possible.

EDIT:

You got it. Use the POS tagger and split on spaces: ll <- strsplit(acqTag,' '). From there, iterate over the tokens in ll[[1]] (note that ll itself has length 1), for example: for (i in seq_along(ll[[1]])) {qq <- strsplit(ll[[1]][i],'/')}, and pick out the part-of-speech sequence you're looking for.

After splitting on spaces it is just list processing in R.
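
A minimal sketch of that list processing, assuming the tagger returns one string of space-separated word/TAG items. The `tagged` string is hand-written for illustration and is not real tagPOS output; the variable names are hypothetical too.

```r
# Hand-written stand-in for tagger output (an assumption, not real tagPOS output)
tagged <- "the/DT sale/NN is/VBZ subject/JJ to/TO certain/JJ post/NN closing/NN adjustments/NNS"

# Split into word/TAG items, then split each item on "/"
tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]
parts  <- strsplit(tokens, "/", fixed = TRUE)
words  <- vapply(parts, `[`, character(1), 1)
tags   <- vapply(parts, `[`, character(1), 2)

# Compare each tag with the tag that follows it
n      <- length(tags)
first  <- tags[-n] %in% c("JJ", "NN", "NNS")  # adjective or noun
second <- tags[-1] %in% c("NN", "NNS")        # noun
idx    <- which(first & second)               # position of the first word

pairs <- paste(words[idx], words[idx + 1])
pairs
```

Comparing the tag vector with a shifted copy of itself avoids the explicit loop; `idx` plays the same role as the `index` vector in the question.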

Hammon answered 5/1, 2011 at 3:47 Comment(2)
Thanks carlosdc. Could you kindly give me some direction on how to program such a process? - Groggy
Thanks carlosdc. I wrote some code following your directions. Could you please give me some further recommendations? Thanks very much. - Groggy

I don't have an open console on which to test this, but have you tried tokenizing with tagPOS and then using grep to look for "noun", "noun", or perhaps calling paste(tagPOS(acq), collapse=".") and searching for "noun.noun"? gregexpr could then be used to extract the positions.
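
A rough sketch of the gregexpr idea, under a few assumptions: each tag is first reduced to a single letter (A for adjective, N for noun, P for preposition, x for anything else), so that a character position in the pasted string equals a token position. The words and tags are hand-written rather than real tagPOS output. The second pattern is one possible reading of the Justeson and Katz filter from the question, taking "+", "*" and "?" as the usual regex quantifiers and "Noun-Prep" as a noun followed by a preposition. Its first alternative is already covered by the second, so it is collapsed to a single branch here.

```r
# Hand-written words and tags standing in for tagger output (an assumption)
words <- c("the", "sale", "is", "subject", "to", "certain",
           "post", "closing", "adjustments")
tags  <- c("DT", "NN", "VBZ", "JJ", "TO", "JJ", "NN", "NN", "NNS")

# One letter per token, so character position == token position
cls <- rep("x", length(tags))
cls[grepl("^JJ", tags)] <- "A"   # adjectives
cls[grepl("^NN", tags)] <- "N"   # nouns
cls[tags == "IN"]       <- "P"   # prepositions (Penn Treebank tag IN)
code <- paste(cls, collapse = "")

# (adj|noun)+noun pairs; the lookahead lets overlapping pairs all match
pair_start <- as.integer(gregexpr("[AN](?=N)", code, perl = TRUE)[[1]])
pairs <- paste(words[pair_start], words[pair_start + 1])

# Justeson and Katz style filter, collapsed to a single branch
m     <- gregexpr("[AN]*(NP)?[AN]*N", code, perl = TRUE)[[1]]
keep  <- attr(m, "match.length") >= 2     # drop single nouns
start <- as.integer(m)[keep]
end   <- start + attr(m, "match.length")[keep] - 1
phrases <- mapply(function(s, e) paste(words[s:e], collapse = " "), start, end)

pairs
phrases
```

The single-branch form is used because with perl = TRUE the first alternative that succeeds wins, which could hide a longer noun-preposition match. The sketch does not handle the case where nothing matches (gregexpr then returns -1).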

EDIT: The format of the tagged output was a bit different than I remembered. I think this method of read.table()-ing after substituting "\n"s for spaces is much more efficient than what I see above:

 acqdf <- read.table(textConnection(gsub(" ", "\n", acqTag)), sep="/", stringsAsFactors=FALSE)
 acqdf$nnadj <- grepl("NN|JJ", acqdf$V2)
 acqdf$nnadj 
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[16] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
#[31]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)]
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#[16] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#[31] FALSE FALSE FALSE FALSE FALSE FALSE
 acqdf$pair <- c(NA, acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)])
 acqdf[1:7, ]

            V1  V2 nnadj  pair
1         Gulf NNP  TRUE    NA
2      Applied NNP  TRUE  TRUE
3 Technologies NNP  TRUE  TRUE
4          Inc NNP  TRUE  TRUE
5         said VBD FALSE FALSE
6           it PRP FALSE FALSE
7         sold VBD FALSE FALSE
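
To get the word pairs themselves rather than a logical column, something along these lines might work. It is only a sketch: the data frame is rebuilt by hand from the seven rows printed above, and, unlike the `nnadj` test, it requires the second word to be a noun so that noun+adjective pairs are not picked up. The names `is_head`, `is_noun`, `hit` and `bigrams` are made up for the example.

```r
# First seven rows of the tagged output shown above, rebuilt by hand
acqdf <- data.frame(
  V1 = c("Gulf", "Applied", "Technologies", "Inc", "said", "it", "sold"),
  V2 = c("NNP", "NNP", "NNP", "NNP", "VBD", "PRP", "VBD"),
  stringsAsFactors = FALSE
)

n <- nrow(acqdf)
is_head <- grepl("^(NN|JJ)", acqdf$V2)   # first word: noun or adjective
is_noun <- grepl("^NN", acqdf$V2)        # second word: noun only

# Row number of the first word in each qualifying pair
hit <- which(is_head[-n] & is_noun[-1])

bigrams <- paste(acqdf$V1[hit], acqdf$V1[hit + 1])
bigrams
```

For these rows the pairs found are the same ones that the `pair` column flags, shifted by one row because `hit` marks the first word of each pair.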
Tush answered 5/1, 2011 at 4:6 Comment(2)
Thanks for answering. Your idea really appeals to me, but I am still a beginner in R. Could you please give me more detailed directions? Many thanks. - Groggy
Thanks DWin. How great you were! (Thumbs up) - Groggy