match POS tag and sequence of words
L

5

0

I have the following two strings with their POS tags:

Sent1: "something like how writer pro or phraseology works would be really cool."

[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]

Sent2: "more options like the syntax editor would be nice"

[('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

I am looking for a way to detect (return True) whether these strings contain the sequence "would" + "be" + adjective, regardless of the position of the adjective, as long as it comes after "would be". In the second string the adjective "nice" immediately follows "would be", but that is not the case in the first string.

The trivial case (no other word before the adjective; "would be nice") was solved in an earlier question of mine: detecting POS tag pattern along with specified words

I am now looking for a more general solution where optional words may occur before the adjective. I am new to NLTK and Python.

Liveryman answered 12/1, 2016 at 8:30 Comment(4)
@tripleee the funny thing is, it's the same OPer =)Amin
It's not unusual to see the same question posted by the same person multiple times. If the OP is interested specifically in the difference between this question and the previous question, then the question requires a major overhaul to clarify that this is what it is about.Piece
@Piece The idea is different, but the approach is the same. Instead of looking directly after 'would be', just search all the tags after 'would be' is successfully found for 'JJ'.Yippee
Assuming ignorance before malice, I have updated the question to attempt to accommodate your dispute. This should now instead be closed as "too broad" because there is no attempt from the OP to implement the rather simple change this question is now focusing on.Piece
A
3

First install the nltk_cli as per the instructions: https://github.com/alvations/nltk_cli

Then, here's a secret function in nltk_cli, maybe you'll find it useful:

alvas@ubi:~/git/nltk_cli$ cat infile.txt 
something like how writer pro or phraseology works would be really cool .
more options like the syntax editor would be nice
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+ADJP infile.txt 
would be    really cool
would be    nice

To illustrate other possible usage:

alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+VP infile.txt 
!!! NO CHUNK of VP+VP in this sentence !!!
!!! NO CHUNK of VP+VP in this sentence !!!
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 NP+VP infile.txt 
how writer pro or phraseology works would be
the syntax editor   would be
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP infile.txt 
!!! NO CHUNK of VP+NP in this sentence !!!
!!! NO CHUNK of VP+NP in this sentence !!!

Then, if you want to check whether the phrase is in the sentence and output True/False, simply read and iterate through the outputs from nltk_cli and check them with if-else conditions.
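A minimal sketch of that last step, assuming the `--chunk2 VP+ADJP` output has already been captured as a list of text lines (for example by redirecting it to a file). It relies on the output format described in the comments below: chunks within a pattern separated by a tab, repeated patterns separated by `|`, and a `!!! NO CHUNK ...` line when nothing matched. The function name `line_has_would_be_adjp` is hypothetical and not part of nltk_cli.

```python
def line_has_would_be_adjp(output_line):
    """Return True if one VP+ADJP output line contains 'would be' + an ADJP chunk.

    Assumes the format described for nltk_cli's --chunk2 output:
    chunks are tab-separated, repeated patterns are separated by '|'.
    """
    line = output_line.strip()
    # Sentences without a match are reported with a "!!! NO CHUNK ..." line.
    if not line or line.startswith("!!!"):
        return False
    for pattern in line.split("|"):
        chunks = [chunk.strip() for chunk in pattern.split("\t")]
        # Expect exactly two chunks: the verb phrase and the adjective phrase.
        if len(chunks) == 2 and chunks[0].lower() == "would be" and chunks[1]:
            return True
    return False


# Hypothetical captured output, one line per input sentence.
captured = [
    "would be\treally cool",
    "would be\tnice",
    "!!! NO CHUNK of VP+ADJP in this sentence !!!",
]
results = [line_has_would_be_adjp(line) for line in captured]
print(results)
```

Filtering on the tokens of the first chunk is what restricts the match to "would be" rather than to any verb phrase.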

Amin answered 12/1, 2016 at 11:32 Comment(8)
Note: Repeated chunk patterns are separated by | and the first and second chunk tags within a pattern are separated by \t. Have fun!Amin
Note that the solution works at a chunk level and not POS tags, possibly that's more appropriate since you want to capture phrases like "would be" + "really cool" instead of "would" + "be" + "really" + "cool"Amin
This is probably what the OP really should be doing, but the question really asks for an adjective anywhere after "would be" so the ADJP constraint removes many of these cases (though all of these are probably false positives if we can guess what the OP really wants to accomplish).Piece
chunking sounds interesting, however, I want to match all possible adjectives, so in this case "nice, awesome, great, cool" etc. That is why I wanted to keep the POS tag, or else, I'd have to think of a list of adjectivesLiveryman
I think this is the right answer. In @Amin 's first example verb phrases and adjective phrases are being chunked.Yippee
ADJP would capture all possible adjectives. Hacking the nltk_cli a little would achieve the keeping of POS tags. But if you need help in the hack, I could take some time to allow POS outputs too, I've just got to think how to accommodate them such that I don't mess up the output format. Maybe XML/JSON output would be more appropriate.Amin
Can I specify which verb phrase to match (so, for instance, I don't want to match all verb phrases but only certain ones)? I'll have a closer look at the nltk_cli. Just want to check what is possible...Liveryman
Sadly, senna doesn't have fine-grained VP. But you can easily filter afterwards, by the tokens (leaves) instead of the labels (nodes) =)Amin
P
1

Would this help?

s1=[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]

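flag = False  # becomes True once "would be" has been seen
for i,j in zip(s1[:-1],s1[1:]):
    if i[0]+" "+j[0] == "would be":
        flag = True
    # only look for an adjective after "would be" has been found
    if flag and (i[-1] == "JJ" or j[-1] == "JJ"):
        print("would be adjective found in the tagged string")
        break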
Paraffinic answered 12/1, 2016 at 11:44 Comment(0)
K
0

It seems you could just search consecutive tokens for "would" followed by "be", and then look for the first instance of the tag "JJ" after them. Something like this:

import nltk

def has_would_be_adj(S):
    # make pos tags
    pos = nltk.pos_tag(S.split())
    # Search consecutive tags for "would", "be"
    j = None  # index of found "would"
    for i, (x, y) in enumerate(zip(pos[:-1], pos[1:])):
        if x[0] == "would" and y[0] == "be":
            j = i
            break
    if j is None or len(pos) < j + 2:
        return False
    a = None  # index of found adjective
    for i, (word, tag) in enumerate(pos[j + 2:]):
        if tag == "JJ":
            a = i + j + 2  # index into the full tagged list
            break
    if a is None:
        return False
    print("Found adjective {} at {}".format(pos[a], a))
    return True

S = "something like how writer pro or phraseology works would be really cool."
print(has_would_be_adj(S))

I'm sure this could be written more compactly and cleanly, but it does what it says on the box :)
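For comparison, here is a more compact sketch of the same idea. It works directly on an already tagged list of (word, tag) tuples, like the ones in the question, so it runs without calling the tagger. The name `would_be_then_adjective` is hypothetical, and accepting every tag that starts with "JJ" (so JJR and JJS as well) is an assumption on my part; use `tag == "JJ"` if only plain adjectives should count.

```python
def would_be_then_adjective(tagged):
    """Return True if "would" "be" is followed, anywhere later, by an adjective.

    `tagged` is a list of (word, tag) tuples, e.g. the output of a POS tagger.
    """
    words = [word.lower() for word, _ in tagged]
    for i in range(len(words) - 1):
        if words[i] == "would" and words[i + 1] == "be":
            # Only the tokens after "would be" are inspected.
            return any(tag.startswith("JJ") for _, tag in tagged[i + 2:])
    return False


sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'),
         ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'),
         ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
sent2 = [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'),
         ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'),
         ('nice', 'JJ')]

print(would_be_then_adjective(sent1))
print(would_be_then_adjective(sent2))
```

Returning at the first "would be" is enough: if no adjective follows the first occurrence, none can follow a later one either.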

Kainite answered 12/1, 2016 at 10:4 Comment(0)
B
0
from itertools import tee, dropwhile
import nltk
def check_sentence(S):
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)  # Python 3: zip is already lazy, so izip is not needed


    def consecutive_would_be(word_group):
        # True while the pair is NOT ("would", "be"), so dropwhile keeps skipping
        first, second = word_group
        (would_word, _) = first
        (be_word, _) = second
        return would_word.lower() != "would" or be_word.lower() != "be"


    for word_groups in dropwhile(consecutive_would_be, pairwise(nltk.pos_tag(nltk.word_tokenize(S)))):
        first, second = word_groups
        (_, pos1) = first
        (_, pos2) = second
        if pos1 == "JJ" or pos2 == "JJ":
            return True
    return False

and then you can use the function like so:

S = "more options like the syntax editor would be nice."  
check_sentence(S)
Baerman answered 12/1, 2016 at 11:13 Comment(0)
E
-1

Check StackOverflow Link

import nltk
from nltk.tokenize import word_tokenize
def would_be(tagged):
    return any(['would', 'be', 'JJ'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][1]] for i in range(len(tagged) - 2))

S = "more options like the syntax editor would be nice."  
pos = nltk.pos_tag(word_tokenize(S)) 
would_be(pos)   

Also check code

from nltk.tokenize import word_tokenize
import nltk
def checkTag(S):
    pos = nltk.pos_tag(word_tokenize(S))
    flag = 0
    for tag in pos:
        if tag[1] == 'JJ':
            flag = 1
    if flag:
        # stop one token early so pos[ind+1] is always a valid index
        for ind,tag in enumerate(pos[:-1]):
            if tag[0] == 'would' and pos[ind+1][0] == 'be':
                return True
        return False
    return False

S = "something like how writer pro or phraseology works would be really cool."
print(checkTag(S))
Endothelium answered 12/1, 2016 at 10:38 Comment(2)
This code still doesn't work. This checks for 'would' followed by 'be' followed directly by an adjective.Yippee
@Amin Perhaps plagiarized answer is more appropriate.Yippee
