match POS tag and sequence of words

Asked 12/1, 2016 at 8:30 Answered 12/1, 2016 at 11:44

I have the following two strings with their POS tags:

Sent1: "something like how writer pro or phraseology works would be really cool."

[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]

Sent2: "more options like the syntax editor would be nice"

[('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

I am looking for a way to detect (return True) whether these strings contain the sequence "would" + "be" + adjective, regardless of the position of the adjective, as long as it comes after "would be". In the second string the adjective "nice" immediately follows "would be", but that is not the case in the first string.

The trivial case (no other word before the adjective; "would be nice") was solved in an earlier question of mine: detecting POS tag pattern along with specified words

I am now looking for a more general solution where optional words may occur before the adjective. I am new to NLTK and Python.

Liveryman answered 12/1, 2016 at 8:30 Comment(4)

@tripleee the funny thing is, it's the same OPer =) – Amin 12/1, 2016 at 12:19

It's not unusual to see the same question posted by the same person multiple times. If the OP is interested specifically in the difference between this question and the previous question, then the question requires a major overhaul to clarify that this is what it is about. – Piece 12/1, 2016 at 12:20

@Piece The idea is different, but the approach is the same. Instead of looking directly after 'would be', just search all the tags after 'would be' is successfully found for 'JJ'. – Yippee 12/1, 2016 at 12:31

Assuming ignorance before malice, I have updated the question to attempt to accommodate your dispute. This should now instead be closed as "too broad" because there is no attempt from the OP to implement the rather simple change this question is now focusing on. – Piece 12/1, 2016 at 12:38

First install the nltk_cli as per the instructions: https://github.com/alvations/nltk_cli

Then, here's a secret function in nltk_cli, maybe you'll find it useful:

alvas@ubi:~/git/nltk_cli$ cat infile.txt 
something like how writer pro or phraseology works would be really cool .
more options like the syntax editor would be nice
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+ADJP infile.txt 
would be    really cool
would be    nice

To illustrate other possible usage:

alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+VP infile.txt 
!!! NO CHUNK of VP+VP in this sentence !!!
!!! NO CHUNK of VP+VP in this sentence !!!
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 NP+VP infile.txt 
how writer pro or phraseology works would be
the syntax editor   would be
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP infile.txt 
!!! NO CHUNK of VP+NP in this sentence !!!
!!! NO CHUNK of VP+NP in this sentence !!!

Then, if you want to check whether the phrase is in the sentence and output True/False, simply read and iterate through the outputs from nltk_cli and check with if-else conditions.
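A minimal sketch of that last step, assuming the output of senna.py has already been captured as text (for example by redirecting it to a file) and that a sentence without a match produces a line starting with "!!! NO CHUNK", as in the runs above. The function name chunk_found is hypothetical and not part of nltk_cli.

```python
def chunk_found(output_lines):
    """Return one True/False per non-blank output line.

    True means a chunk pair was printed, False means the
    "!!! NO CHUNK ..." marker was printed instead.
    """
    results = []
    for line in output_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the captured output
        results.append(not line.startswith("!!! NO CHUNK"))
    return results


# Captured output, one line per input sentence (sample data for illustration)
captured = [
    "would be\treally cool",
    "would be\tnice",
    "!!! NO CHUNK of VP+VP in this sentence !!!",
]
print(chunk_found(captured))
```

The check relies only on the marker text shown in the runs above, so it would need adjusting if that message changes.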

Amin answered 12/1, 2016 at 11:32 Comment(8)

Note: Repeated chunk patterns are separated by | and the first and second chunk tags within a pattern are separated by \t. Have fun! – Amin 12/1, 2016 at 11:35

Note that the solution works at a chunk level and not POS tags, possibly that's more appropriate since you want to capture phrases like "would be" + "really cool" instead of "would" + "be" + "really" + "cool" – Amin 12/1, 2016 at 11:41

This is probably what the OP really should be doing, but the question really asks for an adjective anywhere after "would be" so the ADJP constraint removes many of these cases (though all of these are probably false positives if we can guess what the OP really wants to accomplish). – Piece 12/1, 2016 at 12:52

chunking sounds interesting, however, I want to match all possible adjectives, so in this case "nice, awesome, great, cool" etc. That is why I wanted to keep the POS tag, or else, I'd have to think of a list of adjectives – Liveryman 12/1, 2016 at 13:15

I think this is the right answer. In @Amin 's first example verb phrases and adjective phrases are being chunked. – Yippee 12/1, 2016 at 13:28

ADJP would capture all possible adjectives. Hacking the nltk_cli a little would achieve the keeping of POS tags. But if you need help in the hack, I could take some time to allow POS outputs too, I've just got to think how to accommodate them such that I don't mess up the output format. Maybe XML/JSON output would be more appropriate. – Amin 12/1, 2016 at 13:29

can I specify which verb phrase (so, for instance, I don't want to match all verb phrases but only certain ones)? I'll have a closer look at the nltk_cli. Just want to check what is possible... – Liveryman 12/1, 2016 at 14:36

Sadly, senna doesn't have fine-grained VP. But you can easily filter afterwards, by the tokens (leaves) instead of the labels (nodes) =) – Amin 12/1, 2016 at 15:15
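The token-based filtering mentioned in the last comment could look roughly like this. It assumes each output line holds the two chunks separated by a tab (as the note above describes) and it does not handle the | separator for repeated patterns. The helper keep_verb_phrase is hypothetical, not part of nltk_cli or senna.

```python
def keep_verb_phrase(output_lines, wanted="would be"):
    """Keep (first_chunk, second_chunk) pairs whose first chunk equals `wanted`."""
    kept = []
    for line in output_lines:
        # Skip the "no match" marker and anything without a tab separator
        if line.startswith("!!! NO CHUNK") or "\t" not in line:
            continue
        first, second = line.rstrip("\n").split("\t", 1)
        if first.strip().lower() == wanted:
            kept.append((first.strip(), second.strip()))
    return kept


# Sample lines for illustration; "could be\tbetter" is made up to show the filter
lines = [
    "would be\treally cool",
    "could be\tbetter",
    "!!! NO CHUNK of VP+ADJP in this sentence !!!",
]
print(keep_verb_phrase(lines))
```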

Would this help?

s1=[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]

flag = False  # becomes True once "would be" has been seen
for i,j in zip(s1[:-1],s1[1:]):
    if i[0]+" "+j[0] == "would be":
        flag = True
    if flag and (i[-1] == "JJ" or j[-1] == "JJ"):
        print("would be adjective found in the tagged string")
        break  # report the match only once

Paraffinic answered 12/1, 2016 at 11:44 Comment(0)

It seems you could just search consecutive tokens for "would" followed by "be", and then look for the first instance of the tag "JJ" after them. Something like this:

import nltk

def has_would_be_adj(S):
    # make pos tags
    pos = nltk.pos_tag(S.split())
    # Search consecutive tags for "would", "be"
    j = None  # index of found "would"
    for i, (x, y) in enumerate(zip(pos[:-1], pos[1:])):
        if x[0] == "would" and y[0] == "be":
            j = i
            break
    if j is None or len(pos) < j + 2:
        return False
    a = None  # index of found adjective
    for i, (word, tag) in enumerate(pos[j + 2:]):
        if tag == "JJ":
            a = i + j + 2  # index of the adjective within pos
            break
    if a is None:
        return False
    print("Found adjective {} at {}".format(pos[a], a))
    return True

S = "something like how writer pro or phraseology works would be really cool."
print(has_would_be_adj(S))

I'm sure this could be written more compactly and cleanly, but it does what it says on the box :)
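One possible compact variant, sketched here on an already tagged list such as those in the question, so no tagger is needed at call time. Treating every tag that starts with JJ (JJ, JJR, JJS) as an adjective is an assumption that goes beyond the question's examples, and the name would_be_then_adjective is made up for illustration.

```python
def would_be_then_adjective(tagged):
    """True if "would be" occurs and an adjective tag appears anywhere after it."""
    words = [word.lower() for word, _ in tagged]
    for i in range(len(words) - 1):
        if words[i] == "would" and words[i + 1] == "be":
            # Assumption: any tag starting with "JJ" counts as an adjective.
            # Checking the first "would be" is enough, because everything
            # after a later occurrence also comes after the first one.
            return any(tag.startswith("JJ") for _, tag in tagged[i + 2:])
    return False


sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'),
         ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'),
         ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
sent2 = [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'),
         ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

print(would_be_then_adjective(sent1), would_be_then_adjective(sent2))
```

Note that the adjective "more" (JJR) in the second sentence does not count, because it comes before "would be".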

Kainite answered 12/1, 2016 at 10:4 Comment(0)

from itertools import tee, dropwhile
import nltk
def check_sentence(S):
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)  # built-in zip works on both Python 2 and 3


    def consecutive_would_be(word_group):
        # Predicate for dropwhile: True while the pair is NOT "would be",
        # so every pair before the first "would be" is skipped
        first, second = word_group
        (would_word, _) = first
        (be_word, _) = second
        return would_word.lower() != "would" or be_word.lower() != "be"


    # Iteration starts at the ("would", "be") pair; look for an adjective from there on
    for word_groups in dropwhile(consecutive_would_be, pairwise(nltk.pos_tag(nltk.word_tokenize(S)))):
        first, second = word_groups
        (_, pos1) = first
        (_, pos2) = second
        if pos1 == "JJ" or pos2 == "JJ":
            return True
    return False

and then you can use the function like so:

S = "more options like the syntax editor would be nice."  
check_sentence(S)

Baerman answered 12/1, 2016 at 11:13 Comment(0)

Check StackOverflow Link

import nltk
from nltk.tokenize import word_tokenize
def would_be(tagged):
    # Matches only when the adjective comes directly after "would be"
    return any(['would', 'be', 'JJ'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][1]] for i in range(len(tagged) - 2))

S = "more options like the syntax editor would be nice."
pos = nltk.pos_tag(word_tokenize(S))
would_be(pos)

Also check this code:

from nltk.tokenize import word_tokenize
import nltk
def checkTag(S):
    pos = nltk.pos_tag(word_tokenize(S))
    # Stop one short of the end so pos[ind+1] is always a valid index
    for ind in range(len(pos) - 1):
        if pos[ind][0] == 'would' and pos[ind+1][0] == 'be':
            # Only adjectives that come after "would be" count
            return any(tag == 'JJ' for _, tag in pos[ind+2:])
    return False

S = "something like how writer pro or phraseology works would be really cool."
print(checkTag(S))

Endothelium answered 12/1, 2016 at 10:38 Comment(2)

This code still doesn't work. This checks for 'would' followed by 'be' followed directly by an adjective. – Yippee 12/1, 2016 at 12:12

@Amin Perhaps plagiarized answer is more appropriate. – Yippee 12/1, 2016 at 12:28
