Finding the position of nouns and verbs in a sentence in Python

Is there a way to find the position of the words with pos-tag 'NN' and 'VB' in a sentence in Python?

Example of sentences in a CSV file: "Man walks into a bar." "Cop shoots his gun." "Kid drives into a ditch"

Forman answered 9/3, 2022 at 12:26 Comment(0)

You can find the positions of certain PoS tags in a text using one of the existing NLP frameworks, such as spaCy or NLTK. Once you have processed the text, you can iterate over the tokens, check whether each token's PoS tag is one you are looking for, and then get the start/end position of that token in your text.

spaCy

Using spaCy, the code to implement what you want would be something like this:

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Man walks into a bar.")  # Your text here

words = []
for token in doc:
    if token.pos_ == "NOUN" or token.pos_ == "VERB":
        start = token.idx  # Start position of token
        end = token.idx + len(token)  # End position = start + len(token)
        words.append((token.text, start, end, token.pos_))

print(words)

In short, I build a new document from the string, iterate over all the tokens and keep only those whose PoS tag is VERB or NOUN. Finally, I add the token info to a list for further processing. I strongly recommend that you read the spaCy tutorial for more information.

NLTK

I think it is pretty straightforward with NLTK too, using the NLTK tokenizer and PoS tagger. The rest is almost analogous to how we do it with spaCy.

What I'm not sure about is the best way to get the start and end positions of each token. Note that for this I am using the WhitespaceTokenizer().span_tokenize() method, which yields (start, end) tuples with the positions of each token. Maybe there is a simpler and more NLTK-like way of doing it.

import nltk
from nltk.tokenize import WhitespaceTokenizer

# nltk.pos_tag needs a tagger model; if it is missing, NLTK raises a LookupError naming the resource to download
text = "Man walks into a bar."  # Your text here
tokenizer = WhitespaceTokenizer()
spans = list(tokenizer.span_tokenize(text))  # Start/end positions: [(0, 3), (4, 9), ...]
tokens = tokenizer.tokenize(text)  # Token strings: ["Man", "walks", "into", ...]
# Note: splitting on whitespace leaves punctuation attached, e.g. "bar."

tagged = nltk.pos_tag(tokens)  # Run the part-of-speech tagger: [(word, tag), ...]

# Keep tokens whose Penn Treebank tag is a noun (NN, NNS, ...) or a verb (VB, VBZ, ...)
words = []
for (word, tag), (start, end) in zip(tagged, spans):
    if tag.startswith(("NN", "VB")):
        words.append((word, start, end, tag))

print(words)
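
On the open question of how to get start/end positions: one tokenizer-independent option is to recover the offsets by scanning the original string for each token in order. The following is only a sketch under that assumption, not an NLTK API. `token_spans` is a hypothetical helper name, and it only works when every token appears verbatim in the text (some tokenizers rewrite characters, for example nltk.word_tokenize changes double quotes, and those tokens would not be found).

```python
def token_spans(text, tokens):
    """Return (start, end) character offsets for tokens that occur in order in text."""
    spans = []
    cursor = 0  # Search resumes where the previous token ended
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = "Man walks into a bar."
# Tokens with punctuation split off, as a word-level tokenizer would typically return them
tokens = ["Man", "walks", "into", "a", "bar", "."]
spans = token_spans(text, tokens)
print(spans)
```

With offsets computed this way, "bar" and the final period get separate spans, which whitespace splitting alone does not give you.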

I hope this works for you!

Phipps answered 9/3, 2022 at 12:49 Comment(3)

Thank you, this is really helpful. But what if I want to use a CSV file that contains more than one sentence, instead of assigning a single sentence to text? – Forman 9/3, 2022 at 13:57

Hi, can you please edit your original question to add code and an example of what you expect? I guess that if you have a CSV file with multiple lines and you expect a result for each line, you can process each line separately inside a for loop. Please give me more information so I can help you – Phipps 11/3, 2022 at 15:11

Sorry for the vague question with little to no information. Here is a question I asked which might be clearer - https://mcmap.net/q/1772489/-position-of-that-noun-and-verb/17973460 – Forman 13/3, 2022 at 2:12
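
For the CSV follow-up in the comments, a rough sketch of the per-row loop might look like this. It assumes one sentence per cell, which the question does not confirm. `positions_per_sentence` and `word_spans` are made-up names, and `word_spans` is only a placeholder that returns the span of every whitespace-separated word so the sketch runs without spaCy or NLTK. In practice you would pass a function that does the PoS filtering shown in the answer.

```python
import csv
import io
import re


def word_spans(sentence):
    """Placeholder: (word, start, end) for every whitespace-separated word."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


def positions_per_sentence(csv_file, find_positions):
    """Apply find_positions to every non-empty cell of an open CSV file object."""
    results = []
    for row in csv.reader(csv_file):
        for cell in row:
            sentence = cell.strip()
            if sentence:
                results.append((sentence, find_positions(sentence)))
    return results


# Stand-in for a real file, e.g. open("sentences.csv", newline="") with a hypothetical file name
sample = io.StringIO(
    '"Man walks into a bar."\n"Cop shoots his gun."\n"Kid drives into a ditch"\n'
)
results = positions_per_sentence(sample, word_spans)
for sentence, positions in results:
    print(sentence, positions)
```

Passing the position finder in as an argument keeps the CSV handling separate from the choice of tagger.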

You should take a look at NLTK.

From the doc:

import nltk
text = nltk.tokenize.word_tokenize("They refuse to permit us to obtain the refuse permit")


nltk.pos_tag(text)

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
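
If "position" means the word index rather than a character offset, the tagged list above is already enough. A small sketch, assuming you want every Penn Treebank noun tag (NN, NNS, ...) and verb tag (VB, VBP, ...) rather than exactly 'NN' and 'VB'. The `tagged` list is copied from the output above so the snippet runs without NLTK installed.

```python
# Output of nltk.pos_tag for the example sentence, copied from above
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"), ("us", "PRP"),
    ("to", "TO"), ("obtain", "VB"), ("the", "DT"), ("refuse", "NN"), ("permit", "NN"),
]

# enumerate gives the 0-based word position; startswith accepts a tuple of prefixes
positions = [
    (index, word, tag)
    for index, (word, tag) in enumerate(tagged)
    if tag.startswith(("NN", "VB"))
]
print(positions)
# [(1, 'refuse', 'VBP'), (3, 'permit', 'VB'), (6, 'obtain', 'VB'), (8, 'refuse', 'NN'), (9, 'permit', 'NN')]
```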

Cyanide answered 9/3, 2022 at 12:48 Comment(1)

nltk.org/book/ch05.html – Cyanide 9/3, 2022 at 12:48
