Getting the base form of an English word

I am trying to get the base English word for an English word that has been modified from its base form. This question has been asked here before, but I didn't see a proper answer, so I am putting it this way. I tried two stemmers and one lemmatizer from the NLTK package: the Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer.

I tried this code:

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

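words = ['arrival', 'conclusion', 'ate']

# Build each tool once instead of once per word.
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

for word in words:
    print("\n\nOriginal Word =>", word)
    print("porter stemmer=>", porter_stemmer.stem(word))
    print("snowball stemmer=>", snowball_stemmer.stem(word))
    # With no part-of-speech argument, lemmatize() treats the word as a noun.
    print("WordNet Lemmatizer=>", lemmatizer.lemmatize(word))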

This is the output I get:

Original Word => arrival
porter stemmer=> arriv
snowball stemmer=> arriv
WordNet Lemmatizer=> arrival


Original Word => conclusion
porter stemmer=> conclus
snowball stemmer=> conclus
WordNet Lemmatizer=> conclusion


Original Word => ate
porter stemmer=> ate
snowball stemmer=> ate
WordNet Lemmatizer=> ate

But I want this output:

    Input : arrival
    Output: arrive

    Input : conclusion
    Output: conclude

    Input : ate
    Output: eat

How can I achieve this? Are there any tools already available for this? I am aware that this is called morphological analysis, but there must be some tools that already do it. Help is appreciated :)

First Edit

I tried this code:

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

query = "The Indian economy is the worlds tenth largest by nominal GDP and third largest by purchasing power parity"

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return wn.NOUN

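lemmatizer = WordNetLemmatizer()
tags = nltk.pos_tag(word_tokenize(query))
for token, tag in tags:
    # Map the Penn Treebank tag to the WordNet part of speech before lemmatizing.
    wn_tag = penn_to_wn(tag)
    print(token + "---> " + lemmatizer.lemmatize(token, wn_tag))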

Here, I tried to use the WordNet lemmatizer with the proper part-of-speech tags. Here is the output:

The---> The
Indian---> Indian
economy---> economy
is---> be
the---> the
worlds---> world
tenth---> tenth
largest---> large
by---> by
nominal---> nominal
GDP---> GDP
and---> and
third---> third
largest---> large
by---> by
purchasing---> purchase
power---> power
parity---> parity

Still, words like "arrival" and "conclusion" won't get processed with this approach. Is there any solution for this?
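One avenue that might be worth trying, sketched here under assumptions: the WordNet lemmatizer only undoes inflection, which is why "arrival" comes back unchanged, but WordNet also stores derivational links between lemmas, and NLTK exposes them as Lemma.derivationally_related_forms(). The helper names related_verbs and closest_by_prefix are hypothetical, and choosing the candidate with the longest shared prefix is an assumed heuristic, not something NLTK provides. The lookup also needs the WordNet corpus to be downloaded.

```python
import os.path


def related_verbs(noun):
    """Collect verbs that WordNet links derivationally to a noun."""
    from nltk.corpus import wordnet as wn  # lazy import; needs the wordnet corpus
    verbs = set()
    for lemma in wn.lemmas(noun, pos=wn.NOUN):
        for related in lemma.derivationally_related_forms():
            if related.synset().pos() == wn.VERB:
                verbs.add(related.name())
    return sorted(verbs)


def closest_by_prefix(word, candidates):
    """Pick the candidate sharing the longest prefix with word (assumed heuristic)."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(os.path.commonprefix([word, c])))


# Example with a hand-written candidate list, so no corpus is needed:
print(closest_by_prefix('arrival', ['come', 'arrive']))  # arrive
```

With the corpus installed, closest_by_prefix(word, related_verbs(word)) would be the full lookup. What it returns depends on the WordNet data, so no output is shown for that call.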

Lapland answered 7/11, 2014 at 7:1 Comment(0)

Try the word_forms package: clone it from here and run pip install -e word_forms.

from word_forms.word_forms import get_word_forms
get_word_forms('conclusion')

# gives:
{'a': {'conclusive'},
 'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
 'r': {'conclusively'},
 'v': {'concludes', 'concluded', 'concluding', 'conclude'}}

In your case, you'd like to get a verb form from a noun word form.
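As a rough sketch of that last step: the dictionary returned by get_word_forms groups forms by part of speech, so the verbs can be read from the 'v' key. Picking the shortest verb as the base form is only a heuristic assumed here, not something the package offers, and pick_base_verb and base_verb_of are hypothetical helper names.

```python
def pick_base_verb(forms):
    """Return the shortest verb in a get_word_forms-style dict, or None.

    Shortest-wins is an assumed heuristic: inflected forms usually add a suffix.
    """
    verbs = forms.get('v', set())
    if not verbs:
        return None
    # Break length ties alphabetically so the result is deterministic.
    return min(verbs, key=lambda w: (len(w), w))


def base_verb_of(word):
    """Look up a word with word_forms and pick a verb (needs the package installed)."""
    from word_forms.word_forms import get_word_forms  # imported lazily
    return pick_base_verb(get_word_forms(word))


# The dictionary shown above for 'conclusion'.
forms = {'a': {'conclusive'},
         'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
         'r': {'conclusively'},
         'v': {'concludes', 'concluded', 'concluding', 'conclude'}}
print(pick_base_verb(forms))  # conclude
```

The heuristic can fail for irregular verbs, where a past form such as "ate" is no longer than the base form, so treat the result as a candidate to check.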

Sardella answered 22/5, 2018 at 12:3 Comment(1)

You can also get the base form of a word by using lemmatize("conclusion") – Debt 6/9, 2022 at 3:50

Ok, so... for the word "ate" I think you're looking for NodeBox::Linguistics.

import en  # NodeBox English Linguistics

print(en.verb.present("gave"))
# gives: give

And I did not completely understand why you want the verb for "arrival" but not the one for "conclusion".

Tiffanietiffanle answered 15/11, 2014 at 15:17 Comment(3)

I had come across NodeBox before. You are right, I think in the case of "conclusion" the base form should be "conclude". I will edit. – Lapland 17/11, 2014 at 7:28

It works perfectly for words. But I am still having problems with words like "arrival" – Lapland 20/11, 2014 at 6:6

I've checked it and you're right. Remember: "NodeBox English Linguistics knows the verb tenses for about 10000 English verbs". It might not have all words, so you have to handle the missing ones manually. – Tiffanietiffanle 20/11, 2014 at 10:28
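One way the manual handling could look, as a minimal sketch: keep a small hand-written table for the words the library misses and consult it before anything else. MANUAL_BASE_FORMS and base_form are hypothetical names, the two entries are just the examples from the question, and the fallback argument stands in for whichever tool is used otherwise.

```python
# Hypothetical hand-maintained table for words the library misses.
MANUAL_BASE_FORMS = {
    'arrival': 'arrive',
    'conclusion': 'conclude',
}


def base_form(word, fallback=None):
    """Check the manual table first, then an optional fallback callable."""
    key = word.lower()
    if key in MANUAL_BASE_FORMS:
        return MANUAL_BASE_FORMS[key]
    if fallback is not None:
        return fallback(word)
    # Nothing known about this word: return it unchanged.
    return word


print(base_form('arrival'))  # arrive
# The lambda is a toy stand-in for a real lookup function.
print(base_form('walked', fallback=lambda w: w[:-2]))  # walk
```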
