How do I do word Stemming or Lemmatization?

Asked 21/4, 2009 at 10:7 Answered 11/4, 2023 at 15:14

114

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

Heaven answered 21/4, 2009 at 10:7 Comment(3)

Shouldn't that be cacti ? – Lent 21/4, 2009 at 11:19

Just to make a circular reference to the original question posted on Reddit: How do I programmatically do stemming? (e.g. "eating" to "eat", "cactuses" to "cactus") Posting it here because the comments include useful information. – Ishii 26/4, 2009 at 2:1

see https://mcmap.net/q/189870/-stemmers-vs-lemmatizers – Lombardy 9/3, 2014 at 14:56

145

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

Megalopolis answered 3/5, 2009 at 23:7 Comment(7)

Oh sad...before I knew to search S.O. I implemented my own! – Turboelectric 14/12, 2010 at 22:35

Do not forget to install the corpus before using nltk for the first time! velvetcache.org/2010/03/01/… – Nationalism 10/5, 2011 at 19:11

Well, this one uses some non-deterministic algorithm like Porter Stemmer, for if you try it with dies, it gives you dy instead of die. Isn't there some kind of hardcoded stemmer dictionary? – Normanormal 7/6, 2013 at 0:4

any idea what are the words that WordNetLemmatizer wrongly lemmatize? – Lombardy 27/6, 2013 at 11:51

In terms of performance (execution speed), is Lemmatization much slower than stemming? – Sayles 24/10, 2013 at 6:58

nltk WordNetLemmatizer requires a pos tag as argument. By default it is 'n' (standing for noun). So it will not work correctly for verbs. If POS tags are not available, a simple (but ad-hoc) approach is to do lemmatization twice, one for 'n', and the other for 'v' (standing for verb), and choose the result that is different from the original word (usually shorter in length, but 'ran' and 'run' have the same length). It seems that we don't need to worry about 'adj', 'adv', 'prep', etc, since they are already in the original form in some sense. – Stateroom 7/6, 2014 at 17:30

without POS, if input has, it output ha, so there are some problems on @Stateroom 's method – Schargel 25/12, 2014 at 11:29
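
The two-pass idea from the comments above could be sketched roughly as follows. This is only an illustration: `lemmatize_any_pos` and the stub `toy_lemmatize` are hypothetical names, and the stub's tiny lookup table stands in for a real lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize`, which takes the same `(word, pos)` arguments.

```python
def lemmatize_any_pos(word, lemmatize, pos_order=("n", "v")):
    """Try each POS in turn and return the first lemma that differs from the word.

    `lemmatize` is any callable taking (word, pos). If no pass changes the
    word, the word itself is returned.
    """
    for pos in pos_order:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word


# Hypothetical stand-in for a real lemmatizer, so the sketch runs on its own.
_TOY_LEMMAS = {
    ("cats", "n"): "cat",
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("has", "n"): "ha",    # the noun reading mentioned in the comments
    ("has", "v"): "have",
}


def toy_lemmatize(word, pos="n"):
    """Look the (word, pos) pair up in the toy table; unknown pairs are unchanged."""
    return _TOY_LEMMAS.get((word, pos), word)


for w in ["cats", "ran", "running", "has", "cactus"]:
    print(w, "->", lemmatize_any_pos(w, toy_lemmatize))
```

With the noun pass first, `has` comes out as `ha`, which is the problem noted above. Passing `pos_order=("v", "n")` changes which reading wins but only moves the ambiguity around, so real POS tags remain the more reliable option.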

I use Stanford NLP to perform lemmatization. I was stuck on a similar problem for the last few days, and Stack Overflow helped me solve it.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class example
{
    public static void main(String[] args)
    {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        // The pipeline variable must be declared with its type before use.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
        // Replace with the string you want to lemmatize.
        String text = "cats running ran cactus cactuses cacti community communities";
        Annotation document = pipeline.process(text);

        for(CoreMap sentence: document.get(SentencesAnnotation.class))
        {
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println("lemmatized version :" + lemma);
            }
        }
    }
}

It might also be a good idea to filter out stopwords to reduce the number of output lemmas if they are later fed to a classifier. Please take a look at the CoreNLP extension written by John Conwell.

Gage answered 2/3, 2012 at 10:47 Comment(3)

sorry for the late reply .. i got this issue solved only now ! :) – Gage 2/3, 2012 at 10:53

The line 'pipeline = new...' does not compile for me. If I change it to 'StanfordCoreNLP pipeline = new...' it compiles. Is this correct? – Hasen 27/10, 2013 at 20:46

Yes, you must declare the pipeline var first. The Stanford NLP can be used from command line as well so you don't have to do any programming, you just make the properties file and feed the executables with it. Read the docs: nlp.stanford.edu/software/corenlp.shtml – Kalin 3/7, 2014 at 13:29

I tried your list of terms on this snowball demo site and the results look okay....

cats -> cat
running -> run
ran -> ran
cactus -> cactus
cactuses -> cactus
community -> communiti
communities -> communiti

A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.
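
To illustrate the point about stems versus dictionary words, here is a deliberately tiny suffix stripper. It is not Snowball or Porter; the rule list and the name `toy_stem` are made up for this sketch. It only shows that related forms can collapse to one key even when that key is not a real word.

```python
# Ordered (suffix, replacement) rules; purely illustrative, not a real stemmer.
_RULES = [
    ("ies", "i"),
    ("es", ""),
    ("s", ""),
    ("y", "i"),
    ("e", ""),
]


def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least three characters."""
    for suffix, replacement in _RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


for a, b in [("community", "communities"), ("update", "updates"), ("cat", "cats")]:
    # Print both stems and whether the two forms end up under the same key.
    print(a, b, toy_stem(a), toy_stem(b), toy_stem(a) == toy_stem(b))
```

Both `community` and `communities` end up as `communiti` here, which is useless as a dictionary word but perfectly fine as a shared index key.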

Dusen answered 21/4, 2009 at 10:41 Comment(3)

The point is that stem("updates") == stem("update"), which it does (update -> updat) – Dusen 10/2, 2014 at 15:43

The software can do stem(x) == stem(y) but that's not answering the question completely – Avis 10/2, 2014 at 17:26

Careful with the lingo, a stem is not a base form of a word. If you want a base form, you need a lemmatizer. A stem is the largest part of a word that does not contain prefixes or suffixes. The stem of a word update is indeed "updat". The words are created from stems by adding endings and suffixes, e.g. updat-e, or updat-ing. (en.wikipedia.org/wiki/Word_stem) – Kalin 3/7, 2014 at 13:26

The stemmer vs. lemmatizer debate goes on. It's a matter of trading precision against efficiency. You should lemmatize to get linguistically meaningful units, and stem to use minimal computing juice while still indexing a word and its variations under the same key.

See Stemmers vs Lemmatizers

Here's an example with Python NLTK:

>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'

Lombardy answered 9/3, 2014 at 15:3 Comment(5)

As mentioned before, WordNetLemmatizer's lemmatize() can take a POS tag. So from your example: " ".join([wnl.lemmatize(i, pos=VERB) for i in sent.split()]) gives 'cat run run cactus cactuses cacti community communities'. – Malave 18/5, 2015 at 9:3

@NickRuiz, I think you meant pos=NOUN? BTW: Long time no see, hopefully we'll meet each other in conference soon =) – Lombardy 18/5, 2015 at 9:50

actually, no (Hopefully 'yes' to conferences, though). Because if you set pos=VERB you only do lemmatization on verbs. The nouns remain the same. I just had to write some of my own code to pivot around the actual Penn Treebank POS tags to apply the correct lemmatization to each token. Also, WordNetLemmatizer stinks at lemmatizing nltk's default tokenizer. So examples like does n't do not lemmatize to do not. – Malave 19/5, 2015 at 11:45

but, but port.stem("this") produces thi and port.stem("was") wa, even when the right pos is provided for each. – Flourishing 29/8, 2018 at 2:45

A stemmer doesn't return linguistically sound output. It just makes the text more "dense" (i.e. it contains a smaller vocabulary). See https://mcmap.net/q/189870/-stemmers-vs-lemmatizers and #51944311 – Lombardy 29/8, 2018 at 2:48

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

If you're really serious about good stemming though you're going to need to start with something like the Porter Algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with 800 some exceptions to a modified Porter algorithm.
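
The exceptions table described above might look something like this sketch. The `stem_with_exceptions` name, the sample entries, and the fallback rule are all invented for illustration; in practice the fallback would be a real Porter or Snowball implementation, and the table would grow from bad stems found in your own data.

```python
# Hypothetical exception table: word to look up -> stem to use instead.
EXCEPTIONS = {
    "ran": "run",
    "cacti": "cactus",
    "cactuses": "cactus",
    "cactus": "cactus",  # protect it from the naive plural rule below
}


def fallback_stem(word):
    """Placeholder for a real stemmer: strips a trailing 's' from longer words."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def stem_with_exceptions(word, exceptions=EXCEPTIONS, stemmer=fallback_stem):
    """Consult the exception table first; only unknown words reach the stemmer."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return stemmer(word)


words = "cats running ran cactus cactuses cacti".split()
print([stem_with_exceptions(w) for w in words])
```

Because `stemmer` is just a parameter, a real stemmer's `stem` method can be passed in without changing the lookup logic.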

Ladyinwaiting answered 21/4, 2009 at 10:59 Comment(2)

An ideal solution would learn these exceptions automatically. Have you had any experience with such a system? – Anemone 11/7, 2013 at 21:6

No. In our case the documents being indexed were the code & regulations for a specific area of law and there were dozens of (human) editors analyzing the indexes for any bad stems. – Ladyinwaiting 25/7, 2013 at 8:3

Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you'd like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.

Your sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.

import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def normalize_text(text):
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

    return [x.lower() for x in lemm_words]

print(normalize_text('cats running ran cactus cactuses cacti community communities'))
# ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']

print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities.'))
# ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community', '.']

Mayda answered 22/2, 2018 at 15:22 Comment(0)

http://wordnet.princeton.edu/man/morph.3WN

For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

Emeritaemeritus answered 21/4, 2009 at 16:42 Comment(0)

Look into WordNet, a large lexical database for the English language:

http://wordnet.princeton.edu/

There are APIs for accessing it in several languages.

Enslave answered 21/4, 2009 at 13:52 Comment(0)

This looks interesting: MIT Java WordnetStemmer: http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html

Rotator answered 29/10, 2012 at 6:37 Comment(1)

Welcome to SO, and thanks for your post, +1. It would be great if you could make a few comments on this stemmer's usage, performance etc. Just a link isn't usually considered a very good answer. – Bowie 29/10, 2012 at 9:24

Take a look at LemmaGen - open source library written in C# 3.0.

Results for your test words (http://lemmatise.ijs.si/Services)

cats -> cat
running
ran -> run
cactus
cactuses -> cactus
cacti -> cactus
community
communities -> community

Bala answered 17/5, 2014 at 22:27 Comment(0)

The top Python packages (in no specific order) for lemmatization are: spaCy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy's and gensim's implementations (the latter based on pattern) because they identify the POS tag of each word and assign the appropriate lemma automatically. This gives more relevant lemmas, keeping the meaning intact.

If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and then find the right lemma.

Lemmatization Example with spaCy:

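# Run the statements below in a terminal once.
pip install spacy
python -m spacy download en

import spacy

# Initialize spacy 'en' model (the 'en' shortcut is spaCy 2.x;
# spaCy 3 loads a full model name such as 'en_core_web_sm')
nlp = spacy.load('en', disable=['parser', 'ner'])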

sentence = "The striped bats are hanging on their feet for best"

# Parse
doc = nlp(sentence)

# Extract the lemma
" ".join([token.lemma_ for token in doc])
#> 'the strip bat be hang on -PRON- foot for good'

Lemmatization Example With Gensim:

# Note: gensim.utils.lemmatize needs the pattern package and was removed in gensim 4.0.
from gensim.utils import lemmatize
sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']

The above examples were borrowed from this lemmatization page.

Prudish answered 7/10, 2018 at 17:19 Comment(0)

Do a search for Lucene. I'm not sure if there's a PHP port, but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally, it and its community extras might have something interesting to look at. At the very least you can learn how it's done in one language so you can translate the "idea" into PHP.

Underwent answered 21/4, 2009 at 10:17 Comment(0)

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

Stodge answered 21/4, 2009 at 11:7 Comment(0)

The most current version of the stemmer in NLTK is Snowball.

You can find examples on how to use it here:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo

Shishko answered 6/4, 2012 at 3:14 Comment(0)

You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven Central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are threadsafe (the original JFlex had class fields for local variables unnecessarily). Instantiate the class and call morpha with the word you want to stem.

new MorphaStemmer().morpha("climbed") // goes to "climb"

Housewarming answered 23/5, 2012 at 17:53 Comment(0)

Lucene.NET has a built-in Porter stemmer. You can try that. But note that Porter stemming does not consider word context when deriving the stem. (Go through the algorithm and its implementation and you will see how it works.)

Noctilucent answered 27/9, 2009 at 8:50 Comment(0)

Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English Stemmer for C and Java.

He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.

From http://tartarus.org/~martin/PorterStemmer/index.html (emphasis mine)

The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.

Dr. Porter suggests using the English or Porter2 stemmers instead of the Porter stemmer. The English stemmer is what's actually used in the demo site, as @StompChicken answered earlier.

Take answered 8/3, 2014 at 10:59 Comment(0)

In Java, I use tartarus-snowball to stem words.

Maven:

<dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-snowball</artifactId>
        <version>3.0.3</version>
        <scope>test</scope>
</dependency>

Sample code:

import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.EnglishStemmer;

SnowballProgram stemmer = new EnglishStemmer();
String[] words = new String[]{
    "testing",
    "skincare",
    "eyecare",
    "eye",
    "worked",
    "read"
};
for (String word : words) {
    stemmer.setCurrent(word);
    stemmer.stem();
    // debug output; logger is assumed to be defined by the surrounding class
    logger.info("Origin: " + word + " > " + stemmer.getCurrent()); // result: test, skincar, eyecar, eye, work, read
}

Jacobine answered 15/12, 2014 at 4:7 Comment(0)

Try this one here: http://www.twinword.com/lemmatizer.php

I entered your query in the demo "cats running ran cactus cactuses cacti community communities" and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code

This is an API so you can connect to it from any environment. Here is what the PHP REST call may look like.

// These code snippets use an open-source library. http://unirest.io/php
$response = Unirest\Request::post([ENDPOINT],
  array(
    "X-Mashape-Key" => [API KEY],
    "Content-Type" => "application/x-www-form-urlencoded",
    "Accept" => "application/json"
  ),
  array(
    "text" => "cats running ran cactus cactuses cacti community communities"
  )
);

Chaim answered 22/4, 2015 at 3:45 Comment(0)

I highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy).

Lemmatized words are available by default in Spacy as a token's .lemma_ attribute and text can be lemmatized while doing a lot of other text preprocessing with textacy. For example while creating a bag of terms or words or generally just before performing some processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!

Dashiell answered 13/6, 2018 at 2:14 Comment(0)

import re
import pymorphy2
from pymorphy2 import MorphAnalyzer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download the required NLTK data once.
nltk.download('stopwords')
nltk.download('punkt')

stopwords_ru = stopwords.words("russian")
morph = MorphAnalyzer()


def to_lowercase(data):
    return data.lower()


def noise_remove(data, remove_numbers=True):
    # Replace URLs with a space.
    data = re.sub(r"(\w+:\/\/\S+)", " ", data)
    # Replace everything except Latin/Cyrillic letters and digits with a space.
    data = re.sub(r"([^0-9A-Za-zА-Яа-я])", " ", data)
    if remove_numbers:
        data = re.sub(r"\d+", " ", data)
    return data


def lemmatize(words):
    text = []
    for word in words:
        # Take the first (most probable) parse of the word.
        morph_word = morph.parse(word)[0]
        lemma = morph_word.normal_form
        # Keep nouns, adjectives, infinitives and short participles that are not stopwords.
        if morph_word.tag.POS in ['NOUN', 'ADJF', 'INFN', 'PRTS'] and lemma not in stopwords_ru:
            text.append(lemma)
    return text


def tokenize(text):
    # Drop tokens shorter than three characters. A new list is built because
    # removing items from a list while iterating over it skips elements.
    words = [elem for elem in text.split() if len(elem) >= 3]
    lemmatize_words = lemmatize(words)
    return ' '.join(lemmatize_words)

Elevation answered 11/4, 2023 at 15:14 Comment(1)

Please add some explanation for your code rather than posting code only. Additional explanation will be more helpful. – Vida 12/4, 2023 at 11:36

-1

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

df_plots = pd.read_excel("Plot Summary.xlsx", index_col = 0)
df_plots
# Printing first sentence of first row and last sentence of last row
nltk.sent_tokenize(df_plots.loc[1].Plot)[0] + nltk.sent_tokenize(df_plots.loc[len(df_plots)].Plot)[-1]

# Calculating length of all plots by words
df_plots["Length"] = df_plots.Plot.apply(lambda x : len(nltk.word_tokenize(x)))

print("Longest plot is for season", df_plots.Length.idxmax())
print("Shortest plot is for season", df_plots.Length.idxmin())


# What is this show about? (What are the top 3 words used, excluding the stop words, in all the seasons combined)

wnl = WordNetLemmatizer()


def wordnet_pos(tag):
    # Map a Penn Treebank tag to a WordNet POS letter; anything else is treated as a noun.
    return {"j": "a", "v": "v", "n": "n", "r": "r"}.get(tag[0].lower(), "n")


word_sample = ["struggled", "died"]
word_list = nltk.pos_tag(word_sample)
[wnl.lemmatize(word, pos = wordnet_pos(tag)) for word, tag in word_list]

# Figure out the stop words
stop = stopwords.words('english')

# Tokenize all the plots
df_plots["Tokenized"] = df_plots.Plot.apply(lambda x : nltk.word_tokenize(x.lower()))

# Remove the stop words (lists rather than generators, so the column can be reused)
df_plots["Filtered"] = df_plots.Tokenized.apply(lambda x : [word for word in x if word not in stop])

# Lemmatize each word
df_plots["POS"] = df_plots.Filtered.apply(lambda x : nltk.pos_tag(x))
df_plots["Lemmatized"] = df_plots.POS.apply(lambda x : [wnl.lemmatize(word, pos = wordnet_pos(tag)) for word, tag in x])


# Which season had the highest screenplay of "Jesse" compared to "Walt"?
# Screenplay of Jesse = (Occurrences of "Jesse") / (Occurrences of "Jesse" + Occurrences of "Walt")

df_plots.groupby("Season").Tokenized.sum()

df_plots["Share"] = df_plots.groupby("Season").Tokenized.sum().apply(lambda x : float(x.count("jesse") * 100)/float(x.count("jesse") + x.count("walter") + x.count("walt")))

print("The highest times Jesse was mentioned compared to Walter/Walt was in season", df_plots["Share"].idxmax())
# float(df_plots.Tokenized.sum().count('jesse')) * 100 / float(df_plots.Tokenized.sum().count('jesse') + df_plots.Tokenized.sum().count('walt') + df_plots.Tokenized.sum().count('walter'))

Importunacy answered 9/5, 2018 at 21:7 Comment(0)
