The reverse process of stemming

I use a Lucene Snowball analyzer to perform stemming. The results are not meaningful words. I referred to this question.

One of the solutions is to use a database that maps the stemmed version of a word to one stable version of the word (for example, from communiti to community, no matter which word produced communiti (communities or some other form)).

I want to know if there is a database which performs such a function.
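To illustrate the kind of lookup I mean, here is a rough sketch (in Java, since I am working with Lucene); the table contents and the canonicalFor helper are made up for illustration, not an existing resource:

import java.util.HashMap;
import java.util.Map;

public class StemLookup {
    // Hypothetical table mapping each stem to one stable, readable word.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("communiti", "community"); // community, communities -> communiti
    }

    // Return the stable word for a stem, falling back to the stem itself.
    static String canonicalFor(String stem) {
        return CANONICAL.getOrDefault(stem, stem);
    }

    public static void main(String[] args) {
        System.out.println(canonicalFor("communiti")); // prints: community
    }
}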

Etam answered 28/2, 2012 at 11:30 Comment(0)

It is theoretically impossible to recover a specific word from a stem, since one stem can be common to many words. One possibility, depending on your application, would be to build a database of stems each mapped to an array of several words. But you would then need to predict which one of those words is appropriate given a stem to re-convert.

As a very naive solution to this problem, if you know the words' part-of-speech tags, you could try storing words together with their tags in your database:

run:
   NN:  runner
   VBG: running
   VBZ: runs

Then, given the stem "run" and the tag "NN", you could determine that "runner" is the most probable word in that context. Of course, that solution is far from perfect. Notably, you'd need to handle the fact that the same word form might be tagged differently in different contexts. But remember that any attempt to solve this problem will be, at best, an approximation.
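As a minimal sketch of that lookup (the class and method names here are hypothetical, and the map only holds the example entries above):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class StemTagLookup {
    // stem -> (POS tag -> most probable word); contents are illustrative only.
    private static final Map<String, Map<String, String>> DB = new HashMap<>();
    static {
        Map<String, String> run = new HashMap<>();
        run.put("NN",  "runner");
        run.put("VBG", "running");
        run.put("VBZ", "runs");
        DB.put("run", run);
    }

    // Given a stem and a POS tag, return the stored word, falling back to the stem.
    static String unstem(String stem, String tag) {
        return DB.getOrDefault(stem, Collections.<String, String>emptyMap())
                 .getOrDefault(tag, stem);
    }

    public static void main(String[] args) {
        System.out.println(unstem("run", "VBG")); // prints: running
    }
}

A real resource along these lines would have to be built from a tagged corpus, and the same (stem, tag) pair can still correspond to several words, so this only ever picks one plausible candidate.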

Edit: from the comments below, it looks like you probably want to use lemmatization instead of stemming. Here's how to get the lemmas of words using the Stanford Core NLP tools:

import java.util.*;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();

// tokenize -> split into sentences -> POS-tag -> lemmatize
props.put("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "Hello, world!";
Annotation document = pipeline.process(text);

for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);    // surface form
        String lemma = token.get(LemmaAnnotation.class);  // dictionary form
    }
}
Bandeen answered 28/2, 2012 at 22:21 Comment(5)
Words like run work fine with a stemmer. I am talking about words like efficiency: on stemming, efficiency becomes effici, which has no meaning. What I am planning to accomplish is to convert effici to efficiency no matter what produced effici (efficient / some other word).Etam
Then you are probably looking for lemmatization (finding the "base" form of the word - what would be listed in the dictionary), not stemming (finding the "root" of the word). A stem can have many base words - "efficient" -> "effici" -> "efficiency" makes no sense. One lemma corresponds to exactly one base word: "efficient" -> "efficient", "efficiency" -> "efficiency", "efficiencies" -> "efficiency". In stemming, you lose both inflection and the base word. In lemmatization, you only lose inflection. The code I just added in the post should get you started with the Stanford lemmatization tools.Bandeen
Does Stanford NLP perform stop word removal?Etam
I don't believe so. Can you detail what you mean by stop word removal in this context?Bandeen
Given a sentence, remove words like is, was, a and other such common words.Etam

The question you are referencing contains an important piece of information which is often overlooked. What you require is known as "lemmatisation": the reduction of inflected words to their canonical form. It is related to but different from stemming, and it is still an open research question. It is particularly hard for languages with more complex morphology (English is not that hard). Wikipedia has a list of software you can try. Another tool I have used is TreeTagger: it is really fast and reasonably accurate, although its primary purpose is part-of-speech tagging and lemmatisation is just a bonus. Try googling for "statistical lemmatisation" (yes, I do have strong feelings about statistical vs rule-based NLP).

Adrenocorticotropic answered 28/2, 2012 at 22:13 Comment(0)

You might look at the NCI Metathesaurus. Although mostly biomedical in nature, it offers examples of natural language processing and some open-source Java toolsets that you might find useful to browse.

Disinfest answered 28/2, 2012 at 21:55 Comment(0)

In case you want to do it in Python:

You may like this open-source project, which uses stemming and contains an algorithm for inverse stemming:

On this page of the project, there are explanations of how to do the inverse stemming. To sum up, it works as follows.

First, you stem some documents; here, short French-language strings with their stop words removed, for example: ['sup chat march trottoir', 'sup chat aiment ronron', 'chat ronron', 'sup chien aboi', 'deux sup chien', 'combien chien train aboi']

Then the trick is to have kept, for each stemmed word, counts of the original words that produced it: {'aboi': {'aboie': 1, 'aboyer': 1}, 'aiment': {'aiment': 1}, 'chat': {'chat': 1, 'chats': 2}, 'chien': {'chien': 1, 'chiens': 2}, 'combien': {'Combien': 1}, 'deux': {'Deux': 1}, 'march': {'marche': 1}, 'ronron': {'ronronner': 1, 'ronrons': 1}, 'sup': {'super': 4}, 'train': {'train': 1}, 'trottoir': {'trottoir': 1}}

Finally, you can now guess how to implement this yourself: given a stemmed word, simply take the original word with the highest count. You can refer to the project's own implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project.
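To make the counting trick concrete, here is a minimal sketch of the same idea, written in Java to match the rest of this page; it is not taken from that project, and the class and method names below are made up:

import java.util.HashMap;
import java.util.Map;

public class InverseStemmer {
    // stem -> (original word -> how many times it produced that stem)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Call once per (original word, stem) pair while stemming the corpus.
    void record(String original, String stem) {
        counts.computeIfAbsent(stem, s -> new HashMap<>())
              .merge(original, 1, Integer::sum);
    }

    // "Un-stem" by returning the most frequent original word seen for this stem.
    String unstem(String stem) {
        Map<String, Integer> candidates = counts.get(stem);
        if (candidates == null || candidates.isEmpty()) return stem;
        return candidates.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}

Fed with the counts above, unstem("chat") would return "chats" (count 2) rather than "chat" (count 1).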

Stibine answered 5/9, 2018 at 18:47 Comment(0)