Stemming English words with Lucene
I'm processing some English texts in a Java application, and I need to stem them. For example, from the text "amenities/amenity" I need to get "amenit".

The function looks like:

String stemTerm(String term){
   ...
}

I've found the Lucene Analyzer, but it looks way too complicated for what I need. http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/PorterStemFilter.html

Is there a way to use it to stem words without building an Analyzer? I don't understand all the Analyzer business...

EDIT: I actually need both stemming and lemmatization. Can Lucene do this?

Kcal answered 22/3, 2011 at 13:14 Comment(2)
Why do you need to stem the words yourself? Lucene has an analyzer called SnowballAnalyzer which you just instantiate with the stemmer name, e.g. new SnowballAnalyzer("English");.Cruz
Knuth-Pratt Algorithm Implementation fmi.uni-sofia.bg/fmi/logic/vboutchkova/sources/…Lumen
M
22
import org.apache.lucene.analysis.PorterStemmer;
...
String stemTerm (String term) {
    PorterStemmer stemmer = new PorterStemmer();
    return stemmer.stem(term);
}

See here for more details. If stemming is all you want to do, then you should use this instead of Lucene.

Edit: You should lowercase term before passing it to stem().

Molten answered 22/3, 2011 at 16:44 Comment(4)
Is it possible to combine the filter for stop words with the stemmer?Kcal
Do you want to filter stop words from a string with multiple words, or have you already tokenised (separated) the words and want to check just a single word? If it's just a single term like above, then just create a Set of all stop words and do a .contains().Molten
As of the current version of Lucene (3.5), PorterStemmer, although it exists, is not public. I'm not sure who/what uses it, but we can't.Circumflex
PorterStemmer no longer public (stupidly) - see also #15422985Blakeney
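The Set-plus-contains() suggestion in the comments can be combined with the lowercasing advice roughly like this. This is only a sketch: the class name StopWordStemPipeline, the short stop-word list, and the stemmer parameter are all assumptions made for illustration, not part of Lucene. Plug in whichever stemmer you end up using where the toy one appears.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

class StopWordStemPipeline {
    // Illustrative stop-word list (an assumption); real lists are much longer.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    // `stemmer` is a stand-in for whatever stemmer you use; it is not a Lucene type.
    static List<String> process(String text, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        // Lowercase first, then split on anything that is not an ASCII letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // drop empty fragments and stop words before stemming
            }
            out.add(stemmer.apply(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stemmer for demonstration only: strips a single trailing "s".
        UnaryOperator<String> toy =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        System.out.println(process("The amenities of the hotel", toy));
    }
}
```

Filtering before stemming keeps the stop-word list in plain surface forms, so you don't have to stem the list itself.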
B
29

SnowballAnalyzer is deprecated; you can use the Lucene Porter stemmer instead:

 PorterStemmer stem = new PorterStemmer();
 stem.setCurrent(word);
 stem.stem();
 String result = stem.getCurrent();

Hope this helps!

Biome answered 4/11, 2012 at 12:26 Comment(1)
PorterStemmer no longer public (stupidly) - see also #15422985Blakeney
P
7

Why aren't you using the "EnglishAnalyzer"? It's simple to use and I think it would solve your problem:

EnglishAnalyzer en_an = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "your_field", en_an);
String str = "amenities";
System.out.println("result: " + parser.parse(str)); //amenit

Hope it helps you!

Pathan answered 24/11, 2011 at 6:47 Comment(3)
What is this "your_field" doing? Documentation says a cryptic "the default field for query terms."Vicenta
That chops it down to words, but doesn't stem. Not for me at least.Fondness
It does very basic stemming. It does not take began and change it to begin.Gothar
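On the "began" vs. "begin" point: suffix-stripping stemmers don't handle irregular forms, which is where lemmatization comes in. One rough way to approximate it is to check a lookup table of irregular forms first and only fall back to the stemmer when there is no entry. The sketch below assumes a tiny hand-written table and a hypothetical class name (IrregularFormNormalizer); a real lemmatizer would use a full dictionary, and the stemmer parameter is again a placeholder for whatever stemmer you use.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

class IrregularFormNormalizer {
    // Tiny hand-written table for illustration only (an assumption);
    // a real lemmatizer relies on a full dictionary.
    static final Map<String, String> IRREGULAR = Map.of(
        "began", "begin",
        "begun", "begin",
        "went", "go",
        "mice", "mouse");

    // Look the word up in the irregular-form table; fall back to the stemmer otherwise.
    static String normalize(String word, UnaryOperator<String> stemmer) {
        String lower = word.toLowerCase(Locale.ROOT);
        String lemma = IRREGULAR.get(lower);
        return lemma != null ? lemma : stemmer.apply(lower);
    }
}
```

Note that this mixes dictionary lemmas with stems, so the outputs are only consistent if you normalize both the indexed text and the queries the same way.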
A
5

The previous example applies stemming to a search query, so if you are interested in stemming a full text you can try the following:

import java.io.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.snowball.*;
import org.apache.lucene.util.*;
...
public class Stemmer{
    public static String Stem(String text, String language){
        StringBuffer result = new StringBuffer();
        if (text!=null && text.trim().length()>0){
            StringReader tReader = new StringReader(text);
            Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_35,language);
            TokenStream tStream = analyzer.tokenStream("contents", tReader);
            TermAttribute term = tStream.addAttribute(TermAttribute.class);

            try {
                while (tStream.incrementToken()){
                    result.append(term.term());
                    result.append(" ");
                }
            } catch (IOException ioe){
                System.out.println("Error: "+ioe.getMessage());
            }
        }

        // If, for some reason, the stemming did not happen, return the original text
        if (result.length()==0)
            result.append(text);
        return result.toString().trim();
    }

    public static void main (String[] args){
        // Print the stemmed text so the result is not silently discarded
        System.out.println(Stemmer.Stem("Michele Bachmann amenities pressed her allegations that the former head of her Iowa presidential bid was bribed by the campaign of rival Ron Paul to endorse him, even as one of her own aides denied the charge.", "English"));
    }
}

The TermAttribute class has been deprecated and will no longer be supported in Lucene 4, but the documentation is not clear on what to use in its place.

Also, the PorterStemmer used in the first example is not available as a public class (it is hidden), so you cannot use it directly.

Hope this helps.

Amii answered 30/12, 2011 at 16:37 Comment(1)
Giancarlo's Answer is correct with a minor change of TermAttribute to CharTermAttribute as TermAttribute is deprecated.Shane
R
3

Here is how you can use the Snowball stemmer in Java:

import org.tartarus.snowball.ext.EnglishStemmer;

EnglishStemmer english = new EnglishStemmer();
// Split the input on whitespace to get the individual words
String[] words = "bank banker banking".split("\\s+");
for (int i = 0; i < words.length; i++) {
    english.setCurrent(words[i]);   // load the word into the stemmer
    english.stem();                 // stem it in place
    System.out.println(english.getCurrent());
}
Rectangular answered 13/8, 2014 at 18:53 Comment(0)
O
0

LingPipe provides a number of tokenizers. They can be used for stemming and stop-word removal. It's a simple and effective means of stemming.

Orthicon answered 28/2, 2012 at 11:12 Comment(0)
A
0

Since PorterStemmer is not public, we can't call its stem function.

Instead, we can use KStemmer/KStemFilter to stem words to their root form.

Below is a Scala code snippet that accepts a string and returns the stemmed string:

import java.io.StringReader

import org.apache.lucene.analysis.core.WhitespaceTokenizer
import org.apache.lucene.analysis.en.KStemFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

object Stemmer {
  def stem(input: String): String = {
    val stemmed_string = new StringBuilder()
    val inputReader = new StringReader(input.toLowerCase)

    val whitespaceTokenizer = new WhitespaceTokenizer()
    whitespaceTokenizer.setReader(inputReader)

    val kStemmedTokenStream = new KStemFilter(whitespaceTokenizer)
    val charTermAttribute = kStemmedTokenStream.addAttribute(classOf[CharTermAttribute])

    kStemmedTokenStream.reset
    while (kStemmedTokenStream.incrementToken) {
      val term = charTermAttribute.toString
      stemmed_string.append(term + " ")
    }
    stemmed_string.toString().trim.toUpperCase
  }
}

Ammerman answered 26/5, 2021 at 5:3 Comment(0)
