How to use a Lucene Analyzer to tokenize a String?

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String?

Something like:

String to_be_parsed = "car window seven";
Analyzer analyzer = new StandardAnalyzer(...);
List<String> tokenized_string = analyzer.analyze(to_be_parsed);
Semela answered 13/6, 2011 at 18:38 Comment(2)
That's a pretty vague question you're asking. The answer is "Yes", but it depends a lot on how you want to parse/tokenize said string. – Halfbaked
@Halfbaked Added an example. I used List<String>, but it doesn't necessarily have to be a List<String>. – Semela

As far as I know, you have to write the loop yourself. Something like this (taken straight from my source tree):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class LuceneUtils {

    public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {

        List<String> result = new ArrayList<String>();
        TokenStream stream  = analyzer.tokenStream(field, new StringReader(keywords));

        try {
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(TermAttribute.class).term());
            }
        }
        catch (IOException e) {
            // not thrown, because we're reading from a StringReader
        }

        return result;
    }
}
Halfbaked answered 13/6, 2011 at 19:11 Comment(1)
Just one more note: as of Lucene 3.2, TermAttribute is deprecated in favor of CharTermAttribute. – Semela
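
Following up on that comment: under Lucene 3.2+ the only change needed is in the loop body, swapping the deprecated TermAttribute for CharTermAttribute (a sketch of the substitution; assumes org.apache.lucene.analysis.tokenattributes.CharTermAttribute is imported instead):

```java
// Lucene 3.2+: CharTermAttribute replaces the deprecated TermAttribute,
// and toString() replaces term()
while (stream.incrementToken()) {
    result.add(stream.getAttribute(CharTermAttribute.class).toString());
}
```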

Based on the answer above, here is a slightly modified version that works with Lucene 4.0.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LuceneUtil {

  private LuceneUtil() {}

  public static List<String> tokenizeString(Analyzer analyzer, String string) {
    List<String> result = new ArrayList<String>();
    try {
      TokenStream stream  = analyzer.tokenStream(null, new StringReader(string));
      stream.reset();
      while (stream.incrementToken()) {
        result.add(stream.getAttribute(CharTermAttribute.class).toString());
      }
    } catch (IOException e) {
      // should never happen when reading from a StringReader
      throw new RuntimeException(e);
    }
    return result;
  }

}
Chrysolite answered 5/3, 2012 at 7:3 Comment(3)
In Lucene 4.1 you also need to add stream.reset() before the while statement. – Conjoined
You may want to add stream.end(); stream.close(); after the while loop. – Coagulant
Note: the above works perfectly in Lucene 7.0.1. Just add sugar with try-with-resources on the TokenStream. – Exterminatory

The latest best practice, as another Stack Overflow answer indicates, seems to be to add an attribute to the token stream and later read that attribute, rather than getting an attribute directly from the token stream. And for good measure you can make sure the analyzer gets closed. Using the very latest Lucene (currently v8.6.2), the code would look like this:

String text = "foo bar";
String fieldName = "myField";
List<String> tokens = new ArrayList<>();
try (Analyzer analyzer = new StandardAnalyzer()) {
  try (final TokenStream tokenStream = analyzer.tokenStream(fieldName, text)) {
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while(tokenStream.incrementToken()) {
      tokens.add(charTermAttribute.toString());
    }
    tokenStream.end();
  }
}

After that code is finished, tokens will contain a list of parsed tokens.

See also: Lucene Analysis Overview.

Caveat: I'm just starting to write Lucene code, so I don't have a lot of Lucene experience. I have taken the time to research the latest documentation and related posts, however, and I believe that the code I've placed here follows the latest recommended practices slightly better than the current answers.

Huoh answered 20/9, 2020 at 19:5 Comment(0)

Even better, use try-with-resources! That way you don't have to explicitly call .close(), which newer versions of the library require.

public static List<String> tokenizeString(Analyzer analyzer, String string) {
  List<String> tokens = new ArrayList<>();
  try (TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(string))) {
    tokenStream.reset();  // required as of Lucene 4.1
    while (tokenStream.incrementToken()) {
      tokens.add(tokenStream.getAttribute(CharTermAttribute.class).toString());
    }
  } catch (IOException e) {
    throw new RuntimeException(e);  // shouldn't happen when reading from a String
  }
  return tokens;
}

And the Tokenizer version:

  List<String> tokens = new ArrayList<>();
  try (Tokenizer tokenizer = new HMMChineseTokenizer()) {
    tokenizer.setReader(new StringReader("我说汉语说得很好"));
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      tokens.add(tokenizer.getAttribute(CharTermAttribute.class).toString());
    }
  } catch (IOException e) {
    throw new RuntimeException(e);  // shouldn't happen when reading from a String
  }
Cayla answered 7/2, 2019 at 17:40 Comment(0)
