How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled by the process for obtaining Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question.

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?

Marroquin answered 14/4, 2010 at 14:30 Comment(0)
D
120

Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}

Edit: The new way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset(); // must be called before the first incrementToken()
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
tokenStream.end();   // perform end-of-stream operations
tokenStream.close(); // release resources associated with the stream
Dip answered 14/4, 2010 at 14:37 Comment(8)
TermAttribute is now deprecated. As far as I can see, we can use something like CharTermAttributeImpl.toString() instead. - Melisent
You should use addAttribute rather than getAttribute. From the Lucene javadocs: "It is recommended to always use addAttribute(java.lang.Class) even in consumers of TokenStreams, because you cannot know if a specific TokenStream really uses a specific Attribute" lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/… - Reconcilable
@jpountz: Thanks for the tip! I have modified the answer accordingly. - Dip
I had to call reset() with Lucene 4.3, so I took the liberty of adding it. - Doucette
In the end, I don't see an answer to the question as posted: "How to get a Token from a Lucene TokenStream?" - Consort
@serhio: I added a supplementary answer that hopefully addresses your concern. - Prominent
You are missing tokenStream.end() and tokenStream.close(), which are required by the TokenStream workflow. - Seraphine
This code skips the first term. How can I print the first term? - Zampino
M
43

This is how it should be (a clean version of Adam's answer):

TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
  System.out.println(cattr.toString());
}
stream.end();
stream.close();
Mahogany answered 23/9, 2012 at 16:42 Comment(4)
Your code did not function properly until I added a stream.reset() before the while loop. I am using Lucene 4.0, so that may be a recent change. Refer to the example near the bottom of this page: lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/… - Illusory
I tried to edit the answer to add the reset() call, which avoids an NPE inside Lucene at incrementToken(), but all but one peer rejected the edit as incorrect. The Lucene docs explicitly say that "The consumer calls reset()" prior to "The consumer calls incrementToken()" in the TokenStream API. - Prominent
I also had to call reset() with Lucene 4.3, so I took the liberty of adding it. - Doucette
Maybe the question is odd, but in the end it is still not clear how to obtain the next Token (not the next string). - Consort
T
4

For Lucene 7.3.1 (the latest version at the time of writing):

    // Test the tokenizer
    Analyzer testAnalyzer = new CJKAnalyzer();
    String testText = "Test Tokenizer";
    TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
            // Use AttributeSource.reflectAsString(boolean)
            // for token stream debugging.
            System.out.println("token: " + ts.reflectAsString(true));

            System.out.println("token start offset: " + offsetAtt.startOffset());
            System.out.println("  token end offset: " + offsetAtt.endOffset());
        }
        ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }

Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html

Technetium answered 10/6, 2018 at 16:33 Comment(0)
P
1

There are two variations in the OP question:

  1. What is "the process to obtain Tokens from a TokenStream"?
  2. "Can anyone explain how to get token-like information from a TokenStream?"

Recent versions of the Lucene documentation for Token say (emphasis added):

NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.

And TokenStream says its API:

... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.

The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. According to the documentation, the Lucene developers made this change, in part, to reduce the number of individual objects created at a time.

But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?

With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question, they simply didn't show it.

It is important to note that while this approach lets you access a Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute, so if your goal is to build a collection of different Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.
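
To make the "single reused object" point concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene: MutableTerm, FakeStream and TokenSnapshot are hypothetical stand-ins invented for this illustration, with FakeStream only mimicking how incrementToken() overwrites one shared attribute object. Treat it as a picture of the pitfall and of the copy-per-token workaround, not as Lucene's own API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; none of these types come from Lucene.
class TokenCopyDemo {

    /** Hypothetical stand-in for an attribute: one mutable object reused for every token. */
    static final class MutableTerm {
        String text;
        int startOffset;
        int endOffset;
    }

    /** Immutable per-token copy that is safe to keep after the loop has finished. */
    static final class TokenSnapshot {
        final String text;
        final int startOffset;
        final int endOffset;

        TokenSnapshot(MutableTerm term) {
            this.text = term.text;
            this.startOffset = term.startOffset;
            this.endOffset = term.endOffset;
        }
    }

    /** Hypothetical whitespace tokenizer that overwrites the shared MutableTerm on each call. */
    static final class FakeStream {
        final MutableTerm term = new MutableTerm();
        private final String input;
        private int pos = 0;

        FakeStream(String input) {
            this.input = input;
        }

        boolean incrementToken() {
            // Skip whitespace before the next token.
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= input.length()) {
                return false; // no more tokens
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            // Mutate the single shared object in place.
            term.text = input.substring(start, pos);
            term.startOffset = start;
            term.endOffset = pos;
            return true;
        }
    }

    /** Pitfall: every list entry is the same object, so all entries show the last token. */
    static List<MutableTerm> collectReferences(String input) {
        FakeStream stream = new FakeStream(input);
        List<MutableTerm> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.term);
        }
        return out;
    }

    /** Workaround: copy the current state into a new object on every iteration. */
    static List<TokenSnapshot> collectSnapshots(String input) {
        FakeStream stream = new FakeStream(input);
        List<TokenSnapshot> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(new TokenSnapshot(stream.term));
        }
        return out;
    }

    public static void main(String[] args) {
        for (MutableTerm t : collectReferences("Test Tokenizer")) {
            System.out.println("reference: " + t.text); // prints "Tokenizer" both times
        }
        for (TokenSnapshot s : collectSnapshots("Test Tokenizer")) {
            System.out.println("snapshot:  " + s.text + " [" + s.startOffset + "," + s.endOffset + "]");
        }
    }
}
```

With a real TokenStream the same idea applies: inside the while loop, read the values you need from the attributes (for example the term text and the offsets) and store them in your own object, rather than storing the attribute instance itself.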

Prominent answered 18/4, 2014 at 15:6 Comment(0)
