There are two variations in the OP question:
1. What is "the process to obtain Tokens from a TokenStream"?
2. "Can anyone explain how to get token-like information from a TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis added):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And TokenStream says its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. According to the documentation, the Lucene developers made this change in part to reduce the number of individual objects created at a time.
But as some people have pointed out in the comments on those answers, they don't directly answer #1: how do you get a Token if you really want or need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute, just as the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.
It is important to note that while this approach lets you access a Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute. So if your goal is to build a collection of different Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy on each iteration.
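To make the reuse pitfall concrete, here is a minimal sketch that models the behaviour with plain Java and no Lucene dependency. MutableToken, WhitespaceStream and TokenReuseDemo are hypothetical stand-ins invented for this illustration, not Lucene classes; the only thing they share with the real API is the pattern of one shared object being overwritten by each incrementToken() call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reusable token object; NOT a Lucene class.
final class MutableToken {
    String term = "";
    int startOffset;
    int endOffset;

    MutableToken() {}

    // Copy constructor: takes a snapshot of the other token's current state.
    MutableToken(MutableToken other) {
        this.term = other.term;
        this.startOffset = other.startOffset;
        this.endOffset = other.endOffset;
    }
}

// Hypothetical stand-in for a stream that overwrites one shared token per step.
final class WhitespaceStream {
    private final String text;
    private int pos = 0;
    final MutableToken token = new MutableToken(); // the single shared instance

    WhitespaceStream(String text) {
        this.text = text;
    }

    // Advances to the next whitespace-separated word, mutating the shared token.
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            return false; // end of input; the shared token is left untouched
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        token.term = text.substring(start, pos);
        token.startOffset = start;
        token.endOffset = pos;
        return true;
    }
}

final class TokenReuseDemo {
    // Stores the shared reference: every list entry is the same object.
    static List<String> collectByReference(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        List<MutableToken> tokens = new ArrayList<>();
        while (stream.incrementToken()) {
            tokens.add(stream.token);
        }
        return terms(tokens);
    }

    // Stores a fresh copy per iteration: every list entry keeps its own state.
    static List<String> collectByCopy(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        List<MutableToken> tokens = new ArrayList<>();
        while (stream.incrementToken()) {
            tokens.add(new MutableToken(stream.token));
        }
        return terms(tokens);
    }

    private static List<String> terms(List<MutableToken> tokens) {
        List<String> out = new ArrayList<>();
        for (MutableToken t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference("quick brown fox")); // [fox, fox, fox]
        System.out.println(collectByCopy("quick brown fox"));      // [quick, brown, fox]
    }
}
```

Collecting by reference leaves you with three entries that all show the last token, which is the symptom to expect if you store the object returned from addAttribute directly. In real Lucene code the copy step would be something like cloning the token or copying the attribute values you need into your own objects; which call is appropriate depends on your Lucene version, so check its Javadoc.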