Tokenizing Twitter Posts in Lucene
My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?

My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters by other alphanumeric strings. For example,

String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");

Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?
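One way to avoid breaking email addresses is to only rewrite a "#" or "@" when it starts a token. This is a sketch of that variant, not the poster's actual code; the class and method names are made up for illustration:

```java
// Sketch: rewrite "#" and "@" only at the start of a whitespace-separated
// token, so @user and #tag are protected but foo@bar.com is left alone.
public class TweetPreprocessor {
    public static String preprocess(String text) {
        // "(^|\\s)" anchors the match to the start of the string or a token,
        // and "$1" puts the captured whitespace (if any) back.
        return text
            .replaceAll("(^|\\s)#", "$1hashtag")
            .replaceAll("(^|\\s)@", "$1addresstag");
    }

    public static void main(String[] args) {
        System.out.println(
            preprocess("#lucene rocks, ask @user or mail foo@bar.com"));
        // hashtaglucene rocks, ask addresstaguser or mail foo@bar.com
    }
}
```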

Thanks in advance!

Amaç

Anemometer answered 31/3, 2010 at 17:26 Comment(2)
What does your final solution look like? – Interval
If you need a solution for Solr, this could help: issues.apache.org/jira/browse/SOLR-2059, with mappings like "# => ALPHA" and "@ => ALPHA". – Interval
The StandardTokenizer and StandardAnalyzer basically pass your tokens through a StandardFilter (which strips things like the trailing 's from words), followed by a LowerCaseFilter (to lowercase your tokens) and finally a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

What you could easily do to get started is implement your own analyzer that performs the same steps as the StandardAnalyzer but uses a WhitespaceTokenizer as the first component that processes the input stream.
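Outside of Lucene, the two-stage idea can be sketched in plain Java: split on whitespace first, keep any token starting with "@" or "#" intact, and hand everything else to a finer-grained pass. The class name and the crude second pass are illustrative stand-ins; a real implementation would subclass Lucene's Analyzer and chain a WhitespaceTokenizer with filters:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-stage approach: whitespace-split first, then a finer
// pass for ordinary tokens. In Lucene proper this would be an Analyzer
// whose token stream starts with a WhitespaceTokenizer.
public class TwitterAwareTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) continue;
            if (raw.startsWith("@") || raw.startsWith("#")) {
                // Keep @user / #tag intact, minus trailing punctuation.
                tokens.add(raw.replaceAll("[.,!?;:]+$", ""));
            } else {
                // Crude stand-in for StandardTokenizer + LowerCaseFilter:
                // strip most punctuation, lowercase, re-split.
                String cleaned = raw
                        .replaceAll("[^\\p{L}\\p{N}@.']+", " ")
                        .trim()
                        .toLowerCase();
                for (String t : cleaned.split("\\s+"))
                    if (!t.isEmpty()) tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For example, tokenize("Hello @User, see #Lucene!") yields [hello, @User, see, #Lucene].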

For more details on the inner workings of the analyzers, you can have a look over here.

Badminton answered 1/4, 2010 at 6:19 Comment(3)
Thanks. I already tried implementing my own Analyzer using WhitespaceTokenizer instead of StandardTokenizer. But that leaves host names, email addresses, and some other things unrecognized and tokenized erroneously. I would like to process a stream with my custom TwitterTokenizer (which handles @s and #s and does nothing else), then feed the resulting stream into a StandardTokenizer and go on from there. However, as far as I understand, an Analyzer can have only one Tokenizer at the beginning of the chain.Anemometer
Another approach could be to use PerFieldAnalyzerWrapper and make a second pass through the content to explicitly look for hashtags and user references and put them in a separate field of your document (e.g. 'tags' and 'replies'). The analyzers for those fields then only return tokens for occurrences of #tag and @user respectively.Badminton
The link is broken. You can now view the analyzers here.Avarice
It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter

This tokenizer will recognize Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):

<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
     <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
     <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
</fieldType>
Minnieminnnie answered 6/4, 2019 at 15:50 Comment(0)
There's a Twitter-specific tokenizer here: https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java

Cougar answered 23/11, 2013 at 2:25 Comment(0)
A tutorial on a Twitter-specific tokenizer, a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php. This API is capable of identifying emoticons, hashtags, interjections, etc. present in a tweet.

Joappa answered 1/3, 2014 at 16:21 Comment(0)
The Twitter API can be told to return all Tweets, bios, etc. with the "entities" (hashtags, user IDs, URLs, etc.) already parsed out of the content into collections.

https://dev.twitter.com/docs/entities

So aren't you just looking for a way to re-do something that the folks at Twitter have already done for you?
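For illustration, a trimmed v1.1-style payload with the entities already extracted might look like this (field names follow the Twitter entities docs; the text and index values here are made up):

```json
{
  "text": "Indexing with #lucene, thanks @user!",
  "entities": {
    "hashtags":      [ { "text": "lucene", "indices": [14, 21] } ],
    "user_mentions": [ { "screen_name": "user", "indices": [30, 35] } ],
    "urls":          []
  }
}
```

With this, the hashtags and mentions could go straight into separate index fields without any custom tokenization.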

Precautionary answered 12/3, 2014 at 8:35 Comment(0)
Twitter has open-sourced its text-processing library, which implements token handlers for hashtags and the like.

For example, the HashtagExtractor: https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java

It is based on Lucene's TokenStream.

Skippy answered 15/12, 2017 at 7:49 Comment(0)
