My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?
My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters by other alphanumeric strings. For example,
String newText = newText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");
Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?
Thanks in advance!
Amaç