I tried searching for this here and there but could not find a good solution, so I thought of asking the NLP experts. I am developing a text similarity application that needs to match thousands and thousands of documents (of around 1000 words each) against each other. For the NLP part, my best bet is NLTK (given its capabilities and Python's algorithm-friendliness). But now that part-of-speech tagging alone is taking so much time, I believe NLTK may not be the best fit. Java or C won't hurt me, so any solution will work.

Please note that I have already started migrating from MySQL to HBase in order to work more freely with such a large amount of data. But the question remains: how do I run the algorithms? Mahout may be a choice, but it too is geared towards machine learning and is not dedicated to NLP (though it may be good for speech recognition). What other options are available?

In short, I need high-performance NLP, a step down from high-performance machine learning. (I am somewhat inclined towards Mahout, considering future usage.)
It is about scaling NLTK.
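To make the tagging bottleneck concrete, here is a minimal sketch of the two patterns involved, assuming a recent NLTK release that provides `pos_tag_sents` (the sample text is a hypothetical stand-in for one of the ~1000-word documents; the `punkt` and `averaged_perceptron_tagger` resources must be downloaded first):

```python
import nltk

# Hypothetical stand-in for one of the ~1000-word documents.
text = "NLTK is convenient. Tagging many documents sentence by sentence can be slow."

sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]

# Pattern 1: one pos_tag() call per sentence. Each call can pay the
# tagger's model-loading overhead again.
tagged_slow = [nltk.pos_tag(sent) for sent in sentences]

# Pattern 2: batch all sentences through pos_tag_sents(), which loads
# the tagger once and reuses it across the whole batch.
tagged_fast = nltk.pos_tag_sents(sentences)
```

Even with batching like this, tagging thousands of documents in pure Python is slow, which is why I am asking about alternatives.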