How can I improve the performance of NLTK? Alternatives?
I tried searching here and there but could not find a good solution, so I thought of asking the NLP experts. I am developing a text-similarity application for which I need to match thousands and thousands of documents (of around 1000 words each) against each other. For the NLP part, my best bet was NLTK, given its capabilities and the algorithm-friendliness of Python. But now that part-of-speech tagging alone takes so much time, I believe NLTK may not be the best fit. Java or C won't hurt me, so any solution will work for me.

Please note that I have already started migrating from MySQL to HBase in order to work with more freedom on such a large amount of data. But the question still stands: how do I run the algorithms? Mahout may be a choice, but it too is geared towards machine learning, not dedicated NLP (it may be good for speech recognition). What other options are available? In short, I need high-performance NLP (a step down from high-performance machine learning). (I am inclined a bit towards Mahout, with future usage in mind.)

It is about scaling NLTK.

Tabbatha answered 3/4, 2013 at 8:57 Comment(1)
NLTK is very slow; it's mostly useful for prototyping. Consider Gensim, which is much more scalable. – Gilligan
You can use Mahout to find which documents are most related to one another.

Here is a quick tutorial (link) that will teach you some of the concepts, but they are best explained in chapter 8 of the Mahout in Action book.

Basically, you first need to represent your data in Hadoop SequenceFile format, for which you can use the seqdirectory command. This might prove too slow, though, given that it wants each document as its own file (so with "thousands and thousands of documents", I/O will suffer). This post is related in that it talks about how to make a SequenceFile from a CSV file where each line is a document. Although, if I'm not mistaken, Mahout's trunk may have some functionality for this; you might want to ask on the Mahout user mailing list.
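A rough sketch of that first step, assuming a directory of plain-text files (the paths are placeholders; check mahout seqdirectory --help for the exact flags in your Mahout version):

    # Convert a directory of plain-text documents (one file per document)
    # into Hadoop SequenceFile format.
    mahout seqdirectory -i /path/to/text-docs -o /path/to/seqfiles -c UTF-8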

Then, once your documents are in Hadoop SequenceFile format, you need to apply the seq2sparse command to turn them into vectors. The full list of command-line options is in chapter 8 of the book, but you can also run the command with no arguments and it will print a help prompt listing them. One option you will need is -a, the class name of the Lucene text analyzer you want to use; this is where stop-word removal, stemming, punctuation stripping, etc. happen. The default analyzer is org.apache.lucene.analysis.standard.StandardAnalyzer.
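For instance, a sketch of the vectorization step (again, paths are placeholders and the flags may differ between Mahout versions, so verify against the help output):

    # Build sparse TF-IDF vectors from the SequenceFiles.
    # -a picks the Lucene analyzer, -wt the weighting scheme,
    # -nv keeps document names attached to the vectors.
    mahout seq2sparse \
      -i /path/to/seqfiles \
      -o /path/to/vectors \
      -a org.apache.lucene.analysis.standard.StandardAnalyzer \
      -wt tfidf \
      -nv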

Then you represent your data as a matrix with the rowid command.
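Something along these lines (the tfidf-vectors subdirectory is what seq2sparse typically produces; verify the path on your setup):

    # Replace the vector keys with integer row ids, producing a matrix
    # plus a docIndex that maps row ids back to document names.
    mahout rowid -i /path/to/vectors/tfidf-vectors -o /path/to/matrix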

After that, you use the rowsimilarity command to get the most similar documents for each row.
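For example (SIMILARITY_COSINE is one of the built-in similarity measures; --numberOfColumns must match the vector dimensionality from the previous steps, so 50000 here is only a placeholder, and it's worth double-checking these flags against the command's help output):

    # Compute pairwise document similarities over the matrix rows,
    # keeping the 10 most similar documents per row.
    mahout rowsimilarity \
      -i /path/to/matrix/matrix \
      -o /path/to/similarities \
      --similarityClassname SIMILARITY_COSINE \
      --numberOfColumns 50000 \
      -m 10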

Hope this helps.

Chloechloette answered 3/4, 2013 at 9:42 Comment(3)
Thanks @Julian, I am definitely going to try it. I was wondering if we could combine the NLP power of dedicated libraries like OpenNLP/NLTK with the scalability of Mahout? For example, if in the future I need to analyze all the POS tags in sentences, I may want to use NLTK, but if the data set is large, Mahout is the obvious choice. Here is the confusion. – Tabbatha
I don't know about those libraries. But on a side note, if you need speed, you might want to look at something like minhashing. – Chloechloette
Thanks, I'm looking at it. One important thing that may solve the problem for various people: I have developed a clustering algorithm in NLTK, and I have retrieved various POS tags and other pretty stuff with it. Now comes the scalability. How can we achieve scalability with NLTK? Is it Mahout, or something else? That is the real question. – Tabbatha
